git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: 胡哲宁 <adlternative@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>, Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v3] ls-files.c: add --dedup option
Date: Sun, 17 Jan 2021 11:45:25 +0800	[thread overview]
Message-ID: <CAOLTT8RQWm-tpkj1aO1rPTCJApP5i+niQtNm_zMycSo5YT0B_w@mail.gmail.com> (raw)
In-Reply-To: <xmqqczy7vwub.fsf@gitster.c.googlers.com>

Junio, thank you for your patience to review
my patch and guide me how to modify it.

Junio C Hamano <gitster@pobox.com> 于2021年1月15日周五 上午8:59写道:
>
> "阿德烈 via GitGitGadget"  <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word.  Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
Ok i have modified "--dedup" to "--deduplicate".
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
My poor English grammar :-)
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK.  I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
Alright, I understand!
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message.  The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > +     Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
Well, maybe I'm not good enough with these proper nouns.
> > +     or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information.  In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
You are right, "--deduplicate" should only work on duplicate file names,
so "ls-files -t" also needs to be corrected.
Well,This is true a bug I haven't notice.
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one.  You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops.  For example, what happens if you do this?
>
>     $ git reset --hard
>     $ echo >>builtin/ls-files.c
>     $ git ls-files -c -m builtin/ls-files.c
>     $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
Yeah,This is because I may have missed the -c option with other options
at the same time.
Here I may disagree with your point of view:
                 if (errno != E_NOENT)
                         error_errno("cannot lstat '%s'", fullname.buf);
With this sentence included, the patch will fail the test:
t/t3010-ls-files-killed-modified.sh.
the errno maybe ENOTDIR when you try to lstat a file`r` with `lstat("r/f",&st);`
So I temporarily removed the judgment of errno.
>  #2: consolidate two for loops into one.
>
>      The two loops have slightly different condition to skip a ce,
>      and different logic on what tag each path is shown with.  When
>      --cached and --modified or --deleted are asked for at the same
>      time, we'd show them multiple times (this is done inside the
>      loop for each ce)
>
>         if (show_cached || show_stage)
>                 show_ce(... ce_stage(ce) ? tag_unmerged : ...);
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 ... code that corresponds to the
>                 ... illustration in #1 above come here.
>         } else if (...)
>                 show_ce(..., tag_modified);
>
>      This changes the semantics.  The original iterates the index
>      twice, so you may see the same entry from --cached once and
>      then again from --modified.  The updated one still will show
>      the same entry twice but next to each other.
>
Well,This does change the semantics. I think people who used two
for loops before may want to separate different outputs.
Now, if you don’t use "--deduplicate", You may see six consecutive
items under a combination of multiple options.
>  #3: optionally deduplicate.
>
>      Once we have a single loop, deduplicationg based on names is
>      trivial, as we seen before.
>
>
Indeed so.
> Hmm?
THANKS.

Junio C Hamano <gitster@pobox.com> 于2021年1月15日周五 上午8:59写道:
>
> "阿德烈 via GitGitGadget"  <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word.  Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK.  I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message.  The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > +     Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
> > +     or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information.  In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
> Has that been changed from the previous rounds?
>
> > diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> > index c8eae899b82..bc4eded19ab 100644
> > --- a/builtin/ls-files.c
> > +++ b/builtin/ls-files.c
> > @@ -316,6 +318,20 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> >               for (i = 0; i < repo->index->cache_nr; i++) {
> >                       const struct cache_entry *ce = repo->index->cache[i];
> >
> > +                     if (show_cached && delete_dup) {
> > +                             switch (ce_stage(ce)) {
> > +                             case 0:
> > +                             default:
> > +                                     break;
>
> This part looks somewhat strange for two reasons:
>
>  - The code enumerates ALL the possible stage numbers from 0 to 3;
>    if we were to have "default", I'd expect it would be a separate
>    switch arm from the possible values that calls out an programming
>    error, e.g. BUG("at stage #%d???", ce_stage(ce)).  Simply removing
>    the "default" arm would be another way out of this strangeness.
>
>  - When we see a stage #0 entry, we know we will not have higher
>    stage entries with the same name.  Not clearing last_stage here
>    feels wrong, as the primary reason why last_stage variable is
>    used is to remember the last ce that was shown, so that other
>    entries with the same name can be skipped.
>
> By the way, "last_shown_ce" may be a much better name for the
> variable, as you do not really care what stage number the ce you
> showed last was at (you care about its name).
>
> Also, I do not see a good reason why the last_shown_ce variable
> should have lifetime longer than the block that contains this for()
> loop (and the other for loop for deleted/modified codepath we see
> later).  Especially since you initialize the variable that you made
> visible to the entire function to NULL before entering the first for
> loop, but you do not set it back to NULL before entering the second
> for loop, it is inviting a subtle bug.  You may have been given
> show_cached and show_modified at the same time, so you enter the
> first loop and have shown the first stage of the last conflicted
> path, whose cache entry is left in the last_stage variable.  Since
> the variable has longer lifespan than necessary, when the second
> loop is entered, it still points at the cache entry for the highest
> stage of the last conflicted path.  That is because the code forgets
> to clear it to NULL before entering the second for loop.
>
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one.  You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops.  For example, what happens if you do this?
>
>     $ git reset --hard
>     $ echo >>builtin/ls-files.c
>     $ git ls-files -c -m builtin/ls-files.c
>     $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
> Once we successfully merged two loops into one, the part that shows
> tracked paths in the function would have only one loop, and it would
> become a lot cleaner to add the logic to "skip showing the ce if it
> has the same name as the previously shown one, only when doing so
> won't lose information", by doing something like this:
>
>         static void show_files(....)
>         {
>                 /* show_others || show_killed done here */
>                 ...
>
>                 /* leave early if not showing anything */
>                 if (! (show_cached || show_stage || show_deleted || show_modified))
>                         return;
>
>                 last_shown_ce = NULL;
>                 for (i = 0; i < repo->index->cache_nr; i++) {
>                         const struct cache_entry *ce = repo->index->cache[i];
>
>                         if (skipping_duplicates && last_shown_ce)
>                                 if (!strcmp(ce->name, last_shown_ce->name))
>                                         continue;
>
>                         construct_fullname();
>
>                         /* various reasons to skip the entry tested */
>                         if (showing ignored directory and ce is excluded)
>                                 continue;
>                         if (show_unmerged && !ce_stage(ce))
>                                 continue;
>                         if (ce->ce_flags & CE_UPDATE)
>                                 continue;
>                         ... other reasons may appear here ...
>
>                         /* now we are committed to show it */
>                         last_shown_ce = ce;
>
>                         ... various different ways to show ce come here ...
>                         show_ce(...);
>                 }
>         }
>
> where "skipping_duplicates" would be set when "--deduplicate" is
> asked and we are not showing information other than the pathname
> via various options e.g. the tags (-t) or stages (-s/-u).
>
> > +                     if (delete_dup && show_deleted && show_modified && err)
> >                               show_ce(repo, dir, ce, fullname.buf, tag_removed);
>
> I actually think the original code that is still shown here ...
>
> > +                     else {
> > +                             if (show_deleted && err)
> > +                                     show_ce(repo, dir, ce, fullname.buf, tag_removed);
> > +                             if (show_modified && ie_modified(repo->index, ce, &st, 0))
>
> ... about modified file is buggy.  If lstat() failed, then &st has
> no useful information, so it is wrong to feed it to ie_modified().
>
> Perhaps a three-patch series that is structured like this may be in
> order?
>
>  #1: bugfix for --deleted and --modified.
>
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 if (errno != E_NOENT)
>                         error_errno("cannot lstat '%s'", fullname.buf);
>                 else {
>                         if (show_deleted)
>                                 show_ce(..., tag_removed);
>                         if (show_modified)
>                                 show_ce(..., tag_modified);
>                 }
>         } else if (show_modified && ie_modified(...))
>                 show_ce(..., tag_modified);
>
>      This hopefully should not change the semantics.  If you ask
>      --deleted and --modified, a deleted path would be listed twice.
>
>  #2: consolidate two for loops into one.
>
>      The two loops have slightly different condition to skip a ce,
>      and different logic on what tag each path is shown with.  When
>      --cached and --modified or --deleted are asked for at the same
>      time, we'd show them multiple times (this is done inside the
>      loop for each ce)
>
>         if (show_cached || show_stage)
>                 show_ce(... ce_stage(ce) ? tag_unmerged : ...);
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 ... code that corresponds to the
>                 ... illustration in #1 above come here.
>         } else if (...)
>                 show_ce(..., tag_modified);
>
>      This changes the semantics.  The original iterates the index
>      twice, so you may see the same entry from --cached once and
>      then again from --modified.  The updated one still will show
>      the same entry twice but next to each other.
>
>  #3: optionally deduplicate.
>
>      Once we have a single loop, deduplicationg based on names is
>      trivial, as we seen before.
>
>
> Hmm?

  reply	other threads:[~2021-01-17  3:45 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-06  8:53 [PATCH] builtin/ls-files.c:add git ls-file --dedup option 阿德烈 via GitGitGadget
2021-01-07  6:10 ` Eric Sunshine
2021-01-07  6:40   ` Junio C Hamano
2021-01-08 14:36 ` [PATCH v2 0/2] " 阿德烈 via GitGitGadget
2021-01-08 14:36   ` [PATCH v2 1/2] " ZheNing Hu via GitGitGadget
2021-01-08 14:36   ` [PATCH v2 2/2] builtin:ls-files.c:add " ZheNing Hu via GitGitGadget
2021-01-14  6:38     ` Eric Sunshine
2021-01-14  8:17       ` 胡哲宁
2021-01-14 12:22   ` [PATCH v3] ls-files.c: add " 阿德烈 via GitGitGadget
2021-01-15  0:59     ` Junio C Hamano
2021-01-17  3:45       ` 胡哲宁 [this message]
2021-01-17  4:37         ` Junio C Hamano
2021-01-16  7:13     ` Eric Sunshine
2021-01-17  3:49       ` 胡哲宁
2021-01-17  5:11         ` Eric Sunshine
2021-01-17 23:04           ` Junio C Hamano
2021-01-18 14:59             ` Eric Sunshine
2021-01-17  4:02     ` [PATCH v4 0/3] builtin/ls-files.c:add git ls-file " 阿德烈 via GitGitGadget
2021-01-17  4:02       ` [PATCH v4 1/3] ls_files.c: bugfix for --deleted and --modified ZheNing Hu via GitGitGadget
2021-01-17  6:22         ` Junio C Hamano
2021-01-17  4:02       ` [PATCH v4 2/3] ls_files.c: consolidate two for loops into one ZheNing Hu via GitGitGadget
2021-01-17  4:02       ` [PATCH v4 3/3] ls-files: add --deduplicate option ZheNing Hu via GitGitGadget
2021-01-17  6:25         ` Junio C Hamano
2021-01-17 23:34         ` Junio C Hamano
2021-01-18  4:09           ` 胡哲宁
2021-01-18  6:05             ` 胡哲宁
2021-01-18 21:31               ` Junio C Hamano
2021-01-19  2:56                 ` 胡哲宁
2021-01-19  6:30       ` [PATCH v5 0/3] builtin/ls-files.c:add git ls-file --dedup option 阿德烈 via GitGitGadget
2021-01-19  6:30         ` [PATCH v5 1/3] ls_files.c: bugfix for --deleted and --modified ZheNing Hu via GitGitGadget
2021-01-20 20:26           ` Junio C Hamano
2021-01-21 10:02             ` 胡哲宁
2021-01-19  6:30         ` [PATCH v5 2/3] ls_files.c: consolidate two for loops into one ZheNing Hu via GitGitGadget
2021-01-20 20:27           ` Junio C Hamano
2021-01-21 11:05             ` 胡哲宁
2021-01-19  6:30         ` [PATCH v5 3/3] ls-files.c: add --deduplicate option ZheNing Hu via GitGitGadget
2021-01-20 21:26           ` Junio C Hamano
2021-01-21 11:00             ` 胡哲宁
2021-01-21 20:45               ` Junio C Hamano
2021-01-22  9:50                 ` 胡哲宁
2021-01-22 16:04                   ` Johannes Schindelin
2021-01-22 18:02                     ` Junio C Hamano
2021-03-19 13:54                       ` GitGitGadget and `next`, was " Johannes Schindelin
2021-03-19 18:11                         ` Junio C Hamano
2021-01-23  8:20                     ` 胡哲宁
2021-01-22 15:46               ` [PATCH v6] " ZheNing Hu
2021-01-22 20:52                 ` Junio C Hamano
2021-01-23  8:27                   ` 胡哲宁
2021-01-23 10:20         ` [PATCH v6 0/3] builtin/ls-files.c:add git ls-file --dedup option 阿德烈 via GitGitGadget
2021-01-23 10:20           ` [PATCH v6 1/3] ls_files.c: bugfix for --deleted and --modified ZheNing Hu via GitGitGadget
2021-01-23 17:55             ` Junio C Hamano
2021-01-23 10:20           ` [PATCH v6 2/3] ls_files.c: consolidate two for loops into one ZheNing Hu via GitGitGadget
2021-01-23 19:50             ` Junio C Hamano
2021-01-23 10:20           ` [PATCH v6 3/3] ls-files.c: add --deduplicate option ZheNing Hu via GitGitGadget
2021-01-23 19:51             ` Junio C Hamano
2021-01-23 19:53           ` [PATCH v7 1/3] ls_files.c: bugfix for --deleted and --modified Junio C Hamano
2021-01-23 19:53             ` [PATCH v7 2/3] ls_files.c: consolidate two for loops into one Junio C Hamano
2021-01-23 19:53             ` [PATCH v7 3/3] ls-files.c: add --deduplicate option Junio C Hamano
2021-01-24 10:54           ` [PATCH v7 0/3] builtin/ls-files.c:add git ls-file --dedup option 阿德烈 via GitGitGadget
2021-01-24 10:54             ` [PATCH v7 1/3] ls_files.c: bugfix for --deleted and --modified ZheNing Hu via GitGitGadget
2021-01-24 22:04               ` Junio C Hamano
2021-01-25  6:05                 ` 胡哲宁
2021-01-25 19:05                   ` Junio C Hamano
2021-01-24 10:54             ` [PATCH v7 2/3] ls_files.c: consolidate two for loops into one ZheNing Hu via GitGitGadget
2021-01-24 10:54             ` [PATCH v7 3/3] ls-files.c: add --deduplicate option ZheNing Hu via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAOLTT8RQWm-tpkj1aO1rPTCJApP5i+niQtNm_zMycSo5YT0B_w@mail.gmail.com \
    --to=adlternative@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).