Feature Request: Option to make "git rev-list --objects" output duplicate objects

All of lore.kernel.org
 help / color / mirror / Atom feed

* Feature Request: Option to make "git rev-list --objects" output duplicate objects
@ 2023-03-24 15:51 Baumann, Moritz
  2023-03-24 16:50 ` Junio C Hamano
  2023-03-24 19:28 ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Baumann, Moritz @ 2023-03-24 15:51 UTC (permalink / raw)
  To: git

Good afternoon,

While rewriting a slow pre-receive hook to (hopefully) speed it up, I ran into
the following problem:

We want to perform a bunch of checks on all newly pushed files, some of which
are based on the file contents and some of which are based on the file name.
I started with…

  git rev-list --pretty=format:'' --no-commit-header \
  --objects --filter=object:type=blob "$new_head" --not --all

…and then used the resulting list for all subsequent checks. After writing some
unit tests, I noticed that the returned list is not sufficient: If you generate
the exact same file twice, once with a "bad" name and once with a "good" name,
you will only see one of those names and therefore the hook will mistakenly
allow the push.

So, what I would want/need is an option that forces "git rev-list --objects"
to output the object multiple times if it has multiple names in the commit
range. Admittedly, such an option would likely only be useful for hooks that
validate file names.

Would it be feasible to implement such an option? If so, does it sound like a
good or bad idea?

Is there any alternative for my use case that doesn't involve walking the
commits one-by-one? (That's what we previously did and what turned out to be
quite slow on our repository.)

Best regards,
Moritz Baumann

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-24 15:51 Feature Request: Option to make "git rev-list --objects" output duplicate objects Baumann, Moritz
@ 2023-03-24 16:50 ` Junio C Hamano
  2023-03-27  7:02   ` Baumann, Moritz
  2023-03-24 19:28 ` Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2023-03-24 16:50 UTC (permalink / raw)
  To: Baumann, Moritz; +Cc: git

"Baumann, Moritz" <moritz.baumann@sap.com> writes:

> Is there any alternative for my use case that doesn't involve walking the
> commits one-by-one?

I do not think there currently is such an option, and not showing
the same object twice is pretty much fundamental in the operation of
the command, so it is unclear what the new feature should look like.

Should it also show the same commit it encounters during its history
walk (remember, a history can have forks and merges in it) in
duplicates if it encounters it more than once?  Should it show all
the subtrees and blobs in the tree of each commit, even if most of
them do not change from commit to commit?  How does the user control
which ones are duplicated and which ones are to be visited only
once?  How does the command line option used by the user to tell the
command to give such a choice look like?

Once there things are designed well, implementation of such a mode
would not be too hard.

By the way, "rev-list" internally uses walking the commits
one-by-one, anyway ;-).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-24 15:51 Feature Request: Option to make "git rev-list --objects" output duplicate objects Baumann, Moritz
  2023-03-24 16:50 ` Junio C Hamano
@ 2023-03-24 19:28 ` Jeff King
  2023-03-28  8:08   ` Baumann, Moritz
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2023-03-24 19:28 UTC (permalink / raw)
  To: Baumann, Moritz; +Cc: git

On Fri, Mar 24, 2023 at 03:51:21PM +0000, Baumann, Moritz wrote:

> …and then used the resulting list for all subsequent checks. After writing some
> unit tests, I noticed that the returned list is not sufficient: If you generate
> the exact same file twice, once with a "bad" name and once with a "good" name,
> you will only see one of those names and therefore the hook will mistakenly
> allow the push.
> 
> So, what I would want/need is an option that forces "git rev-list --objects"
> to output the object multiple times if it has multiple names in the commit
> range. Admittedly, such an option would likely only be useful for hooks that
> validate file names.

Another problem you might not have run into yet: the names given by
rev-list are not quoted in any way, and will just omit newlines. So if
your hook is trying to avoid malicious garbage like "foo\nbar", it won't
work.

Those names are really just intended as hints for pack-objects. I
suspect the documentation could be more clear about these limitations.

> Would it be feasible to implement such an option? If so, does it sound like a
> good or bad idea?
> 
> Is there any alternative for my use case that doesn't involve walking the
> commits one-by-one? (That's what we previously did and what turned out to be
> quite slow on our repository.)

I'm not sure what you mean by "one by one", since that is inherently
what rev-list is doing under the hood. If you mean "running a separate
process for each commit", then yes, that will be slow. But if you want
to know all of the names touched in a set of commits, I have used
something like this before:

  git rev-list $new --not --all |
  git diff-tree --stdin --format= -r -c --name-only

A few notes:

  - the names may be quoted if they have metacharacters; use "-z" if
    your reading side can handle it to make things simpler

  - merges are always tricky. I think "-c" will give you what you want
    (showing names which differed from any parent), but I didn't think
    too hard.  Using "-m" definitely would work, but may produce extra
    names (ones where the merge just brought together two lines of
    history, even though the commit where one of those lines touched the
    file may have been excluded via "--not --all").

  - if you are assuming the existing names are good, then probably
    --diff-filter=A would be useful, as it would show only
    newly-introduced names.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-24 16:50 ` Junio C Hamano
@ 2023-03-27  7:02   ` Baumann, Moritz
  2023-03-27 16:07     ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Baumann, Moritz @ 2023-03-27  7:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

> I do not think there currently is such an option, and not showing
> the same object twice is pretty much fundamental in the operation of
> the command, so it is unclear what the new feature should look like.
>
> Should it also show the same commit it encounters during its history
> walk (remember, a history can have forks and merges in it) in
> duplicates if it encounters it more than once?  Should it show all
> the subtrees and blobs in the tree of each commit, even if most of
> them do not change from commit to commit?  How does the user control
> which ones are duplicated and which ones are to be visited only
> once?  How does the command line option used by the user to tell the
> command to give such a choice look like?

I was a bit brief in my description on Friday evening, due to the day & time.
My naïve idea would be as follows:

The option would be called something like "--include-all-object-names" and
belong to the category of options that only make sense in combination with
"--objects". That name (hopefully) already explains the intended behavior:

 * commits are not duplicated
 * as before, only changed blobs / subtrees are shown, however:
 * blobs / subtrees are duplicated in the output if they were previously shown
   with a different name

> By the way, "rev-list" internally uses walking the commits
> one-by-one, anyway ;-).

I really should not write emails on Friday evening… What I meant was that I'd
rather use a single call to git than one per commit.

(The previous implementation of my hook then also queried all blobs for each
commit, not just the changed ones, which contributed to the bad performance.)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-27  7:02   ` Baumann, Moritz
@ 2023-03-27 16:07     ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2023-03-27 16:07 UTC (permalink / raw)
  To: Baumann, Moritz; +Cc: git

"Baumann, Moritz" <moritz.baumann@sap.com> writes:

> The option would be called something like "--include-all-object-names" and
> belong to the category of options that only make sense in combination with
> "--objects". That name (hopefully) already explains the intended behavior:
>
>  * commits are not duplicated

Didn't the sentence above just say "all object names"?  Why not commits?

>  * as before, only changed blobs / subtrees are shown, however:

We are not showing "changed" things at all.  If you want that, you'd
need "git rev-list | git diff-tree --stdin" instead.

The difference matters if you are aiming for automation (i.e. not
casual browsing).  If you had a blob A, changed it in a commit to B,
and then put the original one A back in another commit, you should
see (in reverse history traversal) A and then B and then A again if
we were showing "changed" things.  But rev-list is not about showing
changes.  We show A in the latest commit, B in the previous commit,
and that is the end.  Original needs not to be shown, because rev-list
is enumerating the objects in the history and it knows that it showed
A already.

If you did not have a blob C, and added two instances of the same
blob C to different paths, you should see C twice if we were showing
"changed" things.  But rev-list is not about showing changes.  We
show C once for a path, and it is not shown again for the other
path.  The other one needs not to be shown, because rev-list is
enumerating the objects in the history and it knows that it showed C
already.

>  * blobs / subtrees are duplicated in the output if they were previously shown
>    with a different name

I suspect that adding "if they were ... with a DIFFERENT name" would
probably make it prohibitively more expensive.  The traversal done
by rev-list is fundamentally driven by "have we shown this object
(1-bit)?" to avoid duplicated work, and not "at which path did this
object appear (unbounded amount of information)?".

As I told you in the message you are responding to, I think I am OK
with the idea of showing a bit more from "rev-list --objects", but
such an enhancement may need to be designed a bit better, I am
afraid.

I suspect that "rev-list | diff-tree --stdin" without "--objects"
added to the upstream "rev-list" might be closer to what you are
looking for in this case, but I dunno.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-24 19:28 ` Jeff King
@ 2023-03-28  8:08   ` Baumann, Moritz
  2023-03-28 18:26     ` [PATCH] docs: document caveats of rev-list's object-name output Jeff King
  2023-03-28 18:32     ` Feature Request: Option to make "git rev-list --objects" output duplicate objects Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Baumann, Moritz @ 2023-03-28  8:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git

> Another problem you might not have run into yet: the names given by
> rev-list are not quoted in any way, and will just omit newlines. So if
> your hook is trying to avoid malicious garbage like "foo\nbar", it won't
> work.

Thanks for that warning. I was not aware that rev-list didn't quote file names.

> Those names are really just intended as hints for pack-objects. I
> suspect the documentation could be more clear about these limitations.

That would indeed be great and would have likely prevented the obvious
misconceptions on my side.

> I'm not sure what you mean by "one by one", since that is inherently
> what rev-list is doing under the hood. If you mean "running a separate
> process for each commit", then yes, that will be slow.

Yes, that's what I meant to say.

> But if you want
> to know all of the names touched in a set of commits, I have used
> something like this before:
>
>   git rev-list $new --not --all |
>   git diff-tree --stdin --format= -r -c --name-only

Thanks, that looks promising and solves at least one of my use cases. The only
minor problem is that there seems to be no way to pipe the diff-tree output to
cat-file without massaging it with awk first.

I have three uses cases in my pre-receive hooks:

1. Filters solely based on the file name
   ? your suggestions works perfectly here
2. Filters based only on file contents
   ? git rev-list --objects + git cat-file provide everything I need
3. One filter based on file size and name (forbid large files, with exceptions)
   ? I'm guessing "git rev-list | git diff-tree --stdin | awk |
     git cat-file --batch-check" is the best solution to extract the necessary
     information from git in this case?

-- Moritz

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] docs: document caveats of rev-list's object-name output
  2023-03-28  8:08   ` Baumann, Moritz
@ 2023-03-28 18:26     ` Jeff King
  2023-03-30 10:32       ` Baumann, Moritz
  2023-03-28 18:32     ` Feature Request: Option to make "git rev-list --objects" output duplicate objects Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2023-03-28 18:26 UTC (permalink / raw)
  To: Baumann, Moritz; +Cc: Junio C Hamano, git

On Tue, Mar 28, 2023 at 08:08:02AM +0000, Baumann, Moritz wrote:

> > Those names are really just intended as hints for pack-objects. I
> > suspect the documentation could be more clear about these limitations.
> 
> That would indeed be great and would have likely prevented the obvious
> misconceptions on my side.

Here's what I came up with.

-- >8 --
Subject: docs: document caveats of rev-list's object-name output

At first glance, the names given by "rev-list --objects" seem like a
good way to see which paths are present in a set of commits. But there
are some subtle gotchas there. We do not document the format of the
names at all, so let's do so, along with warning of these problems.

I intentionally did not document the exact format of the names here, as
I don't think it's something we want people to rely on (though I doubt
in practice that we'd change it at this point).

Though all of this is historically tied to "--objects", these days we
have a separate "--object-names" flag which can turn the names off or
on. So I put the detailed documentation there, but added a note from
--objects (which did not otherwise mention the names at all, even though
they are on by default).

Signed-off-by: Jeff King <peff@peff.net>
---
I also considered adding a specific "if you want the names of each file
in a range of commits, pipe to diff-tree" example. But it seemed like it
would clutter up this section. It might be OK as a stand-alone in the
EXAMPLES section, but should probably be done as a separate patch if
anyone is interested.

 Documentation/rev-list-options.txt | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 90c73d6708b..3000888a908 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -890,7 +890,7 @@ ifdef::git-rev-list[]
 	Print the object IDs of any object referenced by the listed
 	commits.  `--objects foo ^bar` thus means ``send me
 	all object IDs which I need to download if I have the commit
-	object _bar_ but not _foo_''.
+	object _bar_ but not _foo_''. See also `--object-names` below.
 
 --in-commit-order::
 	Print tree and blob ids in order of the commits. The tree
@@ -920,7 +920,12 @@ ifdef::git-rev-list[]
 
 --object-names::
 	Only useful with `--objects`; print the names of the object IDs
-	that are found. This is the default behavior.
+	that are found. This is the default behavior. Note that the
+	"name" of each object is ambiguous, and mostly intended as a
+	hint for packing objects. In particular: no distinction is made between
+	the names of tags, trees, and blobs; path names may be modified
+	to remove newlines; and if an object would appear multiple times
+	with different names, only one name is shown.
 
 --no-object-names::
 	Only useful with `--objects`; does not print the names of the object
-- 
2.40.0.616.gf524ec75088


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Feature Request: Option to make "git rev-list --objects" output duplicate objects
  2023-03-28  8:08   ` Baumann, Moritz
  2023-03-28 18:26     ` [PATCH] docs: document caveats of rev-list's object-name output Jeff King
@ 2023-03-28 18:32     ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2023-03-28 18:32 UTC (permalink / raw)
  To: Baumann, Moritz; +Cc: git

On Tue, Mar 28, 2023 at 08:08:02AM +0000, Baumann, Moritz wrote:

> > But if you want
> > to know all of the names touched in a set of commits, I have used
> > something like this before:
> >
> >   git rev-list $new --not --all |
> >   git diff-tree --stdin --format= -r -c --name-only
> 
> Thanks, that looks promising and solves at least one of my use cases. The only
> minor problem is that there seems to be no way to pipe the diff-tree output to
> cat-file without massaging it with awk first.
> 
> I have three uses cases in my pre-receive hooks:
> 
> 1. Filters solely based on the file name
>    ? your suggestions works perfectly here
> 2. Filters based only on file contents
>    ? git rev-list --objects + git cat-file provide everything I need
> 3. One filter based on file size and name (forbid large files, with exceptions)
>    ? I'm guessing "git rev-list | git diff-tree --stdin | awk |
>      git cat-file --batch-check" is the best solution to extract the necessary
>      information from git in this case?

Yes, that's how I would do all of those. Having to massage the output
between diff-tree and cat-file is a little annoying, but at least can
still be done in O(1) processes. And you really need some language
capable of parsing cat-file output anyway, so it's not too big of a
lift.

One thing we did at GitHub is teach index-pack to collect a list of
too-big paths (since naturally it knows the size of every blob as it
indexes it), and write them out to a special path. Then our pre-receive
hook could quickly check the list and act on it (warning above a certain
size, rejecting above another), without having to traverse again.

I haven't sent those patches upstream, though, for a few reasons:

  - we only hooked index-pack, not unpack-objects (we only use the
    former, not the latter, but in stock Git you might see either)

  - getting just the object names is kind of awkward. You have to then
    invoke rev-list to find the names of hits (though at least you only
    do so when there is a problem object; the happy case remains fast).

  - it's actually not that helpful to avoid the traversal if you have
    other stuff you want to check anyway (like file contents, or names).
    It was one of those things we optimized a long time ago, and I kind
    of doubt is doing much good these days.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] docs: document caveats of rev-list's object-name output
  2023-03-28 18:26     ` [PATCH] docs: document caveats of rev-list's object-name output Jeff King
@ 2023-03-30 10:32       ` Baumann, Moritz
  0 siblings, 0 replies; 9+ messages in thread
From: Baumann, Moritz @ 2023-03-30 10:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

> > > Those names are really just intended as hints for pack-objects. I
> > > suspect the documentation could be more clear about these limitations.
> >
> > That would indeed be great and would have likely prevented the obvious
> > misconceptions on my side.
>
> Here's what I came up with.

Thanks, this is one half of what I would have needed to read.

> I also considered adding a specific "if you want the names of each file
> in a range of commits, pipe to diff-tree" example. But it seemed like it
> would clutter up this section. It might be OK as a stand-alone in the
> EXAMPLES section, but should probably be done as a separate patch if
> anyone is interested.

That would be the other half. Since e.g. GitHub's own "best practice" examples
do not use this pattern [0], I would assume that I'm not the only one who
didn't know about it.

[0] https://github.com/github/platform-samples/blob/master/pre-receive-hooks/block_file_extensions.sh

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-03-30 10:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-24 15:51 Feature Request: Option to make "git rev-list --objects" output duplicate objects Baumann, Moritz
2023-03-24 16:50 ` Junio C Hamano
2023-03-27  7:02   ` Baumann, Moritz
2023-03-27 16:07     ` Junio C Hamano
2023-03-24 19:28 ` Jeff King
2023-03-28  8:08   ` Baumann, Moritz
2023-03-28 18:26     ` [PATCH] docs: document caveats of rev-list's object-name output Jeff King
2023-03-30 10:32       ` Baumann, Moritz
2023-03-28 18:32     ` Feature Request: Option to make "git rev-list --objects" output duplicate objects Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.