From: Junio C Hamano <gitster@pobox.com>
To: Stefan Beller <sbeller@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/1] diffcore: add a filter to find a specific blob
Date: Fri, 24 Nov 2017 16:43:49 +0900
Message-ID: <xmqqpo88m896.fsf@gitster.mtv.corp.google.com>
In-Reply-To: <20171120222529.24995-2-sbeller@google.com> (Stefan Beller's message of "Mon, 20 Nov 2017 14:25:29 -0800")

Stefan Beller <sbeller@google.com> writes:

> Sometimes users are given a hash of an object and they want to
> identify it further (ex.: Use verify-pack to find the largest blobs,
> but what are these? or [1])
>
> One might be tempted to extend git-describe to also work with blobs,
> such that `git describe <blob-id>` gives a description as
> '<commit-ish>:<path>'.  This was implemented at [2]; as seen by the sheer
> number of responses (>110), it turns out this is tricky to get right.
> The hard part to get right is picking the correct 'commit-ish' as that
> could be the commit that (re-)introduced the blob or the commit that
> removed the blob; the blob could exist in different branches.
>
> Junio hinted at a different approach of solving this problem, which this
> patch implements. Teach the diff machinery another flag for restricting
> the information to what is shown. For example:
>
>   $ ./git log --oneline --blobfind=v2.0.0:Makefile
>   b2feb64309 Revert the whole "ask curl-config" topic for now
>   47fbfded53 i18n: only extract comments marked with "TRANSLATORS:"
>
> we observe that the Makefile as shipped with 2.0 was introduced in
> v1.9.2-471-g47fbfded53 and replaced in v2.0.0-rc1-5-gb2feb64309 by
> a different blob.
>
> [1] https://stackoverflow.com/questions/223678/which-commit-has-this-blob
> [2] https://public-inbox.org/git/20171028004419.10139-1-sbeller@google.com/
>
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>
> On playing around with this, trying to find more interesting cases, I observed:
>
>     git log --oneline --blobfind=HEAD:COPYING
>     703601d678 Update COPYING with GPLv2 with new FSF address
>     
>     git log --oneline --blobfind=703601d678^:COPYING
>     459b8d22e5 tests: do not borrow from COPYING and README from the real source
>     703601d678 Update COPYING with GPLv2 with new FSF address
>     075b845a85 Add a COPYING notice, making it explicit that the license is GPLv2.
>
>     t/diff-lib/COPYING may need an update of the address of the FSF,
>     # leftoverbits I guess.

I do not think so.  See tz/fsf-address-update topic for details.

Please do not contaminate the list archive with careless mention of 
"hash-mark plus left over bits", as it will make searching the real
good bits harder.  Thanks.

> Another interesting case that I found was
>    git log --oneline --blobfind=v2.14.0:Makefile
>    3921a0b3c3 perf: add test for writing the index
>    36f048c5e4 sha1dc: build git plumbing code more explicitly
>    2118805b92 Makefile: add style build rule
>
> all of which were after v2.14, such that the introduction of that blob doesn't
> show up; I suspect it came in via a merge as unrelated series may have updated
> the Makefile in parallel, though git-log should have told me?

If that is the case, shouldn't we make this new mode imply
--full-history to forbid history simplification?  "git log" is a
tool to find _an_ explanation of the current state, and the usual
history simplification makes tons of sense there, but blobfind is
most likely run in order to find _all_ mentions of the set of blobs
given.

> diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> index dd0dba5b1d..252a21cc19 100644
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -500,6 +500,10 @@ information.
>  --pickaxe-regex::
>  	Treat the <string> given to `-S` as an extended POSIX regular
>  	expression to match.
> +--blobfind=<blob-id>::
> +	Restrict the output such that one side of the diff
> +	matches the given blob-id.
> +
>  endif::git-format-patch[]

Can we have a blank line between these enumerations to make the
source easier to read?  Thanks.

> diff --git a/diffcore-blobfind.c b/diffcore-blobfind.c
> new file mode 100644
> index 0000000000..5d222fc336
> --- /dev/null
> +++ b/diffcore-blobfind.c
> @@ -0,0 +1,51 @@
> +/*
> + * Copyright (c) 2017 Google Inc.
> + */
> +#include "cache.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +
> +static void diffcore_filter_blobs(struct diff_queue_struct *q,
> +				  struct diff_options *options)
> +{
> +	int i, j = 0, c = q->nr;
> +
> +	if (!options->blobfind)
> +		BUG("blobfind oidset not initialized???");
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +
> +		if (DIFF_PAIR_UNMERGED(p) ||
> +		    (DIFF_FILE_VALID(p->one) &&
> +		     oidset_contains(options->blobfind, &p->one->oid)) ||
> +		    (DIFF_FILE_VALID(p->two) &&
> +		     oidset_contains(options->blobfind, &p->two->oid)))
> +			continue;

So, we keep an unmerged pair, or a pair that mentions a sought blob
on either side?  I am not sure if we want to keep the unmerged pair
for the purpose of this one.

> +		diff_free_filepair(p);
> +		q->queue[i] = NULL;
> +		c--;

Also, since you have already introduced another counter 'j' that is
initialized to 0, I think it makes more sense to do the shrinking
in-place.  'i' stays the source-scan index that runs from 0 through
q->nr - 1, while 'j' can be used in this loop (where you have
'continue') to move the current entry that is determined to survive
from q->queue[i] to q->queue[j++].

Then you do not need 'c'; when the loop ends, 'j' would be the
number of surviving entries and q->nr can be adjusted to it.  Unlike
the usual pattern taken by the other diffcore transformations where
a new queue is populated and the old one discarded, this would leave
the q->queue[] over-allocated, but I do not think it is too bad.
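
The in-place shrinking described above could look roughly like the
following.  This is only a sketch under stated assumptions, not the
actual patch: 'struct pair', 'struct queue' and the 'wanted' flag are
hypothetical stand-ins for diff_filepair, diff_queue_struct and the
oidset lookup, and free() stands in for diff_free_filepair().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct diff_filepair. */
struct pair {
	int id;
	int wanted;	/* nonzero if either side matches a sought blob */
};

/* Hypothetical stand-in for struct diff_queue_struct. */
struct queue {
	struct pair **queue;
	int nr;
};

/*
 * Keep only the wanted pairs, compacting the array in place.
 * 'i' scans every source slot; 'j' is the next slot to fill with
 * a surviving entry.
 */
static void filter_in_place(struct queue *q)
{
	int i, j = 0;

	for (i = 0; i < q->nr; i++) {
		struct pair *p = q->queue[i];

		assert(j <= i);	/* write index never passes read index */
		if (p->wanted)
			q->queue[j++] = p;	/* survivor moves down */
		else
			free(p);	/* stands in for diff_free_filepair() */
	}
	/* The array stays over-allocated; only the count shrinks. */
	q->nr = j;
}
```

With five entries of which the second, fourth and fifth are wanted,
the queue ends up with q->nr == 3 and the survivors in their original
relative order, without a separate counter like 'c'.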
