Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3] fast-export: allow dumping anonymization mappings
@ 2020-06-19 13:23 Jeff King
  2020-06-19 13:25 ` [PATCH 1/3] fast-export: allow dumping the refname mapping Jeff King
                   ` (4 more replies)
  0 siblings, 5 replies; 64+ messages in thread
From: Jeff King @ 2020-06-19 13:23 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

The "master" branch is special in fast-export in that we don't anonymize
it.  And the reason is that it helps to have some known reference point
between the original and anonymized repo, so you can find the same
commits.

This series gives an alternate way to achieve the same effect, but much
better in that it works for _any_ ref (so if you are trying to reproduce
the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
reproduced in the anonymized repo.

And then we can drop the specialness of "master", which in turn is one
less thing to worry about in Dscho's series to make the default branch
configurable.

  [1/3]: fast-export: allow dumping the refname mapping
  [2/3]: fast-export: anonymize "master" refname
  [3/3]: fast-export: allow dumping the path mapping

 Documentation/git-fast-export.txt | 32 +++++++++++++++++
 builtin/fast-export.c             | 58 +++++++++++++++++++++++++++----
 t/t9351-fast-export-anonymize.sh  | 30 ++++++++++++----
 3 files changed, 106 insertions(+), 14 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
@ 2020-06-19 13:25 ` Jeff King
  2020-06-19 15:51   ` Eric Sunshine
  2020-06-19 13:26 ` [PATCH 2/3] fast-export: anonymize "master" refname Jeff King
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-19 13:25 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

After you anonymize a repository, it can be hard to find which commits
correspond between the original and the result, and thus hard to
reproduce commands that triggered bugs in the original.

Let's make it possible to dump the mapping separate from the output
stream. This can be used by a bug reporter to modify their reproduction
recipe without revealing the original names (see the example in the
documentation).

The implementation is slightly non-obvious. There's no point in the
program where we know the complete set of refs we're going to anonymize.
Nor do we have a complete set of anonymized refs after finishing (we
have a set of anonymized ref path components, but no knowledge of how
those are assembled into complete refs). So we lazily write to the dump
file as we anonymize each name, and keep a list of ones that we've
output in order to avoid duplicates.

Some possible alternatives:

  - we could just output the mapping of anonymized components (e.g.,
    that "foo" became "ref123"). That works OK when you have short
    refnames (e.g., "refs/heads/foo" becomes "refs/heads/ref123"), but
    longer names would require the user to look up each component to
    assemble the result. For example, "refs/remotes/origin/jk/foo" might
    become "refs/remotes/refs37/refs56/refs102".

  - instead of dumping the mapping, the same problem could be solved by
    allowing the user to leave some refs alone. So if you want to
    reproduce "git rev-list branch~17..HEAD" in the anonymized repo, we
    could allow something like:

      git tag anon-one branch
      git tag anon-two HEAD
      git fast-export --anonymize --all \
                      --no-anonymize-ref=anon-one \
		      --no-anonymize-ref=anon-two \
		      >stream

    and then presumably "git rev-list anon-one~17..anon-two" would
    behave the same in the re-imported repository. This is more
    convenient in some ways, but it does require modifying the
    original repository. And the concept doesn't easily extend to
    other fields (e.g., pathnames, which will be addressed in a
    subsequent patch).

  - we could dump before/after commit hashes; combined with rev-parse,
    that could convert these cases (as well as ones using raw hashes).
    But we don't actually know the anonymized commit hashes; we're just
    generating a stream that will produce them in the anonymized repo.

  - likewise, we probably could insert object names or other markers
    into commit messages, blob contents, etc, in order to let a user
    with the original repo figure out which parts correspond. But using
    this gets complicated (I have to find my commits in the result with
    "git log --all --grep" or similar). It also makes it less clear that
    the anonymized repo didn't leak any information (because we are
    relying on object ids being unguessable).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 22 +++++++++++++++++
 builtin/fast-export.c             | 39 +++++++++++++++++++++++++++++++
 t/t9351-fast-export-anonymize.sh  | 12 ++++++++++
 3 files changed, 73 insertions(+)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e8950de3ba..e809bb3f18 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,6 +119,12 @@ by keeping the marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--dump-anonymized-refnames=<file>::
+	Output the mapping of real refnames to anonymized refnames to
+	<file>. The output will contain one line per ref that appears in
+	the output stream, with the original refname, a space, and its
+	anonymized counterpart. See the section on `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -238,6 +244,22 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
 smaller output, and it is usually easy to quickly confirm that there is
 no private data in the stream.
 
+Reproducing some bugs may require referencing particular commits, which
+becomes challenging after the refnames have all been anonymized. You can
+use `--dump-anonymized-refnames` to output the mapping, and then alter
+your reproduction recipe to use the anonymized names. E.g., if you find
+a bug with `git rev-list v1.0..v2.0` in the private repository, you can
+run:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out >stream
+$ grep '^refs/tags/v[12].0' refs.out
+refs/tags/v1.0 refs/tags/ref31
+refs/tags/v2.0 refs/tags/ref50
+---------------------------------------------------
+
+which tells you that `git rev-list ref31..ref50` may produce the same
+bug in the re-imported anonymous repository.
 
 LIMITATIONS
 -----------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162ee..6caea6f290 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -24,6 +24,7 @@
 #include "remote.h"
 #include "blob.h"
 #include "commit-slab.h"
+#include "khash.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -45,6 +46,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
+static FILE *anonymized_refnames_handle;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -118,6 +120,32 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+KHASH_INIT(strset, const char *, int, 0, kh_str_hash_func, kh_str_hash_equal);
+
+struct seen_set {
+	kh_strset_t *set;
+};
+
+static int check_and_mark_seen(struct seen_set *seen, const char *str)
+{
+	int hashret;
+	if (!seen->set)
+		seen->set = kh_init_strset();
+	if (kh_get_strset(seen->set, str) < kh_end(seen->set))
+		return 1;
+	kh_put_strset(seen->set, xstrdup(str), &hashret);
+	return 0;
+}
+
+static void maybe_dump_anon(FILE *out, struct seen_set *seen,
+			    const char *orig, const char *anon)
+{
+	if (!out)
+		return;
+	if (!check_and_mark_seen(seen, orig))
+		fprintf(out, "%s %s\n", orig, anon);
+}
+
 struct anonymized_entry {
 	struct hashmap_entry hash;
 	const char *orig;
@@ -515,6 +543,8 @@ static const char *anonymize_refname(const char *refname)
 	};
 	static struct hashmap refs;
 	static struct strbuf anon = STRBUF_INIT;
+	static struct seen_set seen;
+	const char *full_refname = refname;
 	int i;
 
 	/*
@@ -533,6 +563,8 @@ static const char *anonymize_refname(const char *refname)
 	}
 
 	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
+	maybe_dump_anon(anonymized_refnames_handle, &seen,
+			full_refname, anon.buf);
 	return anon.buf;
 }
 
@@ -1144,6 +1176,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 	char *export_filename = NULL,
 	     *import_filename = NULL,
 	     *import_filename_if_exists = NULL;
+	const char *anonymized_refnames_file = NULL;
 	uint32_t lastimportid;
 	struct string_list refspecs_list = STRING_LIST_INIT_NODUP;
 	struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP;
@@ -1177,6 +1210,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_STRING(0, "dump-anonymized-refnames",
+			   &anonymized_refnames_file, N_("file"),
+			   N_("output anonymized refname mapping to <file>")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
@@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		string_list_clear(&refspecs_list, 1);
 	}
 
+	if (anonymized_refnames_file)
+		anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
+
 	if (use_done_feature)
 		printf("feature done\n");
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 897dc50907..0c5dd2a4fb 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' '
 	! grep "annotated tag" stream
 '
 
+test_expect_success 'refname mapping can be dumped' '
+	git fast-export --anonymize --all \
+		--dump-anonymized-refnames=refs.out >/dev/null &&
+	# we make no guarantees of the exact anonymized names,
+	# so just check that we have the right number and
+	# that a sample line looks sane.
+	# Note that master is not anonymized, and so not included
+	# in the mapping.
+	test_line_count = 6 refs.out &&
+	grep "^refs/heads/other refs/heads/" refs.out
+'
+
 # NOTE: we chdir to the new, anonymized repository
 # after this. All further tests should assume this.
 test_expect_success 'import stream to new repository' '
-- 
2.27.0.480.g4f98dbcb10


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] fast-export: anonymize "master" refname
  2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
  2020-06-19 13:25 ` [PATCH 1/3] fast-export: allow dumping the refname mapping Jeff King
@ 2020-06-19 13:26 ` Jeff King
  2020-06-19 13:29 ` [PATCH 3/3] fast-export: allow dumping the path mapping Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-19 13:26 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

Running "fast-export --anonymize" will leave "refs/heads/master"
untouched in the output, for two reasons:

  - it helped to have some known reference point between the original
    and anonymized repository

  - since it's historically the default branch name, it doesn't leak any
    information

Now that we can ask fast-export to dump the anonymized ref mapping, we
have a much better tool for the first one (because it works for _any_
ref, not just master).

For the second, the notion of "default branch name" is likely to become
configurable soon, at which point the name _does_ leak information.
Let's drop this special case in preparation.

Note that we have to adjust the test a bit, since it relied on using the
name "master" in the anonymized repos. But this gives us a good
opportunity to further test the new dumping feature.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  7 -------
 t/t9351-fast-export-anonymize.sh | 16 ++++++----------
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 6caea6f290..cd0174d514 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -547,13 +547,6 @@ static const char *anonymize_refname(const char *refname)
 	const char *full_refname = refname;
 	int i;
 
-	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
-	 */
-	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
-
 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
 		if (skip_prefix(refname, prefixes[i], &refname)) {
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 0c5dd2a4fb..88847b0f60 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -26,11 +26,8 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
-test_expect_success 'stream allows master as refname' '
-	grep master stream
-'
-
-test_expect_success 'stream omits other refnames' '
+test_expect_success 'stream omits refnames' '
+	! grep master stream &&
 	! grep other stream &&
 	! grep mytag stream
 '
@@ -52,9 +49,7 @@ test_expect_success 'refname mapping can be dumped' '
 	# we make no guarantees of the exact anonymized names,
 	# so just check that we have the right number and
 	# that a sample line looks sane.
-	# Note that master is not anonymized, and so not included
-	# in the mapping.
-	test_line_count = 6 refs.out &&
+	test_line_count = 7 refs.out &&
 	grep "^refs/heads/other refs/heads/" refs.out
 '
 
@@ -69,15 +64,16 @@ test_expect_success 'import stream to new repository' '
 test_expect_success 'result has two branches' '
 	git for-each-ref --format="%(refname)" refs/heads >branches &&
 	test_line_count = 2 branches &&
-	other_branch=$(grep -v refs/heads/master branches)
+	main_branch=$(sed -ne "s,refs/heads/master ,,p" ../refs.out) &&
+	other_branch=$(sed -ne "s,refs/heads/other ,,p" ../refs.out)
 '
 
 test_expect_success 'repo has original shape and timestamps' '
 	shape () {
 		git log --format="%m %ct" --left-right --boundary "$@"
 	} &&
 	(cd .. && shape master...other) >expect &&
-	shape master...$other_branch >actual &&
+	shape $main_branch...$other_branch >actual &&
 	test_cmp expect actual
 '
 
-- 
2.27.0.480.g4f98dbcb10


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 3/3] fast-export: allow dumping the path mapping
  2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
  2020-06-19 13:25 ` [PATCH 1/3] fast-export: allow dumping the refname mapping Jeff King
  2020-06-19 13:26 ` [PATCH 2/3] fast-export: anonymize "master" refname Jeff King
@ 2020-06-19 13:29 ` Jeff King
  2020-06-19 16:00   ` Eric Sunshine
  2020-06-19 19:24   ` Junio C Hamano
  2020-06-19 13:51 ` [PATCH 0/3] fast-export: allow dumping anonymization mappings Johannes Schindelin
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
  4 siblings, 2 replies; 64+ messages in thread
From: Jeff King @ 2020-06-19 13:29 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

When working with an anonymized repo, it can be useful to be able to
refer to particular paths. E.g., reproducing a bug with "git rev-list --
foo.c" in the original repo would need to replace "foo.c" with its
anonymized counterpart to produce the same effect.

We recently taught fast-export to dump the refname mapping. Let's do the
same thing for paths, which can reuse most of the same infrastructure.
Note that the output format isn't unambiguous here (because paths could
contain spaces). That's OK because this is meant to be examined by a
human.

We could also just introduce a "dump mapping" file that shows every
mapping we make. But it would be a bit more awkward to work with, as the
user would have to sort through more data to find the parts they're
interested in (and there are likely to be many more paths than refnames,
making it annoying for people who just want to dump the refnames).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 10 ++++++++++
 builtin/fast-export.c             | 12 ++++++++++++
 t/t9351-fast-export-anonymize.sh  |  8 ++++++++
 3 files changed, 30 insertions(+)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e809bb3f18..c63f109f1d 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -125,6 +125,12 @@ by keeping the marks the same across runs.
 	the output stream, with the original refname, a space, and its
 	anonymized counterpart. See the section on `ANONYMIZING` below.
 
+--dump-anonymized-paths=<file>::
+	Output the mapping of real paths to anonymized paths to <file>.
+	The output will contain one line per path that appears in the
+	output stream, with the original path, a space, and its
+	anonymized counterpart. See the section on `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -261,6 +267,10 @@ refs/tags/v2.0 refs/tags/ref50
 which tells you that `git rev-list ref31..ref50` may produce the same
 bug in the re-imported anonymous repository.
 
+Likewise, `--dump-anonymized-paths` may be useful for a bug that
+involves pathspecs. E.g., `git rev-list v1.0..v2.0 -- foo.c` requires
+knowing the path corresponding to `foo.c` in the result.
+
 LIMITATIONS
 -----------
 
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index cd0174d514..ed1f8daa7f 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -47,6 +47,7 @@ static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
 static FILE *anonymized_refnames_handle;
+static FILE *anonymized_paths_handle;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -211,6 +212,9 @@ static void anonymize_path(struct strbuf *out, const char *path,
 			   struct hashmap *map,
 			   void *(*generate)(const void *, size_t *))
 {
+	static struct seen_set seen;
+	const char *full_path = path;
+
 	while (*path) {
 		const char *end_of_component = strchrnul(path, '/');
 		size_t len = end_of_component - path;
@@ -220,6 +224,8 @@ static void anonymize_path(struct strbuf *out, const char *path,
 		if (*path)
 			strbuf_addch(out, *path++);
 	}
+
+	maybe_dump_anon(anonymized_paths_handle, &seen, full_path, out->buf);
 }
 
 static inline void *mark_to_ptr(uint32_t mark)
@@ -1170,6 +1176,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 	     *import_filename = NULL,
 	     *import_filename_if_exists = NULL;
 	const char *anonymized_refnames_file = NULL;
+	const char *anonymized_paths_file = NULL;
 	uint32_t lastimportid;
 	struct string_list refspecs_list = STRING_LIST_INIT_NODUP;
 	struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP;
@@ -1206,6 +1213,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING(0, "dump-anonymized-refnames",
 			   &anonymized_refnames_file, N_("file"),
 			   N_("output anonymized refname mapping to <file>")),
+		OPT_STRING(0, "dump-anonymized-paths",
+			   &anonymized_paths_file, N_("file"),
+			   N_("output anonymized path mapping to <file>")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
@@ -1244,6 +1254,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 
 	if (anonymized_refnames_file)
 		anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
+	if (anonymized_paths_file)
+		anonymized_paths_handle = xfopen(anonymized_paths_file, "w");
 
 	if (use_done_feature)
 		printf("feature done\n");
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 88847b0f60..3607b9b972 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -53,6 +53,14 @@ test_expect_success 'refname mapping can be dumped' '
 	grep "^refs/heads/other refs/heads/" refs.out
 '
 
+test_expect_success 'path mapping can be dumped' '
+	git fast-export --anonymize --all \
+		--dump-anonymized-paths=paths.out >/dev/null &&
+	# do not assume a particular anonymization scheme or order;
+	# just sanity check that a sample line looks sensible.
+	grep "^foo " paths.out
+'
+
 # NOTE: we chdir to the new, anonymized repository
 # after this. All further tests should assume this.
 test_expect_success 'import stream to new repository' '
-- 
2.27.0.480.g4f98dbcb10

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] fast-export: allow dumping anonymization mappings
  2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
                   ` (2 preceding siblings ...)
  2020-06-19 13:29 ` [PATCH 3/3] fast-export: allow dumping the path mapping Jeff King
@ 2020-06-19 13:51 ` Johannes Schindelin
  2020-06-22 16:35   ` Junio C Hamano
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
  4 siblings, 1 reply; 64+ messages in thread
From: Johannes Schindelin @ 2020-06-19 13:51 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hi Peff,

On Fri, 19 Jun 2020, Jeff King wrote:

> The "master" branch is special in fast-export in that we don't anonymize
> it.  And the reason is that it helps to have some known reference point
> between the original and anonymized repo, so you can find the same
> commits.
>
> This series gives an alternate way to achieve the same effect, but much
> better in that it works for _any_ ref (so if you are trying to reproduce
> the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
> easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
> reproduced in the anonymized repo.
>
> And then we can drop the specialness of "master", which in turn is one
> less thing to worry about in Dscho's series to make the default branch
> configurable.

Thank you for working on this!

I don't have any suggestions on top of Eric's and Junio's.

I'll drop the `fast-export` patches from the next round of my patch
series.

Ciao,
Dscho

>
>   [1/3]: fast-export: allow dumping the refname mapping
>   [2/3]: fast-export: anonymize "master" refname
>   [3/3]: fast-export: allow dumping the path mapping
>
>  Documentation/git-fast-export.txt | 32 +++++++++++++++++
>  builtin/fast-export.c             | 58 +++++++++++++++++++++++++++----
>  t/t9351-fast-export-anonymize.sh  | 30 ++++++++++++----
>  3 files changed, 106 insertions(+), 14 deletions(-)
>
> -Peff
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 13:25 ` [PATCH 1/3] fast-export: allow dumping the refname mapping Jeff King
@ 2020-06-19 15:51   ` Eric Sunshine
  2020-06-19 16:01     ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-19 15:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 9:25 AM Jeff King <peff@peff.net> wrote:
> [...]
> Let's make it possible to dump the mapping separate from the output
> stream. This can be used by a bug reporter to modify their reproduction
> recipe without revealing the original names (see the example in the
> documentation).
> [...]
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> @@ -118,6 +120,32 @@ static int has_unshown_parent(struct commit *commit)
> +static void maybe_dump_anon(FILE *out, struct seen_set *seen,
> +                           const char *orig, const char *anon)
> +{
> +       if (!out)
> +               return;
> +       if (!check_and_mark_seen(seen, orig))
> +               fprintf(out, "%s %s\n", orig, anon);
> +}

Nit: For a function which has only a single caller in this patch and
one more caller added in 3/3, I wonder if the, um, convenience of this
function short-circuiting as the very first thing it does outweighs
the loss of clarity of having the callers make the check themselves.
That is, have the caller do this instead:

    if (anonymized_refnames_handle)
        dump_anon(anonymized_refnames_handle, ...);

> @@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
> +       if (anonymized_refnames_file)
> +               anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");

For completeness, do you want to close this file at some point?

> diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
> @@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' '
> +test_expect_success 'refname mapping can be dumped' '
> +       git fast-export --anonymize --all \
> +               --dump-anonymized-refnames=refs.out >/dev/null &&
> +       # we make no guarantees of the exact anonymized names,
> +       # so just check that we have the right number and
> +       # that a sample line looks sane.
> +       # Note that master is not anonymized, and so not included
> +       # in the mapping.
> +       test_line_count = 6 refs.out &&

This hard-coded "6" seems rather fragile, causing the test to break if
other refs are ever added or removed. Perhaps just count the refs
dynamically?

    git show-ref >refs &&
    nrefs=$(wc -l refs) &&
    # Note that master is not anonymized, and so not included
    # in the mapping.
    nrefs=$((nrefs - 1))
    test_line_count = $nrefs refs.out &&

and then drop the 'nrefs=$((nrefs - 1))' line and associated comment
in patch 2/3 which removes the "master" special case.

> +       grep "^refs/heads/other refs/heads/" refs.out
> +'

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/3] fast-export: allow dumping the path mapping
  2020-06-19 13:29 ` [PATCH 3/3] fast-export: allow dumping the path mapping Jeff King
@ 2020-06-19 16:00   ` Eric Sunshine
  2020-06-19 19:24   ` Junio C Hamano
  1 sibling, 0 replies; 64+ messages in thread
From: Eric Sunshine @ 2020-06-19 16:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 9:29 AM Jeff King <peff@peff.net> wrote:
> [...]
> We recently taught fast-export to dump the refname mapping. Let's do the
> same thing for paths, which can reuse most of the same infrastructure.
> Note that the output format isn't unambiguous here (because paths could
> contain spaces). That's OK because this is meant to be examined by a
> human.
> [...]
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> @@ -1244,6 +1254,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>         if (anonymized_refnames_file)
>                 anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
> +       if (anonymized_paths_file)
> +               anonymized_paths_handle = xfopen(anonymized_paths_file, "w");

Same question as in my patch 1/1 review: Do you want to close this
file at some point?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 15:51   ` Eric Sunshine
@ 2020-06-19 16:01     ` Jeff King
  2020-06-19 16:18       ` Eric Sunshine
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-19 16:01 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote:

> > +static void maybe_dump_anon(FILE *out, struct seen_set *seen,
> > +                           const char *orig, const char *anon)
> > +{
> > +       if (!out)
> > +               return;
> > +       if (!check_and_mark_seen(seen, orig))
> > +               fprintf(out, "%s %s\n", orig, anon);
> > +}
> 
> Nit: For a function which has only a single caller in this patch and
> one more caller added in 3/3, I wonder if the, um, convenience of this
> function short-circuiting as the very first thing it does outweighs
> the loss of clarity of having the callers make the check themselves.
> That is, have the caller do this instead:
> 
>     if (anonymized_refnames_handle)
>         dump_anon(anonymized_refnames_handle, ...);

<shrug> The names were long enough that I thought it was more clear not
to repeat myself. I actually wrote originally:

  if (anonymized_refnames_handle &&
      check_and_mark_seen(&seen, full_refname))
          fprintf(anonymized_refnames_handle, "%s %s\n",
	          full_refname, anon.buf));

but I wasn't sure if it was better not to duplicate the output format.
OTOH, we _could_ be using quote_path() for the path dumping, in which
case the formats would start to differ.

> > @@ -1213,6 +1249,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
> > +       if (anonymized_refnames_file)
> > +               anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
> 
> For completeness, do you want to close this file at some point?

My thought was to just let it auto-close at exit() time, since it's
effectively process-length.

> > diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
> > @@ -46,6 +46,18 @@ test_expect_success 'stream omits tag message' '
> > +test_expect_success 'refname mapping can be dumped' '
> > +       git fast-export --anonymize --all \
> > +               --dump-anonymized-refnames=refs.out >/dev/null &&
> > +       # we make no guarantees of the exact anonymized names,
> > +       # so just check that we have the right number and
> > +       # that a sample line looks sane.
> > +       # Note that master is not anonymized, and so not included
> > +       # in the mapping.
> > +       test_line_count = 6 refs.out &&
> 
> This hard-coded "6" seems rather fragile, causing the test to break if
> other refs are ever added or removed. Perhaps just count the refs
> dynamically?
> 
>     git show-ref >refs &&
>     nrefs=$(wc -l refs) &&
>     # Note that master is not anonymized, and so not included
>     # in the mapping.
>     nrefs=$((nrefs - 1))
>     test_line_count = $nrefs refs.out &&
> 
> and then drop the 'nrefs=$((nrefs - 1))' line and associated comment
> in patch 2/3 which removes the "master" special case.

That's exactly what I wrote originally, but it failed on macos due to
extra spaces in the "wc" output. There are other tests nearby that also
count the refs (e.g., exactly 2 branches), so I didn't think it was
worth trying to deal with the portability issue.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 16:01     ` Jeff King
@ 2020-06-19 16:18       ` Eric Sunshine
  2020-06-19 17:45         ` Jeff King
  2020-06-19 19:20         ` Junio C Hamano
  0 siblings, 2 replies; 64+ messages in thread
From: Eric Sunshine @ 2020-06-19 16:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 12:01 PM Jeff King <peff@peff.net> wrote:
> On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote:
> > That is, have the caller do this instead:
> >
> >   if (anonymized_refnames_handle)
> >     dump_anon(anonymized_refnames_handle, ...);
>
> <shrug> The names were long enough that I thought it was more clear not
> to repeat myself. [...]

Indeed, it's a minor point and subjective. Certainly not worth a
re-roll or even worrying about it.

> > This hard-coded "6" seems rather fragile, causing the test to break if
> > other refs are ever added or removed. Perhaps just count the refs
> > dynamically?
> >
> >   git show-ref >refs &&
> >   nrefs=$(wc -l refs) &&
> >   # Note that master is not anonymized, and so not included
> >   # in the mapping.
> >   nrefs=$((nrefs - 1))
> >   test_line_count = $nrefs refs.out &&
> >
> That's exactly what I wrote originally, but it failed on macos due to
> extra spaces in the "wc" output.

Hmph, that shouldn't have failed. Did you quote the $(wc -l refs)
invocation? Quoting it would cause it to fail.

I just applied this atop your patch and it worked fine on Mac OS:

diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 0c5dd2a4fb..849192b1b0 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -52,9 +52,12 @@ test_expect_success 'refname mapping can be dumped' '
 	# we make no guarantees of the exact anonymized names,
 	# so just check that we have the right number and
 	# that a sample line looks sane.
+	git show-ref >refs &&
+	nrefs=$(wc -l <refs) &&
 	# Note that master is not anonymized, and so not included
 	# in the mapping.
-	test_line_count = 6 refs.out &&
+	nrefs=$((nrefs - 1)) &&
+	test_line_count = $nrefs refs.out &&
 	grep "^refs/heads/other refs/heads/" refs.out
 '

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 16:18       ` Eric Sunshine
@ 2020-06-19 17:45         ` Jeff King
  2020-06-19 18:00           ` Eric Sunshine
  2020-06-19 19:20         ` Junio C Hamano
  1 sibling, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-19 17:45 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 12:18:16PM -0400, Eric Sunshine wrote:

> > That's exactly what I wrote originally, but it failed on macos due to
> > extra spaces in the "wc" output.
> 
> Hmph, that shouldn't have failed. Did you quote the $(wc -l refs)
> invocation? Quoting it would cause it to fail.

Nope (and indeed, I was wary of the issue and made sure I didn't use
quotes). My original was:

test_expect_success 'refname mapping can be dumped' '
       git fast-export --anonymize --all \
               --dump-anonymized-refnames=refs.out >/dev/null &&
       # we make no guarantees of the exact anonymized names,
       # so just check that we have the right number and
       # that a sample line looks sane.
       expected_count=$(git for-each-ref | wc -l) &&
       # Note that master is not anonymized, and so not included
       # in the mapping.
       expected_count=$((expected_count - 1)) &&
       test_line_count = "$expected_count" refs.out &&
       grep "^refs/heads/other refs/heads/" refs.out
'

So I guess I did quote the variable later.  It works fine on Linux, but
one of the osx ci jobs failed:

  https://github.com/peff/git/runs/787911270

The relevant log is:

++ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out
+++ git for-each-ref
+++ wc -l
++ expected_count='       7'
++ test_line_count = '       7' refs.out
++ test 3 '!=' 3
+++ wc -l
++ test 7 = '       7'
++ echo 'test_line_count: line count for refs.out !=        7'
test_line_count: line count for refs.out !=        7

so the whitespace is eaten not when "wc" is run, but rather when the
variable is expanded.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 17:45         ` Jeff King
@ 2020-06-19 18:00           ` Eric Sunshine
  2020-06-22 21:30             ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-19 18:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 1:45 PM Jeff King <peff@peff.net> wrote:
> On Fri, Jun 19, 2020 at 12:18:16PM -0400, Eric Sunshine wrote:
> > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs)
> > invocation? Quoting it would cause it to fail.
>
> Nope (and indeed, I was wary of the issue and made sure I didn't use
> quotes). My original was:
>
> test_expect_success 'refname mapping can be dumped' '
>        [...]
>        expected_count=$(git for-each-ref | wc -l) &&
>        expected_count=$((expected_count - 1)) &&
>        test_line_count = "$expected_count" refs.out &&
>
> So I guess I did quote the variable later.  It works fine on Linux, but
> one of the osx ci jobs failed:
>
> ++ expected_count='       7'
> ++ test_line_count = '       7' refs.out
> ++ test 7 = '       7'
> ++ echo 'test_line_count: line count for refs.out !=        7'
>
> so the whitespace is eaten not when "wc" is run, but rather when the
> variable is expanded.

Not something that should be done by this series (more a
left-over-bitty thing, perhaps), but this almost suggests that
test_line_count() deserves a tweak to make it more robust against that
sort of thing:

    test_line_count () {
        if test $# != 3
        then
            BUG "not 3 parameters to test_line_count"
        elif ! test $(wc -l <"$3") "$1" "$2"
        then
            echo "test_line_count: line count for $3 !$1 $2"
            cat "$3"
            return 1
        fi
    }

If we drop the quotes around $2 from the 'test':

    elif ! test $(wc -l <"$3") "$1" $2

then your code would have worked as expected.

My only worry about that is that a poorly written caller would get a
weird and unhelpful error message:

    test_line_count = 4 4
    --> sh: test: too many arguments

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 16:18       ` Eric Sunshine
  2020-06-19 17:45         ` Jeff King
@ 2020-06-19 19:20         ` Junio C Hamano
  2020-06-22 21:32           ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2020-06-19 19:20 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jeff King, Git List, Johannes Schindelin

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Fri, Jun 19, 2020 at 12:01 PM Jeff King <peff@peff.net> wrote:
>> On Fri, Jun 19, 2020 at 11:51:06AM -0400, Eric Sunshine wrote:
>> > That is, have the caller do this instead:
>> >
>> >   if (anonymized_refnames_handle)
>> >     dump_anon(anonymized_refnames_handle, ...);
>>
>> <shrug> The names were long enough that I thought it was more clear not
>> to repeat myself. [...]
>
> Indeed, it's a minor point and subjective. Certainly not worth a
> re-roll or even worrying about it.

It probably is subjective but fwiw I too find yours easier to
follow.

>> > This hard-coded "6" seems rather fragile, causing the test to break if
>> > other refs are ever added or removed. Perhaps just count the refs
>> > dynamically?
>> >
>> >   git show-ref >refs &&
>> >   nrefs=$(wc -l refs) &&
>> >   # Note that master is not anonymized, and so not included
>> >   # in the mapping.
>> >   nrefs=$((nrefs - 1))
>> >   test_line_count = $nrefs refs.out &&
>> >
>> That's exactly what I wrote originally, but it failed on macos due to
>> extra spaces in the "wc" output.
>
> Hmph, that shouldn't have failed. Did you quote the $(wc -l refs)
> invocation? Quoting it would cause it to fail.

> +	git show-ref >refs &&
> +	nrefs=$(wc -l <refs) &&

Yup, I've seen that workaround for macs too many times and it should
work well.

>  	# Note that master is not anonymized, and so not included
>  	# in the mapping.
> -	test_line_count = 6 refs.out &&
> +	nrefs=$((nrefs - 1)) &&
> +	test_line_count = $nrefs refs.out &&
>  	grep "^refs/heads/other refs/heads/" refs.out
>  '

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/3] fast-export: allow dumping the path mapping
  2020-06-19 13:29 ` [PATCH 3/3] fast-export: allow dumping the path mapping Jeff King
  2020-06-19 16:00   ` Eric Sunshine
@ 2020-06-19 19:24   ` Junio C Hamano
  2020-06-22 21:38     ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2020-06-19 19:24 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Johannes Schindelin

Jeff King <peff@peff.net> writes:

> When working with an anonymized repo, it can be useful to be able to
> refer to particular paths. E.g., reproducing a bug with "git rev-list --
> foo.c" in the original repo would need to replace "foo.c" with its
> anonymized counterpart to produce the same effect.

That would not work with "git rev-list -- 'foo*'", where there is no
counterpart, no?

It almost feels as if we want a separate knob to disable
anonymization for paths and perhaps refs while still anonymizing the
blob contents, or something like that.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] fast-export: allow dumping anonymization mappings
  2020-06-19 13:51 ` [PATCH 0/3] fast-export: allow dumping anonymization mappings Johannes Schindelin
@ 2020-06-22 16:35   ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2020-06-22 16:35 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi Peff,
>
> On Fri, 19 Jun 2020, Jeff King wrote:
>
>> The "master" branch is special in fast-export in that we don't anonymize
>> it.  And the reason is that it helps to have some known reference point
>> between the original and anonymized repo, so you can find the same
>> commits.
>>
>> This series gives an alternate way to achieve the same effect, but much
>> better in that it works for _any_ ref (so if you are trying to reproduce
>> the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
>> easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
>> reproduced in the anonymized repo.
>>
>> And then we can drop the specialness of "master", which in turn is one
>> less thing to worry about in Dscho's series to make the default branch
>> configurable.
>
> Thank you for working on this!
>
> I don't have any suggestions on top of Eric's and Junio's.
>
> I'll drop the `fast-export` patches from the next round of my patch
> series.

Understood.  Thanks, all.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 18:00           ` Eric Sunshine
@ 2020-06-22 21:30             ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:30 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 02:00:47PM -0400, Eric Sunshine wrote:

> > so the whitespace is eaten not when "wc" is run, but rather when the
> > variable is expanded.
> 
> Not something that should be done by this series (more a
> left-over-bitty thing, perhaps), but this almost suggests that
> test_line_count() deserves a tweak to make it more robust against that
> sort of thing:
> 
>     test_line_count () {
>         if test $# != 3
>         then
>             BUG "not 3 parameters to test_line_count"
>         elif ! test $(wc -l <"$3") "$1" "$2"
>         then
>             echo "test_line_count: line count for $3 !$1 $2"
>             cat "$3"
>             return 1
>         fi
>     }
> 
> If we drop the quotes around $2 from the 'test':
> 
>     elif ! test $(wc -l <"$3") "$1" $2
> 
> then your code would have worked as expected.
> 
> My only worry about that is that a poorly written caller would get a
> weird and unhelpful error message:
> 
>     test_line_count = 4 4
>     --> sh: test: too many arguments

I think your unhelpful-error-message case would happen only if the
length argument contains two non-whitespace tokens separated by a
whitespace (so the shell splits them into two arguments), _and_ the
caller passed that argument in quotes (otherwise the shell would split
it at the function call and we'd hit the BUG message). In which case,
what are they trying to do with passing "4 4" to "test"? And since we're
using "=" and not "-eq", I think "test" would be complaining about that.

So it seems like it might be a reasonable change to make things more
friendly.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] fast-export: allow dumping the refname mapping
  2020-06-19 19:20         ` Junio C Hamano
@ 2020-06-22 21:32           ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, Git List, Johannes Schindelin

On Fri, Jun 19, 2020 at 12:20:38PM -0700, Junio C Hamano wrote:

> >> <shrug> The names were long enough that I thought it was more clear not
> >> to repeat myself. [...]
> >
> > Indeed, it's a minor point and subjective. Certainly not worth a
> > re-roll or even worrying about it.
> 
> It probably is subjective but fwiw I too find yours easier to
> follow.

I ended up doing it this way, since I decided to quote the pathnames
(and thus they needed their own fprintf, at which point there was
really nothing left for the two to share).

> > Hmph, that shouldn't have failed. Did you quote the $(wc -l refs)
> > invocation? Quoting it would cause it to fail.
> 
> > +	git show-ref >refs &&
> > +	nrefs=$(wc -l <refs) &&
> 
> Yup, I've seen that workaround for macs too many times and it should
> work well.

I switched to this style in the re-roll (and ditto for the path output,
which actually turned up a bug).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/3] fast-export: allow dumping the path mapping
  2020-06-19 19:24   ` Junio C Hamano
@ 2020-06-22 21:38     ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Johannes Schindelin

On Fri, Jun 19, 2020 at 12:24:30PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > When working with an anonymized repo, it can be useful to be able to
> > refer to particular paths. E.g., reproducing a bug with "git rev-list --
> > foo.c" in the original repo would need to replace "foo.c" with its
> > anonymized counterpart to produce the same effect.
> 
> That would not work with "git rev-list -- 'foo*'", where there is no
> counterpart, no?

Correct. There's no way that can work without treating the "foo" prefix
as a token and anonymizing it consistently. We currently only treat full
path components as tokens. So "foo/" will always be "ref123/", but
"foobar" will have no relation.

> It almost feels as if we want a separate knob to disable
> anonymization for paths and perhaps refs while still anonymizing the
> blob contents, or something like that.

I think such a knob could be useful for some situations, but I don't
think it's a complete replacement for a mapping dump, since you may not
want to reveal your pathnames (or refnames). Unlike refnames, where I
can easily point a tag "foo" at a commit of interest and then ask that
"foo" not be anonymized, pathnames are harder to change.

An alternative would be to let users seed the mapping (e.g., to ask that
"foo1" and "foo2" become "anon1" and "anon2"). That requires a little
more forethought on the part of the user, but it also avoids them having
to dig through the mapping to rewrite their commands (instead they use
the mappings they already invented).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 0/4] fast-export: allow dumping anonymization mappings
  2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
                   ` (3 preceding siblings ...)
  2020-06-19 13:51 ` [PATCH 0/3] fast-export: allow dumping anonymization mappings Johannes Schindelin
@ 2020-06-22 21:47 ` Jeff King
  2020-06-22 21:47   ` [PATCH v2 1/4] fast-export: allow dumping the refname mapping Jeff King
                     ` (4 more replies)
  4 siblings, 5 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:47 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Fri, Jun 19, 2020 at 09:23:04AM -0400, Jeff King wrote:

> This series gives an alternate way to achieve the same effect, but much
> better in that it works for _any_ ref (so if you are trying to reproduce
> the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
> easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
> reproduced in the anonymized repo.

Here's a v2 which I think addresses all of the comments. I have to admit
that after writing my last email to Junio, I am wondering whether it
would be sufficient and simpler to let the user specify a static mapping
of tokens (that could just be applied anywhere).

I'll take a look at that, but since I worked up this version, here it is
in the meantime.

The interesting changes are:

  - path output is now quoted, making it unambiguous. The intent is for
    humans to look at it, but it's not much extra work to make it
    machine readable, too.

  - the path dumping was in the wrong spot. It was happening in the
    generic function that's used for "path-like" things, including
    refnames. So the path mapping dump had extra cruft in it.

  - got rid of the maybe_dump_anon() helper

  - tests now avoid hard-coding expected counts

  - the path-dump test now checks the expected count

  [1/4]: fast-export: allow dumping the refname mapping
  [2/4]: fast-export: anonymize "master" refname
  [3/4]: fast-export: refactor path printing to not rely on stdout
  [4/4]: fast-export: allow dumping the path mapping

 Documentation/git-fast-export.txt | 34 +++++++++++++++
 builtin/fast-export.c             | 69 +++++++++++++++++++++++++------
 t/t9351-fast-export-anonymize.sh  | 44 ++++++++++++++++----
 3 files changed, 125 insertions(+), 22 deletions(-)

Range-diff from v1:

1:  82a17ae976 ! 1:  7ba5582d66 fast-export: allow dumping the refname mapping
    @@ builtin/fast-export.c: static int has_unshown_parent(struct commit *commit)
     +	kh_put_strset(seen->set, xstrdup(str), &hashret);
     +	return 0;
     +}
    -+
    -+static void maybe_dump_anon(FILE *out, struct seen_set *seen,
    -+			    const char *orig, const char *anon)
    -+{
    -+	if (!out)
    -+		return;
    -+	if (!check_and_mark_seen(seen, orig))
    -+		fprintf(out, "%s %s\n", orig, anon);
    -+}
     +
      struct anonymized_entry {
      	struct hashmap_entry hash;
    @@ builtin/fast-export.c: static const char *anonymize_refname(const char *refname)
      	}
      
      	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
    -+	maybe_dump_anon(anonymized_refnames_handle, &seen,
    ++
    ++	if (anonymized_refnames_handle &&
    ++	    !check_and_mark_seen(&seen, full_refname))
    ++		fprintf(anonymized_refnames_handle, "%s %s\n",
     +			full_refname, anon.buf);
    ++
      	return anon.buf;
      }
      
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits tag message'
     +	# we make no guarantees of the exact anonymized names,
     +	# so just check that we have the right number and
     +	# that a sample line looks sane.
    ++	expected_count=$(git for-each-ref | wc -l) &&
     +	# Note that master is not anonymized, and so not included
     +	# in the mapping.
    -+	test_line_count = 6 refs.out &&
    ++	expected_count=$((expected_count - 1)) &&
    ++	test_line_count = $expected_count refs.out &&
     +	grep "^refs/heads/other refs/heads/" refs.out
     +'
     +
2:  be56b375cc ! 2:  d88f7c83a5 fast-export: anonymize "master" refname
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names'
      	! grep mytag stream
      '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
    - 	# we make no guarantees of the exact anonymized names,
      	# so just check that we have the right number and
      	# that a sample line looks sane.
    + 	expected_count=$(git for-each-ref | wc -l) &&
     -	# Note that master is not anonymized, and so not included
     -	# in the mapping.
    --	test_line_count = 6 refs.out &&
    -+	test_line_count = 7 refs.out &&
    +-	expected_count=$((expected_count - 1)) &&
    + 	test_line_count = $expected_count refs.out &&
      	grep "^refs/heads/other refs/heads/" refs.out
      '
    - 
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'import stream to new repository' '
      test_expect_success 'result has two branches' '
      	git for-each-ref --format="%(refname)" refs/heads >branches &&
-:  ---------- > 3:  164f1e1eab fast-export: refactor path printing to not rely on stdout
3:  a4e9f1f2ac ! 4:  b0aa59f07e fast-export: allow dumping the path mapping
    @@ Commit message
     
         We recently taught fast-export to dump the refname mapping. Let's do the
         same thing for paths, which can reuse most of the same infrastructure.
    -    Note that the output format isn't unambiguous here (because paths could
    -    contain spaces). That's OK because this is meant to be examined by a
    -    human.
     
         We could also just introduce a "dump mapping" file that shows every
         mapping we make. But it would be a bit more awkward to work with, as the
    @@ Documentation/git-fast-export.txt: by keeping the marks the same across runs.
     +	Output the mapping of real paths to anonymized paths to <file>.
     +	The output will contain one line per path that appears in the
     +	output stream, with the original path, a space, and its
    -+	anonymized counterpart. See the section on `ANONYMIZING` below.
    ++	anonymized counterpart. Paths may be quoted if they contain a
    ++	space, or unusual characters; see `core.quotePath` in
    ++	linkgit:git-config(1). See also `ANONYMIZING` below.
     +
      --reference-excluded-parents::
      	By default, running a command such as `git fast-export
    @@ builtin/fast-export.c: static struct string_list tag_refs = STRING_LIST_INIT_NOD
      static struct revision_sources revision_sources;
      
      static int parse_opt_signed_tag_mode(const struct option *opt,
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 			   struct hashmap *map,
    - 			   void *(*generate)(const void *, size_t *))
    - {
    -+	static struct seen_set seen;
    -+	const char *full_path = path;
    +@@ builtin/fast-export.c: static void print_path(const char *path)
    + 		print_path_1(stdout, path);
    + 	else {
    + 		static struct hashmap paths;
    ++		static struct seen_set seen;
    + 		static struct strbuf anon = STRBUF_INIT;
    + 
    + 		anonymize_path(&anon, path, &paths, anonymize_path_component);
    ++		if (anonymized_paths_handle &&
    ++		    !check_and_mark_seen(&seen, path)) {
    ++			print_path_1(anonymized_paths_handle, path);
    ++			fputc(' ', anonymized_paths_handle);
    ++			print_path_1(anonymized_paths_handle, anon.buf);
    ++			fputc('\n', anonymized_paths_handle);
    ++		}
     +
    - 	while (*path) {
    - 		const char *end_of_component = strchrnul(path, '/');
    - 		size_t len = end_of_component - path;
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 		if (*path)
    - 			strbuf_addch(out, *path++);
    + 		print_path_1(stdout, anon.buf);
    + 		strbuf_reset(&anon);
      	}
    -+
    -+	maybe_dump_anon(anonymized_paths_handle, &seen, full_path, out->buf);
    - }
    - 
    - static inline void *mark_to_ptr(uint32_t mark)
     @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const char *prefix)
      	     *import_filename = NULL,
      	     *import_filename_if_exists = NULL;
    @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const ch
      		printf("feature done\n");
     
      ## t/t9351-fast-export-anonymize.sh ##
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'setup simple repo' '
    + 	git checkout -b other HEAD^ &&
    + 	mkdir subdir &&
    + 	test_commit subdir/bar &&
    +-	test_commit subdir/xyzzy &&
    ++	test_commit quoting "subdir/this needs quoting" &&
    + 	git tag -m "annotated tag" mytag
    + '
    + 
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names' '
    + 	! grep foo stream &&
    + 	! grep subdir stream &&
    + 	! grep bar stream &&
    +-	! grep xyzzy stream
    ++	! grep quoting stream
    + '
    + 
    + test_expect_success 'stream omits refnames' '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
      	grep "^refs/heads/other refs/heads/" refs.out
      '
      
     +test_expect_success 'path mapping can be dumped' '
     +	git fast-export --anonymize --all \
     +		--dump-anonymized-paths=paths.out >/dev/null &&
    -+	# do not assume a particular anonymization scheme or order;
    -+	# just sanity check that a sample line looks sensible.
    -+	grep "^foo " paths.out
    ++	# as above, avoid depending on the exact scheme, but
    ++	# but check that we have the right number of mappings,
    ++	# and spot-check one sample.
    ++	expected_count=$(
    ++		git rev-list --objects --all |
    ++		git cat-file --batch-check="%(objecttype) %(rest)" |
    ++		sed -ne "s/^blob //p" |
    ++		sort -u |
    ++		wc -l
    ++	) &&
    ++	test_line_count = $expected_count paths.out &&
    ++	grep "^\"subdir/this needs quoting\" " paths.out
     +'
     +
      # NOTE: we chdir to the new, anonymized repository

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 1/4] fast-export: allow dumping the refname mapping
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
@ 2020-06-22 21:47   ` Jeff King
  2020-06-22 21:48   ` [PATCH v2 2/4] fast-export: anonymize "master" refname Jeff King
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:47 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

After you anonymize a repository, it can be hard to find which commits
correspond between the original and the result, and thus hard to
reproduce commands that triggered bugs in the original.

Let's make it possible to dump the mapping separate from the output
stream. This can be used by a bug reporter to modify their reproduction
recipe without revealing the original names (see the example in the
documentation).

The implementation is slightly non-obvious. There's no point in the
program where we know the complete set of refs we're going to anonymize.
Nor do we have a complete set of anonymized refs after finishing (we
have a set of anonymized ref path components, but no knowledge of how
those are assembled into complete refs). So we lazily write to the dump
file as we anonymize each name, and keep a list of ones that we've
output in order to avoid duplicates.

Some possible alternatives:

  - we could just output the mapping of anonymized components (e.g.,
    that "foo" became "ref123"). That works OK when you have short
    refnames (e.g., "refs/heads/foo" becomes "refs/heads/ref123"), but
    longer names would require the user to look up each component to
    assemble the result. For example, "refs/remotes/origin/jk/foo" might
    become "refs/remotes/refs37/refs56/refs102".

  - instead of dumping the mapping, the same problem could be solved by
    allowing the user to leave some refs alone. So if you want to
    reproduce "git rev-list branch~17..HEAD" in the anonymized repo, we
    could allow something like:

      git tag anon-one branch
      git tag anon-two HEAD
      git fast-export --anonymize --all \
                      --no-anonymize-ref=anon-one \
		      --no-anonymize-ref=anon-two \
		      >stream

    and then presumably "git rev-list anon-one~17..anon-two" would
    behave the same in the re-imported repository. This is more
    convenient in some ways, but it does require modifying the
    original repository. And the concept doesn't easily extend to
    other fields (e.g., pathnames, which will be addressed in a
    subsequent patch).

  - we could dump before/after commit hashes; combined with rev-parse,
    that could convert these cases (as well as ones using raw hashes).
    But we don't actually know the anonymized commit hashes; we're just
    generating a stream that will produce them in the anonymized repo.

  - likewise, we probably could insert object names or other markers
    into commit messages, blob contents, etc, in order to let a user
    with the original repo figure out which parts correspond. But using
    this gets complicated (I have to find my commits in the result with
    "git log --all --grep" or similar). It also makes it less clear that
    the anonymized repo didn't leak any information (because we are
    relying on object ids being unguessable).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 22 ++++++++++++++++++++
 builtin/fast-export.c             | 34 +++++++++++++++++++++++++++++++
 t/t9351-fast-export-anonymize.sh  | 14 +++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e8950de3ba..e809bb3f18 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,6 +119,12 @@ by keeping the marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--dump-anonymized-refnames=<file>::
+	Output the mapping of real refnames to anonymized refnames to
+	<file>. The output will contain one line per ref that appears in
+	the output stream, with the original refname, a space, and its
+	anonymized counterpart. See the section on `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -238,6 +244,22 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
 smaller output, and it is usually easy to quickly confirm that there is
 no private data in the stream.
 
+Reproducing some bugs may require referencing particular commits, which
+becomes challenging after the refnames have all been anonymized. You can
+use `--dump-anonymized-refnames` to output the mapping, and then alter
+your reproduction recipe to use the anonymized names. E.g., if you find
+a bug with `git rev-list v1.0..v2.0` in the private repository, you can
+run:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all --dump-anonymized-refnames=refs.out >stream
+$ grep '^refs/tags/v[12].0' refs.out
+refs/tags/v1.0 refs/tags/ref31
+refs/tags/v2.0 refs/tags/ref50
+---------------------------------------------------
+
+which tells you that `git rev-list ref31..ref50` may produce the same
+bug in the re-imported anonymous repository.
 
 LIMITATIONS
 -----------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162ee..844726d45a 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -24,6 +24,7 @@
 #include "remote.h"
 #include "blob.h"
 #include "commit-slab.h"
+#include "khash.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -45,6 +46,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
+static FILE *anonymized_refnames_handle;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -118,6 +120,23 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+KHASH_INIT(strset, const char *, int, 0, kh_str_hash_func, kh_str_hash_equal);
+
+struct seen_set {
+	kh_strset_t *set;
+};
+
+static int check_and_mark_seen(struct seen_set *seen, const char *str)
+{
+	int hashret;
+	if (!seen->set)
+		seen->set = kh_init_strset();
+	if (kh_get_strset(seen->set, str) < kh_end(seen->set))
+		return 1;
+	kh_put_strset(seen->set, xstrdup(str), &hashret);
+	return 0;
+}
+
 struct anonymized_entry {
 	struct hashmap_entry hash;
 	const char *orig;
@@ -515,6 +534,8 @@ static const char *anonymize_refname(const char *refname)
 	};
 	static struct hashmap refs;
 	static struct strbuf anon = STRBUF_INIT;
+	static struct seen_set seen;
+	const char *full_refname = refname;
 	int i;
 
 	/*
@@ -533,6 +554,12 @@ static const char *anonymize_refname(const char *refname)
 	}
 
 	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
+
+	if (anonymized_refnames_handle &&
+	    !check_and_mark_seen(&seen, full_refname))
+		fprintf(anonymized_refnames_handle, "%s %s\n",
+			full_refname, anon.buf);
+
 	return anon.buf;
 }
 
@@ -1144,6 +1171,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 	char *export_filename = NULL,
 	     *import_filename = NULL,
 	     *import_filename_if_exists = NULL;
+	const char *anonymized_refnames_file = NULL;
 	uint32_t lastimportid;
 	struct string_list refspecs_list = STRING_LIST_INIT_NODUP;
 	struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP;
@@ -1177,6 +1205,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_STRING(0, "dump-anonymized-refnames",
+			   &anonymized_refnames_file, N_("file"),
+			   N_("output anonymized refname mapping to <file>")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
@@ -1213,6 +1244,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		string_list_clear(&refspecs_list, 1);
 	}
 
+	if (anonymized_refnames_file)
+		anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
+
 	if (use_done_feature)
 		printf("feature done\n");
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 897dc50907..75cbc7b329 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -46,6 +46,20 @@ test_expect_success 'stream omits tag message' '
 	! grep "annotated tag" stream
 '
 
+test_expect_success 'refname mapping can be dumped' '
+	git fast-export --anonymize --all \
+		--dump-anonymized-refnames=refs.out >/dev/null &&
+	# we make no guarantees of the exact anonymized names,
+	# so just check that we have the right number and
+	# that a sample line looks sane.
+	expected_count=$(git for-each-ref | wc -l) &&
+	# Note that master is not anonymized, and so not included
+	# in the mapping.
+	expected_count=$((expected_count - 1)) &&
+	test_line_count = $expected_count refs.out &&
+	grep "^refs/heads/other refs/heads/" refs.out
+'
+
 # NOTE: we chdir to the new, anonymized repository
 # after this. All further tests should assume this.
 test_expect_success 'import stream to new repository' '
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 2/4] fast-export: anonymize "master" refname
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
  2020-06-22 21:47   ` [PATCH v2 1/4] fast-export: allow dumping the refname mapping Jeff King
@ 2020-06-22 21:48   ` Jeff King
  2020-06-22 21:48   ` [PATCH v2 3/4] fast-export: refactor path printing to not rely on stdout Jeff King
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Running "fast-export --anonymize" will leave "refs/heads/master"
untouched in the output, for two reasons:

  - it helped to have some known reference point between the original
    and anonymized repository

  - since it's historically the default branch name, it doesn't leak any
    information

Now that we can ask fast-export to dump the anonymized ref mapping, we
have a much better tool for the first one (because it works for _any_
ref, not just master).

For the second, the notion of "default branch name" is likely to become
configurable soon, at which point the name _does_ leak information.
Let's drop this special case in preparation.

Note that we have to adjust the test a bit, since it relied on using the
name "master" in the anonymized repos. But this gives us a good
opportunity to further test the new dumping feature.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  7 -------
 t/t9351-fast-export-anonymize.sh | 15 +++++----------
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 844726d45a..faaab6c7e9 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -538,13 +538,6 @@ static const char *anonymize_refname(const char *refname)
 	const char *full_refname = refname;
 	int i;
 
-	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
-	 */
-	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
-
 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
 		if (skip_prefix(refname, prefixes[i], &refname)) {
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 75cbc7b329..c726306c4d 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -26,11 +26,8 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
-test_expect_success 'stream allows master as refname' '
-	grep master stream
-'
-
-test_expect_success 'stream omits other refnames' '
+test_expect_success 'stream omits refnames' '
+	! grep master stream &&
 	! grep other stream &&
 	! grep mytag stream
 '
@@ -53,9 +50,6 @@ test_expect_success 'refname mapping can be dumped' '
 	# so just check that we have the right number and
 	# that a sample line looks sane.
 	expected_count=$(git for-each-ref | wc -l) &&
-	# Note that master is not anonymized, and so not included
-	# in the mapping.
-	expected_count=$((expected_count - 1)) &&
 	test_line_count = $expected_count refs.out &&
 	grep "^refs/heads/other refs/heads/" refs.out
 '
@@ -71,15 +65,16 @@ test_expect_success 'import stream to new repository' '
 test_expect_success 'result has two branches' '
 	git for-each-ref --format="%(refname)" refs/heads >branches &&
 	test_line_count = 2 branches &&
-	other_branch=$(grep -v refs/heads/master branches)
+	main_branch=$(sed -ne "s,refs/heads/master ,,p" ../refs.out) &&
+	other_branch=$(sed -ne "s,refs/heads/other ,,p" ../refs.out)
 '
 
 test_expect_success 'repo has original shape and timestamps' '
 	shape () {
 		git log --format="%m %ct" --left-right --boundary "$@"
 	} &&
 	(cd .. && shape master...other) >expect &&
-	shape master...$other_branch >actual &&
+	shape $main_branch...$other_branch >actual &&
 	test_cmp expect actual
 '
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 3/4] fast-export: refactor path printing to not rely on stdout
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
  2020-06-22 21:47   ` [PATCH v2 1/4] fast-export: allow dumping the refname mapping Jeff King
  2020-06-22 21:48   ` [PATCH v2 2/4] fast-export: anonymize "master" refname Jeff King
@ 2020-06-22 21:48   ` Jeff King
  2020-06-22 21:48   ` [PATCH v2 4/4] fast-export: allow dumping the path mapping Jeff King
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
  4 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

We'll be using print_path_1() in more places in a subsequent patch, so
let's teach it to take the output handle as a parameter.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index faaab6c7e9..aa7ac9761d 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -369,15 +369,15 @@ static int depth_first(const void *a_, const void *b_)
 	return (a->status == 'R') - (b->status == 'R');
 }
 
-static void print_path_1(const char *path)
+static void print_path_1(FILE *out, const char *path)
 {
 	int need_quote = quote_c_style(path, NULL, NULL, 0);
 	if (need_quote)
-		quote_c_style(path, NULL, stdout, 0);
+		quote_c_style(path, NULL, out, 0);
 	else if (strchr(path, ' '))
-		printf("\"%s\"", path);
+		fprintf(out, "\"%s\"", path);
 	else
-		printf("%s", path);
+		fprintf(out, "%s", path);
 }
 
 static void *anonymize_path_component(const void *path, size_t *len)
@@ -391,13 +391,13 @@ static void *anonymize_path_component(const void *path, size_t *len)
 static void print_path(const char *path)
 {
 	if (!anonymize)
-		print_path_1(path);
+		print_path_1(stdout, path);
 	else {
 		static struct hashmap paths;
 		static struct strbuf anon = STRBUF_INIT;
 
 		anonymize_path(&anon, path, &paths, anonymize_path_component);
-		print_path_1(anon.buf);
+		print_path_1(stdout, anon.buf);
 		strbuf_reset(&anon);
 	}
 }
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 4/4] fast-export: allow dumping the path mapping
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
                     ` (2 preceding siblings ...)
  2020-06-22 21:48   ` [PATCH v2 3/4] fast-export: refactor path printing to not rely on stdout Jeff King
@ 2020-06-22 21:48   ` Jeff King
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
  4 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-22 21:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

When working with an anonymized repo, it can be useful to be able to
refer to particular paths. E.g., reproducing a bug with "git rev-list --
foo.c" in the original repo would need to replace "foo.c" with its
anonymized counterpart to produce the same effect.

We recently taught fast-export to dump the refname mapping. Let's do the
same thing for paths, which can reuse most of the same infrastructure.

We could also just introduce a "dump mapping" file that shows every
mapping we make. But it would be a bit more awkward to work with, as the
user would have to sort through more data to find the parts they're
interested in (and there are likely to be many more paths than refnames,
making it annoying for people who just want to dump the refnames).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 12 ++++++++++++
 builtin/fast-export.c             | 16 ++++++++++++++++
 t/t9351-fast-export-anonymize.sh  | 21 +++++++++++++++++++--
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e809bb3f18..342e34fd89 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -125,6 +125,14 @@ by keeping the marks the same across runs.
 	the output stream, with the original refname, a space, and its
 	anonymized counterpart. See the section on `ANONYMIZING` below.
 
+--dump-anonymized-paths=<file>::
+	Output the mapping of real paths to anonymized paths to <file>.
+	The output will contain one line per path that appears in the
+	output stream, with the original path, a space, and its
+	anonymized counterpart. Paths may be quoted if they contain a
+	space, or unusual characters; see `core.quotePath` in
+	linkgit:git-config(1). See also `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -261,6 +269,10 @@ refs/tags/v2.0 refs/tags/ref50
 which tells you that `git rev-list ref31..ref50` may produce the same
 bug in the re-imported anonymous repository.
 
+Likewise, `--dump-anonymized-paths` may be useful for a bug that
+involves pathspecs. E.g., `git rev-list v1.0..v2.0 -- foo.c` requires
+knowing the path corresponding to `foo.c` in the result.
+
 LIMITATIONS
 -----------
 
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index aa7ac9761d..080ded92e4 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -47,6 +47,7 @@ static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
 static FILE *anonymized_refnames_handle;
+static FILE *anonymized_paths_handle;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -394,9 +395,18 @@ static void print_path(const char *path)
 		print_path_1(stdout, path);
 	else {
 		static struct hashmap paths;
+		static struct seen_set seen;
 		static struct strbuf anon = STRBUF_INIT;
 
 		anonymize_path(&anon, path, &paths, anonymize_path_component);
+		if (anonymized_paths_handle &&
+		    !check_and_mark_seen(&seen, path)) {
+			print_path_1(anonymized_paths_handle, path);
+			fputc(' ', anonymized_paths_handle);
+			print_path_1(anonymized_paths_handle, anon.buf);
+			fputc('\n', anonymized_paths_handle);
+		}
+
 		print_path_1(stdout, anon.buf);
 		strbuf_reset(&anon);
 	}
@@ -1165,6 +1175,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 	     *import_filename = NULL,
 	     *import_filename_if_exists = NULL;
 	const char *anonymized_refnames_file = NULL;
+	const char *anonymized_paths_file = NULL;
 	uint32_t lastimportid;
 	struct string_list refspecs_list = STRING_LIST_INIT_NODUP;
 	struct string_list paths_of_changed_objects = STRING_LIST_INIT_DUP;
@@ -1201,6 +1212,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING(0, "dump-anonymized-refnames",
 			   &anonymized_refnames_file, N_("file"),
 			   N_("output anonymized refname mapping to <file>")),
+		OPT_STRING(0, "dump-anonymized-paths",
+			   &anonymized_paths_file, N_("file"),
+			   N_("output anonymized path mapping to <file>")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
@@ -1239,6 +1253,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 
 	if (anonymized_refnames_file)
 		anonymized_refnames_handle = xfopen(anonymized_refnames_file, "w");
+	if (anonymized_paths_file)
+		anonymized_paths_handle = xfopen(anonymized_paths_file, "w");
 
 	if (use_done_feature)
 		printf("feature done\n");
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index c726306c4d..ef72b1ec95 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -9,7 +9,7 @@ test_expect_success 'setup simple repo' '
 	git checkout -b other HEAD^ &&
 	mkdir subdir &&
 	test_commit subdir/bar &&
-	test_commit subdir/xyzzy &&
+	test_commit quoting "subdir/this needs quoting" &&
 	git tag -m "annotated tag" mytag
 '
 
@@ -23,7 +23,7 @@ test_expect_success 'stream omits path names' '
 	! grep foo stream &&
 	! grep subdir stream &&
 	! grep bar stream &&
-	! grep xyzzy stream
+	! grep quoting stream
 '
 
 test_expect_success 'stream omits refnames' '
@@ -54,6 +54,23 @@ test_expect_success 'refname mapping can be dumped' '
 	grep "^refs/heads/other refs/heads/" refs.out
 '
 
+test_expect_success 'path mapping can be dumped' '
+	git fast-export --anonymize --all \
+		--dump-anonymized-paths=paths.out >/dev/null &&
+	# as above, avoid depending on the exact scheme, but
+	# but check that we have the right number of mappings,
+	# and spot-check one sample.
+	expected_count=$(
+		git rev-list --objects --all |
+		git cat-file --batch-check="%(objecttype) %(rest)" |
+		sed -ne "s/^blob //p" |
+		sort -u |
+		wc -l
+	) &&
+	test_line_count = $expected_count paths.out &&
+	grep "^\"subdir/this needs quoting\" " paths.out
+'
+
 # NOTE: we chdir to the new, anonymized repository
 # after this. All further tests should assume this.
 test_expect_success 'import stream to new repository' '
-- 
2.27.0.517.gbc32778fa3

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [alternative 0/10] fast-export: allow seeding the anonymized mapping
  2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
                     ` (3 preceding siblings ...)
  2020-06-22 21:48   ` [PATCH v2 4/4] fast-export: allow dumping the path mapping Jeff King
@ 2020-06-23 15:24   ` Jeff King
  2020-06-23 15:24     ` [PATCH 01/10] t9351: derive anonymized tree checks from original repo Jeff King
                       ` (11 more replies)
  4 siblings, 12 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Mon, Jun 22, 2020 at 05:47:46PM -0400, Jeff King wrote:

> Here's a v2 which I think addresses all of the comments. I have to admit
> that after writing my last email to Junio, I am wondering whether it
> would be sufficient and simpler to let the user specify a static mapping
> of tokens (that could just be applied anywhere).
> 
> I'll take a look at that, but since I worked up this version, here it is
> in the meantime.

So here's that alternative. I think the result is actually a bit nicer
to work with, _and_ it's a lot less code. Don't let the number of
patches fool you. Most of it is cleanup that would be worth doing even
without the final patches.

Both of these techniques _could_ live side-by-side within fast-export,
as they have slightly different strengths and weaknesses. But I'd prefer
to just go with one (this one) in the name of simplicity, and I strongly
suspect nobody will ever ask for the other.

  [01/10]: t9351: derive anonymized tree checks from original repo
  [02/10]: fast-export: use xmemdupz() for anonymizing oids

    These first two are actually a bug-fix that I noticed while writing
    it.

  [03/10]: fast-export: store anonymized oids as hex strings
  [04/10]: fast-export: tighten anonymize_mem() interface to handle only strings
  [05/10]: fast-export: stop storing lengths in anonymized hashmaps
  [06/10]: fast-export: use a flex array to store anonymized entries
  [07/10]: fast-export: move global "idents" anonymize hashmap into function
  [08/10]: fast-export: add a "data" callback parameter to anonymize_str()

    This is all cleanup and prep. More than is strictly necessary for
    this series, but it does simplify things and reduce the memory
    footprint (only a few megabytes in git.git, but more in larger
    repos).

  [09/10]: fast-export: allow seeding the anonymized mapping

    And then this is the actual feature...

  [10/10]: fast-export: anonymize "master" refname

    ...which finally lets us drop the special name rule.

 Documentation/git-fast-export.txt |  24 +++++
 builtin/fast-export.c             | 161 +++++++++++++++++++-----------
 t/t9351-fast-export-anonymize.sh  |  54 +++++++---
 3 files changed, 167 insertions(+), 72 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 01/10] t9351: derive anonymized tree checks from original repo
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-23 15:24     ` [PATCH 02/10] fast-export: use xmemdupz() for anonymizing oids Jeff King
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Our tests of the anonymized repo just hard-code the expected set of
objects in the root and subdirectory trees. This makes them brittle to
the test setup changing (e.g., adding new paths that need tested).

Let's look at the original repo to compute our expected set of objects.
Note that this isn't completely perfect (e.g., we still rely on there
being only one tree in the root), but it does simplify later patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t9351-fast-export-anonymize.sh | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 897dc50907..e772cf9930 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -71,22 +71,18 @@ test_expect_success 'repo has original shape and timestamps' '
 
 test_expect_success 'root tree has original shape' '
 	# the output entries are not necessarily in the same
-	# order, but we know at least that we will have one tree
-	# and one blob, so just check the sorted order
-	cat >expect <<-\EOF &&
-	blob
-	tree
-	EOF
+	# order, but we should at least have the same set of
+	# object types.
+	git -C .. ls-tree HEAD >orig-root &&
+	cut -d" " -f2 <orig-root | sort >expect &&
 	git ls-tree $other_branch >root &&
 	cut -d" " -f2 <root | sort >actual &&
 	test_cmp expect actual
 '
 
 test_expect_success 'paths in subdir ended up in one tree' '
-	cat >expect <<-\EOF &&
-	blob
-	blob
-	EOF
+	git -C .. ls-tree other:subdir >orig-subdir &&
+	cut -d" " -f2 <orig-subdir | sort >expect &&
 	tree=$(grep tree root | cut -f2) &&
 	git ls-tree $other_branch:$tree >tree &&
 	cut -d" " -f2 <tree >actual &&
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 02/10] fast-export: use xmemdupz() for anonymizing oids
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
  2020-06-23 15:24     ` [PATCH 01/10] t9351: derive anonymized tree checks from original repo Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-23 15:24     ` [PATCH 03/10] fast-export: store anonymized oids as hex strings Jeff King
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Our anonymize_mem() function is careful to take a ptr/len pair to allow
storing binary tokens like object ids, as well as partial strings (e.g.,
just "foo" of "foo/bar"). But it duplicates the hash key using
xstrdup()! That means that:

  - for a partial string, we'd store all bytes up to the NUL, even
    though we'd never look at anything past "len". This didn't produce
    wrong behavior, but was wasteful.

  - for a binary oid that doesn't contain a zero byte, we'd copy garbage
    bytes off the end of the array (though as long as nothing complained
    about reading uninitialized bytes, further reads would be limited by
    "len", and we'd produce the correct results)

  - for a binary oid that does contain a zero byte, we'd copy _fewer_
    bytes than intended into the hashmap struct. When we later try to
    look up a value, we'd access uninitialized memory and potentially
    falsely claim that a particular oid is not present.

The most common reason to store an oid is an anonymized gitlink, but our
test case doesn't have any gitlinks at all. So let's add one whose oid
contains a NUL and is present at two different paths. ASan catches the
memory error, but even without it we can detect the bug because the oid
is not anonymized the same way for both paths.

And of course the fix is to copy the correct number of bytes. We don't
technically need the appended NUL from xmemdupz(), but it doesn't hurt
as an extra protection against anybody treating it like a string (plus a
future patch will push us more in that direction).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  2 +-
 t/t9351-fast-export-anonymize.sh | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162ee..289395a131 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -162,7 +162,7 @@ static const void *anonymize_mem(struct hashmap *map,
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xstrdup(orig);
+		ret->orig = xmemdupz(orig, *len);
 		ret->orig_len = *len;
 		ret->anon = generate(orig, len);
 		ret->anon_len = *len;
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index e772cf9930..dc5d75cd19 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -10,6 +10,10 @@ test_expect_success 'setup simple repo' '
 	mkdir subdir &&
 	test_commit subdir/bar &&
 	test_commit subdir/xyzzy &&
+	fake_commit=$(echo $ZERO_OID | sed s/0/a/) &&
+	git update-index --add --cacheinfo 160000,$fake_commit,link1 &&
+	git update-index --add --cacheinfo 160000,$fake_commit,link2 &&
+	git commit -m "add gitlink" &&
 	git tag -m "annotated tag" mytag
 '
 
@@ -26,6 +30,12 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
+test_expect_success 'stream omits gitlink oids' '
+	# avoid relying on the whole oid to remain hash-agnostic; this is
+	# plenty to be unique within our test case
+	! grep a000000000000000000 stream
+'
+
 test_expect_success 'stream allows master as refname' '
 	grep master stream
 '
@@ -89,6 +99,11 @@ test_expect_success 'paths in subdir ended up in one tree' '
 	test_cmp expect actual
 '
 
+test_expect_success 'identical gitlinks got identical oid' '
+	awk "/commit/ { print \$3 }" <root | sort -u >commits &&
+	test_line_count = 1 commits
+'
+
 test_expect_success 'tag points to branch tip' '
 	git rev-parse $other_branch >expect &&
 	git for-each-ref --format="%(*objectname)" | grep . >actual &&
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
  2020-06-23 15:24     ` [PATCH 01/10] t9351: derive anonymized tree checks from original repo Jeff King
  2020-06-23 15:24     ` [PATCH 02/10] fast-export: use xmemdupz() for anonymizing oids Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-24 11:43       ` SZEDER Gábor
  2020-06-23 15:24     ` [PATCH 04/10] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
                       ` (8 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

When fast-export stores anonymized oids, it does so as binary strings.
And while the anonymous mapping storage is binary-clean (at least as of
the previous commit), this will become awkward when we start exposing
more of it to the user. In particular, if we allow a method for
retaining token "foo", then users may want to specify a hex oid as such
a token.

Let's just switch to storing the hex strings. The difference in memory
usage is negligible (especially considering how infrequently we'd
generally store an oid compared to, say, path components).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 289395a131..4a3a4c933e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -387,16 +387,19 @@ static void *generate_fake_oid(const void *old, size_t *len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
-	unsigned char *out = xcalloc(hashsz, 1);
-	put_be32(out + hashsz - 4, counter++);
-	return out;
+	struct object_id oid;
+	char *hex = xmallocz(GIT_MAX_HEXSZ);
+
+	oidclr(&oid);
+	put_be32(oid.hash + hashsz - 4, counter++);
+	return oid_to_hex_r(hex, &oid);
 }
 
-static const struct object_id *anonymize_oid(const struct object_id *oid)
+static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
-	size_t len = the_hash_algo->rawsz;
-	return anonymize_mem(&objs, generate_fake_oid, oid, &len);
+	size_t len = strlen(oid_hex);
+	return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -455,9 +458,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
-				       oid_to_hex(anonymize ?
-						  anonymize_oid(&spec->oid) :
-						  &spec->oid));
+				       anonymize ?
+				       anonymize_oid(oid_to_hex(&spec->oid)) :
+				       oid_to_hex(&spec->oid));
 			else {
 				struct object *object = lookup_object(the_repository,
 								      &spec->oid);
@@ -712,9 +715,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		if (mark)
 			printf(":%d\n", mark);
 		else
-			printf("%s\n", oid_to_hex(anonymize ?
-						  anonymize_oid(&obj->oid) :
-						  &obj->oid));
+			printf("%s\n",
+			       anonymize ?
+			       anonymize_oid(oid_to_hex(&obj->oid)) :
+			       oid_to_hex(&obj->oid));
 		i++;
 	}
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 04/10] fast-export: tighten anonymize_mem() interface to handle only strings
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (2 preceding siblings ...)
  2020-06-23 15:24     ` [PATCH 03/10] fast-export: store anonymized oids as hex strings Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-23 15:24     ` [PATCH 05/10] fast-export: stop storing lengths in anonymized hashmaps Jeff King
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

While the anonymize_mem() interface _can_ store arbitrary byte
sequences, none of the callers uses this feature (as of the previous
commit). We'd like to keep it that way, as we'll be exposing the
string-like nature of the anonymization routines to the user. So let's
tighten up the interface a bit:

  - don't treat "len" as an out-parameter from anonymize_mem(); this
    ensures callers treat the pointer result as a NUL-terminated string

  - likewise, don't treat "len" as an out-parameter from generator
    functions

  - swap out "void *" for "char *" as appropriate to signal that we
    don't handle arbitrary memory

  - rename the function to anonymize_str()

This will also open up some optimization opportunities in a future
patch.

Note that we can't drop the "len" parameter entirely. Some callers do
pass in partial strings (e.g., "foo/bar", len=3) to avoid copying, and
we need to handle those still.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 53 +++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 4a3a4c933e..d8ea067630 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -145,31 +145,30 @@ static int anonymized_entry_cmp(const void *unused_cmp_data,
  * the same anonymized string with another. The actual generation
  * is farmed out to the generate function.
  */
-static const void *anonymize_mem(struct hashmap *map,
-				 void *(*generate)(const void *, size_t *),
-				 const void *orig, size_t *len)
+static const char *anonymize_str(struct hashmap *map,
+				 char *(*generate)(const char *, size_t),
+				 const char *orig, size_t len)
 {
 	struct anonymized_entry key, *ret;
 
 	if (!map->cmpfn)
 		hashmap_init(map, anonymized_entry_cmp, NULL, 0);
 
-	hashmap_entry_init(&key.hash, memhash(orig, *len));
+	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
-	key.orig_len = *len;
+	key.orig_len = len;
 	ret = hashmap_get_entry(map, &key, hash, NULL);
 
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xmemdupz(orig, *len);
-		ret->orig_len = *len;
+		ret->orig = xmemdupz(orig, len);
+		ret->orig_len = len;
 		ret->anon = generate(orig, len);
-		ret->anon_len = *len;
+		ret->anon_len = strlen(ret->anon);
 		hashmap_put(map, &ret->hash);
 	}
 
-	*len = ret->anon_len;
 	return ret->anon;
 }
 
@@ -181,13 +180,13 @@ static const void *anonymize_mem(struct hashmap *map,
  */
 static void anonymize_path(struct strbuf *out, const char *path,
 			   struct hashmap *map,
-			   void *(*generate)(const void *, size_t *))
+			   char *(*generate)(const char *, size_t))
 {
 	while (*path) {
 		const char *end_of_component = strchrnul(path, '/');
 		size_t len = end_of_component - path;
-		const char *c = anonymize_mem(map, generate, path, &len);
-		strbuf_add(out, c, len);
+		const char *c = anonymize_str(map, generate, path, len);
+		strbuf_addstr(out, c);
 		path = end_of_component;
 		if (*path)
 			strbuf_addch(out, *path++);
@@ -361,12 +360,12 @@ static void print_path_1(const char *path)
 		printf("%s", path);
 }
 
-static void *anonymize_path_component(const void *path, size_t *len)
+static char *anonymize_path_component(const char *path, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "path%d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static void print_path(const char *path)
@@ -383,7 +382,7 @@ static void print_path(const char *path)
 	}
 }
 
-static void *generate_fake_oid(const void *old, size_t *len)
+static char *generate_fake_oid(const char *old, size_t len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
@@ -399,7 +398,7 @@ static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
 	size_t len = strlen(oid_hex);
-	return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len);
+	return anonymize_str(&objs, generate_fake_oid, oid_hex, len);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -496,12 +495,12 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
-static void *anonymize_ref_component(const void *old, size_t *len)
+static char *anonymize_ref_component(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "ref%d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static const char *anonymize_refname(const char *refname)
@@ -550,13 +549,13 @@ static char *anonymize_commit_message(const char *old)
 }
 
 static struct hashmap idents;
-static void *anonymize_ident(const void *old, size_t *len)
+static char *anonymize_ident(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
 	counter++;
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 /*
@@ -591,9 +590,9 @@ static void anonymize_ident_line(const char **beg, const char **end)
 		size_t len;
 
 		len = split.mail_end - split.name_begin;
-		ident = anonymize_mem(&idents, anonymize_ident,
-				      split.name_begin, &len);
-		strbuf_add(out, ident, len);
+		ident = anonymize_str(&idents, anonymize_ident,
+				      split.name_begin, len);
+		strbuf_addstr(out, ident);
 		strbuf_addch(out, ' ');
 		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
 	} else {
@@ -733,12 +732,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	show_progress();
 }
 
-static void *anonymize_tag(const void *old, size_t *len)
+static char *anonymize_tag(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "tag message %d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static void handle_tail(struct object_array *commits, struct rev_info *revs,
@@ -808,8 +807,8 @@ static void handle_tag(const char *name, struct tag *tag)
 		name = anonymize_refname(name);
 		if (message) {
 			static struct hashmap tags;
-			message = anonymize_mem(&tags, anonymize_tag,
-						message, &message_size);
+			message = anonymize_str(&tags, anonymize_tag,
+						message, message_size);
 		}
 	}
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 05/10] fast-export: stop storing lengths in anonymized hashmaps
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (3 preceding siblings ...)
  2020-06-23 15:24     ` [PATCH 04/10] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-23 15:24     ` [PATCH 06/10] fast-export: use a flex array to store anonymized entries Jeff King
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Now that the anonymize_str() interface is restricted to NUL-terminated
strings, there's no need for us to keep track of the length of each
entry in the hashmap. This simplifies the code and saves a bit of
memory.

Note that we do still need to compare the stored results to partial
strings passed in by the callers. We can do that by using hashmap's
keydata feature to get the ptr/len pair into the comparison function,
and then using strncmp().

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index d8ea067630..5df2ada47d 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -121,23 +121,32 @@ static int has_unshown_parent(struct commit *commit)
 struct anonymized_entry {
 	struct hashmap_entry hash;
 	const char *orig;
-	size_t orig_len;
 	const char *anon;
-	size_t anon_len;
+};
+
+struct anonymized_entry_key {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
 };
 
 static int anonymized_entry_cmp(const void *unused_cmp_data,
 				const struct hashmap_entry *eptr,
 				const struct hashmap_entry *entry_or_key,
-				const void *unused_keydata)
+				const void *keydata)
 {
 	const struct anonymized_entry *a, *b;
 
 	a = container_of(eptr, const struct anonymized_entry, hash);
-	b = container_of(entry_or_key, const struct anonymized_entry, hash);
+	if (keydata) {
+		const struct anonymized_entry_key *key = keydata;
+		int equal = !strncmp(a->orig, key->orig, key->orig_len) &&
+			    !a->orig[key->orig_len];
+		return !equal;
+	}
 
-	return a->orig_len != b->orig_len ||
-		memcmp(a->orig, b->orig, a->orig_len);
+	b = container_of(entry_or_key, const struct anonymized_entry, hash);
+	return strcmp(a->orig, b->orig);
 }
 
 /*
@@ -149,23 +158,22 @@ static const char *anonymize_str(struct hashmap *map,
 				 char *(*generate)(const char *, size_t),
 				 const char *orig, size_t len)
 {
-	struct anonymized_entry key, *ret;
+	struct anonymized_entry_key key;
+	struct anonymized_entry *ret;
 
 	if (!map->cmpfn)
 		hashmap_init(map, anonymized_entry_cmp, NULL, 0);
 
 	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
 	key.orig_len = len;
-	ret = hashmap_get_entry(map, &key, hash, NULL);
+	ret = hashmap_get_entry(map, &key, hash, &key);
 
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
 		ret->orig = xmemdupz(orig, len);
-		ret->orig_len = len;
 		ret->anon = generate(orig, len);
-		ret->anon_len = strlen(ret->anon);
 		hashmap_put(map, &ret->hash);
 	}
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 06/10] fast-export: use a flex array to store anonymized entries
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (4 preceding siblings ...)
  2020-06-23 15:24     ` [PATCH 05/10] fast-export: stop storing lengths in anonymized hashmaps Jeff King
@ 2020-06-23 15:24     ` Jeff King
  2020-06-23 15:25     ` [PATCH 07/10] fast-export: move global "idents" anonymize hashmap into function Jeff King
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:24 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Now that we're using a separate keydata struct for hash lookups, we have
more flexibility in how we allocate anonymized_entry structs. Let's push
the "orig" key into a flex member within the struct. That should save us
a few bytes of memory per entry (a pointer plus any malloc overhead),
and may make lookups a little faster (since it's one less pointer to
chase in the comparison function).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 5df2ada47d..99d4a2b404 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -120,8 +120,8 @@ static int has_unshown_parent(struct commit *commit)
 
 struct anonymized_entry {
 	struct hashmap_entry hash;
-	const char *orig;
 	const char *anon;
+	const char orig[FLEX_ARRAY];
 };
 
 struct anonymized_entry_key {
@@ -170,9 +170,8 @@ static const char *anonymize_str(struct hashmap *map,
 	ret = hashmap_get_entry(map, &key, hash, &key);
 
 	if (!ret) {
-		ret = xmalloc(sizeof(*ret));
+		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xmemdupz(orig, len);
 		ret->anon = generate(orig, len);
 		hashmap_put(map, &ret->hash);
 	}
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 07/10] fast-export: move global "idents" anonymize hashmap into function
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (5 preceding siblings ...)
  2020-06-23 15:24     ` [PATCH 06/10] fast-export: use a flex array to store anonymized entries Jeff King
@ 2020-06-23 15:25     ` Jeff King
  2020-06-23 15:25     ` [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
                       ` (4 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:25 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

All of the other anonymization functions keep their static mappings
inside the function to avoid polluting the global namespace. Let's do
the same for "idents", as nobody needs it outside of
anonymize_ident_line().

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 99d4a2b404..16a1563e49 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -555,7 +555,6 @@ static char *anonymize_commit_message(const char *old)
 	return xstrfmt("subject %d\n\nbody\n", counter++);
 }
 
-static struct hashmap idents;
 static char *anonymize_ident(const char *old, size_t len)
 {
 	static int counter;
@@ -572,6 +571,7 @@ static char *anonymize_ident(const char *old, size_t len)
  */
 static void anonymize_ident_line(const char **beg, const char **end)
 {
+	static struct hashmap idents;
 	static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
 	static unsigned which_buffer;
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str()
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (6 preceding siblings ...)
  2020-06-23 15:25     ` [PATCH 07/10] fast-export: move global "idents" anonymize hashmap into function Jeff King
@ 2020-06-23 15:25     ` Jeff King
  2020-06-24 19:58       ` Junio C Hamano
  2020-06-23 15:25     ` [PATCH 09/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (3 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:25 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

The anonymize_str() function takes a generator callback, but there's no
way to pass extra context to it. Let's add the usual "void *data"
parameter to the generator interface and pass it along.

This is mildly annoying for existing callers, all of which pass NULL,
but is necessary to avoid extra globals in some cases we'll add in a
subsequent patch.

While we're touching each of these callbacks, we can further observe
that none of them use the existing orig/len parameters at all. This
makes sense, since the point is for their output to have no discernable
basis in the original (my original version had some notion that we might
use a one-way function to obfuscate the names, but it was never
implemented). So let's drop those extra parameters. If a caller really
wants to do something with them, it can pass a struct through the new
data parameter.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 16a1563e49..1cbca5b4b4 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -155,8 +155,9 @@ static int anonymized_entry_cmp(const void *unused_cmp_data,
  * is farmed out to the generate function.
  */
 static const char *anonymize_str(struct hashmap *map,
-				 char *(*generate)(const char *, size_t),
-				 const char *orig, size_t len)
+				 char *(*generate)(void *),
+				 const char *orig, size_t len,
+				 void *data)
 {
 	struct anonymized_entry_key key;
 	struct anonymized_entry *ret;
@@ -172,7 +173,7 @@ static const char *anonymize_str(struct hashmap *map,
 	if (!ret) {
 		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->anon = generate(orig, len);
+		ret->anon = generate(data);
 		hashmap_put(map, &ret->hash);
 	}
 
@@ -187,12 +188,12 @@ static const char *anonymize_str(struct hashmap *map,
  */
 static void anonymize_path(struct strbuf *out, const char *path,
 			   struct hashmap *map,
-			   char *(*generate)(const char *, size_t))
+			   char *(*generate)(void *))
 {
 	while (*path) {
 		const char *end_of_component = strchrnul(path, '/');
 		size_t len = end_of_component - path;
-		const char *c = anonymize_str(map, generate, path, len);
+		const char *c = anonymize_str(map, generate, path, len, NULL);
 		strbuf_addstr(out, c);
 		path = end_of_component;
 		if (*path)
@@ -367,7 +368,7 @@ static void print_path_1(const char *path)
 		printf("%s", path);
 }
 
-static char *anonymize_path_component(const char *path, size_t len)
+static char *anonymize_path_component(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -389,7 +390,7 @@ static void print_path(const char *path)
 	}
 }
 
-static char *generate_fake_oid(const char *old, size_t len)
+static char *generate_fake_oid(void *data)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
@@ -405,7 +406,7 @@ static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
 	size_t len = strlen(oid_hex);
-	return anonymize_str(&objs, generate_fake_oid, oid_hex, len);
+	return anonymize_str(&objs, generate_fake_oid, oid_hex, len, NULL);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -502,7 +503,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
-static char *anonymize_ref_component(const char *old, size_t len)
+static char *anonymize_ref_component(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -555,7 +556,7 @@ static char *anonymize_commit_message(const char *old)
 	return xstrfmt("subject %d\n\nbody\n", counter++);
 }
 
-static char *anonymize_ident(const char *old, size_t len)
+static char *anonymize_ident(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -598,7 +599,7 @@ static void anonymize_ident_line(const char **beg, const char **end)
 
 		len = split.mail_end - split.name_begin;
 		ident = anonymize_str(&idents, anonymize_ident,
-				      split.name_begin, len);
+				      split.name_begin, len, NULL);
 		strbuf_addstr(out, ident);
 		strbuf_addch(out, ' ');
 		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
@@ -739,7 +740,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	show_progress();
 }
 
-static char *anonymize_tag(const char *old, size_t len)
+static char *anonymize_tag(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -815,7 +816,7 @@ static void handle_tag(const char *name, struct tag *tag)
 		if (message) {
 			static struct hashmap tags;
 			message = anonymize_str(&tags, anonymize_tag,
-						message, message_size);
+						message, message_size, NULL);
 		}
 	}
 
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (7 preceding siblings ...)
  2020-06-23 15:25     ` [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
@ 2020-06-23 15:25     ` Jeff King
  2020-06-23 17:16       ` Eric Sunshine
  2020-06-23 18:11       ` Eric Sunshine
  2020-06-23 15:25     ` [PATCH 10/10] fast-export: anonymize "master" refname Jeff King
                       ` (2 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:25 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

After you anonymize a repository, it can be hard to find which commits
correspond between the original and the result, and thus hard to
reproduce commands that triggered bugs in the original.

Let's make it possible to seed the anonymization map. This lets users
either:

  - mark names to be retained as-is, if they don't consider them secret
    (in which case their original commands would just work)

  - map names to new values, which lets them adapt the reproduction
    recipe to the new names without revealing the originals

The implementation is fairly straight-forward. We already store each
anonymized token in a hashmap (so that the same token appearing twice is
converted to the same result). We can just introduce a new "seed"
hashmap which is consulted first.

This does make a few more promises to the user about how we'll anonymize
things (e.g., token-splitting pathnames). But it's unlikely that we'd
want to change those rules, even if the actual anonymization of a single
token changes. And it makes things much easier for the user, who can
unblind only a directory name without having to specify each path within
it.

One alternative to this approach would be to anonymize as we see fit,
and then dump the whole refname and pathname mappings to a file. This
does work, but it's a bit awkward to use (you have to manually dig the
items you care about out of the mapping).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 24 ++++++++++++++++
 builtin/fast-export.c             | 47 ++++++++++++++++++++++++++++++-
 t/t9351-fast-export-anonymize.sh  | 11 +++++++-
 3 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e8950de3ba..2d7b62e835 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,6 +119,11 @@ by keeping the marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--seed-anonymized=<from>[:<to>]::
+	Convert token `<from>` to `<to>` in the anonymized output. If
+	`<to>` is omitted, map `<from>` to itself (i.e., do not
+	anonymize it). See the section on `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -238,6 +243,25 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
 smaller output, and it is usually easy to quickly confirm that there is
 no private data in the stream.
 
+Reproducing some bugs may require referencing particular commits or
+paths, which becomes challenging after refnames and paths have been
+anonymized. You can ask for a particular token to be left as-is or
+mapped to a new value. For example, if you have a bug which reproduces
+with `git rev-list mybranch -- foo.c`, you can run:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all \
+      --seed-anonymized=foo.c:secret.c \
+      --seed-anonymized=mybranch \
+      >stream
+---------------------------------------------------
+
+After importing the stream, you can then run `git rev-list mybranch --
+secret.c` in the anonymized repository.
+
+Note that paths and refnames are split into tokens at slash boundaries.
+The command above would anonymize `subdir/foo.c` as something like
+`path123/secret.c`.
 
 LIMITATIONS
 -----------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 1cbca5b4b4..ef82497bbf 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -45,6 +45,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
+static struct hashmap anonymized_seeds;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -168,8 +169,18 @@ static const char *anonymize_str(struct hashmap *map,
 	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
 	key.orig_len = len;
-	ret = hashmap_get_entry(map, &key, hash, &key);
 
+	/* First check if it's a token the user configured manually... */
+	if (anonymized_seeds.cmpfn)
+		ret = hashmap_get_entry(&anonymized_seeds, &key, hash, &key);
+	else
+		ret = NULL;
+
+	/* ...otherwise check if we've already seen it in this context... */
+	if (!ret)
+		ret = hashmap_get_entry(map, &key, hash, &key);
+
+	/* ...and finally generate a new mapping if necessary */
 	if (!ret) {
 		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
@@ -1147,6 +1158,37 @@ static void handle_deletes(void)
 	}
 }
 
+static char *anonymize_seed(void *data)
+{
+	return xstrdup(data);
+}
+
+static int parse_opt_seed_anonymized(const struct option *opt,
+				     const char *arg, int unset)
+{
+	struct hashmap *map = opt->value;
+	const char *delim, *value;
+	size_t keylen;
+
+	BUG_ON_OPT_NEG(unset);
+
+	delim = strchr(arg, ':');
+	if (delim) {
+		keylen = delim - arg;
+		value = delim + 1;
+	} else {
+		keylen = strlen(arg);
+		value = arg;
+	}
+
+	if (!keylen || !*value)
+		return error(_("--seed-anonymized token cannot be empty"));
+
+	anonymize_str(map, anonymize_seed, arg, keylen, (void *)value);
+
+	return 0;
+}
+
 int cmd_fast_export(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
@@ -1188,6 +1230,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_CALLBACK_F(0, "seed-anonymized", &anonymized_seeds, N_("from:to"),
+			       N_("convert <from> to <to> in anonymized output"),
+			       PARSE_OPT_NONEG, parse_opt_seed_anonymized),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index dc5d75cd19..d84eec9bab 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -6,6 +6,7 @@ test_description='basic tests for fast-export --anonymize'
 test_expect_success 'setup simple repo' '
 	test_commit base &&
 	test_commit foo &&
+	test_commit retain-me &&
 	git checkout -b other HEAD^ &&
 	mkdir subdir &&
 	test_commit subdir/bar &&
@@ -18,7 +19,10 @@ test_expect_success 'setup simple repo' '
 '
 
 test_expect_success 'export anonymized stream' '
-	git fast-export --anonymize --all >stream
+	git fast-export --anonymize --all \
+		--seed-anonymized=retain-me \
+		--seed-anonymized=xyzzy:custom-name \
+		>stream
 '
 
 # this also covers commit messages
@@ -30,6 +34,11 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
+test_expect_success 'stream contains user-specified names' '
+	grep retain-me stream &&
+	grep custom-name stream
+'
+
 test_expect_success 'stream omits gitlink oids' '
 	# avoid relying on the whole oid to remain hash-agnostic; this is
 	# plenty to be unique within our test case
-- 
2.27.0.517.gbc32778fa3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 10/10] fast-export: anonymize "master" refname
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (8 preceding siblings ...)
  2020-06-23 15:25     ` [PATCH 09/10] fast-export: allow seeding the anonymized mapping Jeff King
@ 2020-06-23 15:25     ` Jeff King
  2020-06-23 19:34     ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Junio C Hamano
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 15:25 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin

Running "fast-export --anonymize" will leave "refs/heads/master"
untouched in the output, for two reasons:

  - it helped to have some known reference point between the original
    and anonymized repository

  - since it's historically the default branch name, it doesn't leak any
    information

Now that we can ask fast-export to retain particular tokens, we have a
much better tool for the first one (because it works for any ref, not
just master).

For the second, the notion of "default branch name" is likely to become
configurable soon, at which point the name _does_ leak information.
Let's drop this special case in preparation.

Note that we have to adjust the test a bit, since it relied on using the
name "master" in the anonymized repos. We could just use
--seed-anonymized=master to keep the same output, but then we wouldn't
know if it works because of our hard-coded master or because of the
seeding.

So let's flip the test a bit, and confirm that we anonymize "master",
but keep "other" in the output.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  7 -------
 t/t9351-fast-export-anonymize.sh | 12 +++++++-----
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index ef82497bbf..7e0e1770cf 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -538,13 +538,6 @@ static const char *anonymize_refname(const char *refname)
 	static struct strbuf anon = STRBUF_INIT;
 	int i;
 
-	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
-	 */
-	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
-
 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
 		if (skip_prefix(refname, prefixes[i], &refname)) {
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index d84eec9bab..6e2041865c 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -22,6 +22,7 @@ test_expect_success 'export anonymized stream' '
 	git fast-export --anonymize --all \
 		--seed-anonymized=retain-me \
 		--seed-anonymized=xyzzy:custom-name \
+		--seed-anonymized=other \
 		>stream
 '
 
@@ -45,12 +46,12 @@ test_expect_success 'stream omits gitlink oids' '
 	! grep a000000000000000000 stream
 '
 
-test_expect_success 'stream allows master as refname' '
-	grep master stream
+test_expect_success 'stream retains other as refname' '
+	grep other stream
 '
 
 test_expect_success 'stream omits other refnames' '
-	! grep other stream &&
+	! grep master stream &&
 	! grep mytag stream
 '
 
@@ -76,15 +77,16 @@ test_expect_success 'import stream to new repository' '
 test_expect_success 'result has two branches' '
 	git for-each-ref --format="%(refname)" refs/heads >branches &&
 	test_line_count = 2 branches &&
-	other_branch=$(grep -v refs/heads/master branches)
+	other_branch=refs/heads/other &&
+	main_branch=$(grep -v $other_branch branches)
 '
 
 test_expect_success 'repo has original shape and timestamps' '
 	shape () {
 		git log --format="%m %ct" --left-right --boundary "$@"
 	} &&
 	(cd .. && shape master...other) >expect &&
-	shape master...$other_branch >actual &&
+	shape $main_branch...$other_branch >actual &&
 	test_cmp expect actual
 '
 
-- 
2.27.0.517.gbc32778fa3

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 15:25     ` [PATCH 09/10] fast-export: allow seeding the anonymized mapping Jeff King
@ 2020-06-23 17:16       ` Eric Sunshine
  2020-06-23 18:30         ` Jeff King
  2020-06-23 18:11       ` Eric Sunshine
  1 sibling, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-23 17:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 11:25 AM Jeff King <peff@peff.net> wrote:
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> @@ -238,6 +243,25 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
> +[...] For example, if you have a bug which reproduces
> +with `git rev-list mybranch -- foo.c`, you can run:
> +
> +---------------------------------------------------
> +$ git fast-export --anonymize --all \
> +   --seed-anonymized=foo.c:secret.c \
> +   --seed-anonymized=mybranch \
> +   >stream
> +---------------------------------------------------
> +
> +After importing the stream, you can then run `git rev-list mybranch --
> +secret.c` in the anonymized repository.

I understand that your intention here is to demonstrate both forms of
--seed-anonymized, but I'm slightly concerned that people may
interpret this example as meaning that you are not allowed to
anonymize the refname when anonymizing a pathname. It might be less
ambiguous to avoid the "short form" in the example; people who have
read the description of --seed-anonymized will know that the short
form can be used without having to see it in an example.

> +Note that paths and refnames are split into tokens at slash boundaries.
> +The command above would anonymize `subdir/foo.c` as something like
> +`path123/secret.c`.

Confusing. This seems to be saying that anonymizing filenames in
subdirectories is pointless because you can't know how the leading
directory names will be anonymized. That leaves the reader wondering
how to deal with the situation. Does it require using
--seed-anonymized for each path component leading up to the filename?
Or can --seed-anonymized take an full pathname (leading directory
components and filename) in one shot?

> @@ -168,8 +169,18 @@ static const char *anonymize_str(struct hashmap *map,
> -    ret = hashmap_get_entry(map, &key, hash, &key);
>
> +    /* First check if it's a token the user configured manually... */
> +    if (anonymized_seeds.cmpfn)
> +        ret = hashmap_get_entry(&anonymized_seeds, &key, hash, &key);
> +    else
> +        ret = NULL;
> +
> +    /* ...otherwise check if we've already seen it in this context... */
> +    if (!ret)
> +        ret = hashmap_get_entry(map, &key, hash, &key);
> +
> +    /* ...and finally generate a new mapping if necessary */

I was a bit surprised to see that --seed-anonymized values are stored
in a separate hash map rather than simply being used to (literally)
seed the existing anonymization hash map. I guess there's a good
technical reason for doing it this way, such as the normal
anonymization hash map not yet being in existence at the time the
--seed-anonymized option is processed? (I haven't checked because I'm
too lazy, so it may not be worth spending time answering me.)

> @@ -1188,6 +1230,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>         OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
> +        OPT_CALLBACK_F(0, "seed-anonymized", &anonymized_seeds, N_("from:to"),
> +               N_("convert <from> to <to> in anonymized output"),
> +               PARSE_OPT_NONEG, parse_opt_seed_anonymized),

Would it be worthwhile to add a check somewhere after the
parse_options() invocation and complain if --seed-anonymized was used
without --anonymize?  (Or should --seed-anonymized perhaps imply
--anonymize?)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 15:25     ` [PATCH 09/10] fast-export: allow seeding the anonymized mapping Jeff King
  2020-06-23 17:16       ` Eric Sunshine
@ 2020-06-23 18:11       ` Eric Sunshine
  2020-06-23 18:35         ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-23 18:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 11:25 AM Jeff King <peff@peff.net> wrote:
> Let's make it possible to seed the anonymization map. This lets users
> either:
> [...]
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> @@ -119,6 +119,11 @@ by keeping the marks the same across runs.
> +--seed-anonymized=<from>[:<to>]::
> +       Convert token `<from>` to `<to>` in the anonymized output. If
> +       `<to>` is omitted, map `<from>` to itself (i.e., do not
> +       anonymize it). See the section on `ANONYMIZING` below.

By the way (possible bikeshedding ahead), "seed anonymous" seems
overly technical. I wonder if a name such as
'--anonymize-to=<from>[:<to>]' might be clearer and easier for people
to understand.

In fact, in an earlier email, I asked whether --seed-anonymized should
imply --anonymize. Thinking further on this, I wonder if we even need
the second option name. It should be possible to overload the existing
--anonymize to handle all functions. For instance:

    '--anonymize' would anonymize everything

    '--anonymize=<from>[:<to>]' would anonymize and map <from> to <to>

So, the example you give in the documentation would become:

    git fast-export --all \
        --anonymize=foo.c:secret.c \
        --anonymize=mybranch >stream

Or is that too cryptic?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 17:16       ` Eric Sunshine
@ 2020-06-23 18:30         ` Jeff King
  2020-06-23 20:30           ` Eric Sunshine
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-23 18:30 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 01:16:05PM -0400, Eric Sunshine wrote:

> On Tue, Jun 23, 2020 at 11:25 AM Jeff King <peff@peff.net> wrote:
> > diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> > @@ -238,6 +243,25 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
> > +[...] For example, if you have a bug which reproduces
> > +with `git rev-list mybranch -- foo.c`, you can run:
> > +
> > +---------------------------------------------------
> > +$ git fast-export --anonymize --all \
> > +   --seed-anonymized=foo.c:secret.c \
> > +   --seed-anonymized=mybranch \
> > +   >stream
> > +---------------------------------------------------
> > +
> > +After importing the stream, you can then run `git rev-list mybranch --
> > +secret.c` in the anonymized repository.
> 
> I understand that your intention here is to demonstrate both forms of
> --seed-anonymized, but I'm slightly concerned that people may
> interpret this example as meaning that you are not allowed to
> anonymize the refname when anonymizing a pathname. It might be less
> ambiguous to avoid the "short form" in the example; people who have
> read the description of --seed-anonymized will know that the short
> form can be used without having to see it in an example.

I'm not sure what you'd write, then. You can't mention "mybranch"
anymore if it was anonymized. Are you suggesting to make the example:

  git rev-list -- foo.c

by itself?

> > +Note that paths and refnames are split into tokens at slash boundaries.
> > +The command above would anonymize `subdir/foo.c` as something like
> > +`path123/secret.c`.
> 
> Confusing. This seems to be saying that anonymizing filenames in
> subdirectories is pointless because you can't know how the leading
> directory names will be anonymized. That leaves the reader wondering
> how to deal with the situation. Does it require using
> --seed-anonymized for each path component leading up to the filename?

You can do that, but I think it would be simpler to just find "secret.c"
in the anonymized repo (either in the checkout, or just "git ls-tree
-r").

> Or can --seed-anonymized take an full pathname (leading directory
> components and filename) in one shot?

No, it can't. Suggested wording? That's what I was trying to say with
the above sentence.

> > +    /* First check if it's a token the user configured manually... */
> > +    if (anonymized_seeds.cmpfn)
> > +        ret = hashmap_get_entry(&anonymized_seeds, &key, hash, &key);
> > +    else
> > +        ret = NULL;
> > +
> > +    /* ...otherwise check if we've already seen it in this context... */
> > +    if (!ret)
> > +        ret = hashmap_get_entry(map, &key, hash, &key);
> > +
> > +    /* ...and finally generate a new mapping if necessary */
> 
> I was a bit surprised to see that --seed-anonymized values are stored
> in a separate hash map rather than simply being used to (literally)
> seed the existing anonymization hash map. I guess there's a good
> technical reason for doing it this way, such as the normal
> anonymization hash map not yet being in existence at the time the
> --seed-anonymized option is processed? (I haven't checked because I'm
> too lazy, so it may not be worth spending time answering me.)

The reason is that there isn't one anonymization hash map. There's a
separate one for each generator (so refs become "refs/heads/ref123" and
paths become "path123/path456").

> > @@ -1188,6 +1230,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
> >         OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
> > +        OPT_CALLBACK_F(0, "seed-anonymized", &anonymized_seeds, N_("from:to"),
> > +               N_("convert <from> to <to> in anonymized output"),
> > +               PARSE_OPT_NONEG, parse_opt_seed_anonymized),
> 
> Would it be worthwhile to add a check somewhere after the
> parse_options() invocation and complain if --seed-anonymized was used
> without --anonymize?  (Or should --seed-anonymized perhaps imply
> --anonymize?)

I thought about implying, but I have a slight preference to err on the
side of making things less magical. I don't mind triggering a warning or
error, but it's not like anything _bad_ happens if you don't say
--anonymize. It just doesn't do anything, which seems like a perfectly
logical outcome.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 18:11       ` Eric Sunshine
@ 2020-06-23 18:35         ` Jeff King
  2020-06-23 20:35           ` Eric Sunshine
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-23 18:35 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 02:11:51PM -0400, Eric Sunshine wrote:

> On Tue, Jun 23, 2020 at 11:25 AM Jeff King <peff@peff.net> wrote:
> > Let's make it possible to seed the anonymization map. This lets users
> > either:
> > [...]
> > Signed-off-by: Jeff King <peff@peff.net>
> > ---
> > diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> > @@ -119,6 +119,11 @@ by keeping the marks the same across runs.
> > +--seed-anonymized=<from>[:<to>]::
> > +       Convert token `<from>` to `<to>` in the anonymized output. If
> > +       `<to>` is omitted, map `<from>` to itself (i.e., do not
> > +       anonymize it). See the section on `ANONYMIZING` below.
> 
> By the way (possible bikeshedding ahead), "seed anonymous" seems
> overly technical. I wonder if a name such as
> '--anonymize-to=<from>[:<to>]' might be clearer and easier for people
> to understand.

I wrestled with the name, and I agree "seed" is overly technical. And I
came up with many similar variations of "anonymize-to", but they all
seemed ambiguous (e.g., it could be "to" a file that we're storing the
data in).

Perhaps "--anonymize-map" would be less technical?

> In fact, in an earlier email, I asked whether --seed-anonymized should
> imply --anonymize. Thinking further on this, I wonder if we even need
> the second option name. It should be possible to overload the existing
> --anonymize to handle all functions. For instance:
> 
>     '--anonymize' would anonymize everything
> 
>     '--anonymize=<from>[:<to>]' would anonymize and map <from> to <to>
> 
> So, the example you give in the documentation would become:
> 
>     git fast-export --all \
>         --anonymize=foo.c:secret.c \
>         --anonymize=mybranch >stream
> 
> Or is that too cryptic?

Yeah, that was another one I considered, but it both seemed cryptic
(after all, we're saying what _not_ to anonymize), and it squats on the
"anonymize" option. So imagine we had another option later, like
"anonymize blobs and paths, but not refs", that could easily be
"--anonymize=blobs,path" or "--anonymize=!refs". I'd rather not paint
ourselves in a corner.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [alternative 0/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (9 preceding siblings ...)
  2020-06-23 15:25     ` [PATCH 10/10] fast-export: anonymize "master" refname Jeff King
@ 2020-06-23 19:34     ` Junio C Hamano
  2020-06-23 19:44       ` Jeff King
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
  11 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2020-06-23 19:34 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine, Johannes Schindelin

Jeff King <peff@peff.net> writes:

> On Mon, Jun 22, 2020 at 05:47:46PM -0400, Jeff King wrote:
>
>> Here's a v2 which I think addresses all of the comments. I have to admit
>> that after writing my last email to Junio, I am wondering whether it
>> would be sufficient and simpler to let the user specify a static mapping
>> of tokens (that could just be applied anywhere).

Yeah, dumping the random mapping created and telling the user to
figure out what the tool did is less nice than allowing the user
to enumerate what tokens are sensitible and need to be replaced.

> Both of these techniques _could_ live side-by-side within fast-export,
> as they have slightly different strengths and weaknesses. But I'd prefer
> to just go with one (this one) in the name of simplicity, and I strongly
> suspect nobody will ever ask for the other.

OK.  So should we revert the merge of the other one into 'next'?
That is easy enough ;-)

Thanks.

>   [01/10]: t9351: derive anonymized tree checks from original repo
>   [02/10]: fast-export: use xmemdupz() for anonymizing oids
>
>     These first two are actually a bug-fix that I noticed while writing
>     it.
>
>   [03/10]: fast-export: store anonymized oids as hex strings
>   [04/10]: fast-export: tighten anonymize_mem() interface to handle only strings
>   [05/10]: fast-export: stop storing lengths in anonymized hashmaps
>   [06/10]: fast-export: use a flex array to store anonymized entries
>   [07/10]: fast-export: move global "idents" anonymize hashmap into function
>   [08/10]: fast-export: add a "data" callback parameter to anonymize_str()
>
>     This is all cleanup and prep. More than is strictly necessary for
>     this series, but it does simplify things and reduce the memory
>     footprint (only a few megabytes in git.git, but more in larger
>     repos).
>
>   [09/10]: fast-export: allow seeding the anonymized mapping
>
>     And then this is the actual feature...
>
>   [10/10]: fast-export: anonymize "master" refname
>
>     ...which finally lets us drop the special name rule.
>
>  Documentation/git-fast-export.txt |  24 +++++
>  builtin/fast-export.c             | 161 +++++++++++++++++++-----------
>  t/t9351-fast-export-anonymize.sh  |  54 +++++++---
>  3 files changed, 167 insertions(+), 72 deletions(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [alternative 0/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 19:34     ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Junio C Hamano
@ 2020-06-23 19:44       ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-23 19:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine, Johannes Schindelin

On Tue, Jun 23, 2020 at 12:34:47PM -0700, Junio C Hamano wrote:

> > Both of these techniques _could_ live side-by-side within fast-export,
> > as they have slightly different strengths and weaknesses. But I'd prefer
> > to just go with one (this one) in the name of simplicity, and I strongly
> > suspect nobody will ever ask for the other.
> 
> OK.  So should we revert the merge of the other one into 'next'?
> That is easy enough ;-)

Yes, please. :)

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 18:30         ` Jeff King
@ 2020-06-23 20:30           ` Eric Sunshine
  2020-06-24 15:47             ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-23 20:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 2:31 PM Jeff King <peff@peff.net> wrote:
> On Tue, Jun 23, 2020 at 01:16:05PM -0400, Eric Sunshine wrote:
> > On Tue, Jun 23, 2020 at 11:25 AM Jeff King <peff@peff.net> wrote:
> > I understand that your intention here is to demonstrate both forms of
> > --seed-anonymized, but I'm slightly concerned that people may
> > interpret this example as meaning that you are not allowed to
> > anonymize the refname when anonymizing a pathname. It might be less
> > ambiguous to avoid the "short form" in the example; people who have
> > read the description of --seed-anonymized will know that the short
> > form can be used without having to see it in an example.
>
> I'm not sure what you'd write, then. You can't mention "mybranch"
> anymore if it was anonymized. Are you suggesting to make the example:
>
>  git rev-list -- foo.c
>
> by itself?

Sorry, I meant to provide an example like this:

    For example, if you have a bug which reproduces with `git rev-list
    sensitive -- secret.c`, you can run:

    $ git fast-export --anonymize --all \
        --seed-anonymized=sensitive:foo \
        --seed-anonymized=secret.c:bar.c \
        >stream

    After importing the stream, you can then run `git rev-list foo --
    bar.c` in the anonymized repository.

> > > +Note that paths and refnames are split into tokens at slash boundaries.
> > > +The command above would anonymize `subdir/foo.c` as something like
> > > +`path123/secret.c`.
> >
> > Confusing. This seems to be saying that anonymizing filenames in
> > subdirectories is pointless because you can't know how the leading
> > directory names will be anonymized. That leaves the reader wondering
> > how to deal with the situation. Does it require using
> > --seed-anonymized for each path component leading up to the filename?
>
> You can do that, but I think it would be simpler to just find "secret.c"
> in the anonymized repo (either in the checkout, or just "git ls-tree
> -r").
>
> > Or can --seed-anonymized take an full pathname (leading directory
> > components and filename) in one shot?
>
> No, it can't. Suggested wording? That's what I was trying to say with
> the above sentence.

Hmm, perhaps your original attempt can be extended slightly to state
it more explicitly?

    Note that paths and refnames are split into tokens at slash
    boundaries. The command above would anonymize `subdir/foo.c` as
    something like `path123/secret.c`; you could then search for
    `secret.c` in the anonymized repository to determine the final
    pathname.

    To make referencing the final pathname simpler, you can seed
    anonymization for each path component; so, if you also anonymize
    `subdir` to `publicdir`, then the final pathname would be
    `publicdir/secret.c`.

This makes me wonder if --seed-anonymized should do its own
tokenization so that --seed-anonymized=subdir/foo:public/bar is
automatically understood as anonymizing "subdir" to "public" _and_
"foo" to "bar". But that potentially gets weird if you say:

    --seed-anonymized=a/b:q/p --seed-anonymized=a/c:y/z

in which case you've given conflicting replacements for "a". (I
suppose it could issue a warning message in that case.)

> > Would it be worthwhile to add a check somewhere after the
> > parse_options() invocation and complain if --seed-anonymized was used
> > without --anonymize? (Or should --seed-anonymized perhaps imply
> > --anonymize?)
>
> I thought about implying, but I have a slight preference to err on the
> side of making things less magical. I don't mind triggering a warning or
> error, but it's not like anything _bad_ happens if you don't say
> --anonymize. It just doesn't do anything, which seems like a perfectly
> logical outcome.

Lack of a warning or error could be kind of bad if the person doesn't
check the fast-export file before sending it out and only discovers
later that:

    git fast-export --seed-anonymized=foo:bar

didn't perform _any_ anonymization at all.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 18:35         ` Jeff King
@ 2020-06-23 20:35           ` Eric Sunshine
  2020-06-24 15:48             ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Sunshine @ 2020-06-23 20:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 2:35 PM Jeff King <peff@peff.net> wrote:
> On Tue, Jun 23, 2020 at 02:11:51PM -0400, Eric Sunshine wrote:
> > By the way (possible bikeshedding ahead), "seed anonymous" seems
> > overly technical. I wonder if a name such as
> > '--anonymize-to=<from>[:<to>]' might be clearer and easier for people
> > to understand.
>
> I wrestled with the name, and I agree "seed" is overly technical. And I
> came up with many similar variations of "anonymize-to", but they all
> seemed ambiguous (e.g., it could be "to" a file that we're storing the
> data in).
>
> Perhaps "--anonymize-map" would be less technical?

That's not too bad. It is better than --seed-anonymized. I haven't
come up with any name which improves upon it.

> > In fact, in an earlier email, I asked whether --seed-anonymized should
> > imply --anonymize. Thinking further on this, I wonder if we even need
> > the second option name. It should be possible to overload the existing
> > --anonymize to handle all functions. For instance:
> >
> >   git fast-export --all \
> >     --anonymize=foo.c:secret.c \
> >     --anonymize=mybranch >stream
> >
> > Or is that too cryptic?
>
> Yeah, that was another one I considered, but it both seemed cryptic
> (after all, we're saying what _not_ to anonymize), and it squats on the
> "anonymize" option. So imagine we had another option later, like
> "anonymize blobs and paths, but not refs", that could easily be
> "--anonymize=blobs,path" or "--anonymize=!refs". I'd rather not paint
> ourselves in a corner.

Okay, makes sense.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-23 15:24     ` [PATCH 03/10] fast-export: store anonymized oids as hex strings Jeff King
@ 2020-06-24 11:43       ` SZEDER Gábor
  2020-06-24 15:54         ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: SZEDER Gábor @ 2020-06-24 11:43 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 11:24:51AM -0400, Jeff King wrote:

> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 289395a131..4a3a4c933e 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -387,16 +387,19 @@ static void *generate_fake_oid(const void *old, size_t *len)
>  {
>  	static uint32_t counter = 1; /* avoid null oid */
>  	const unsigned hashsz = the_hash_algo->rawsz;
> -	unsigned char *out = xcalloc(hashsz, 1);
> -	put_be32(out + hashsz - 4, counter++);
> -	return out;
> +	struct object_id oid;
> +	char *hex = xmallocz(GIT_MAX_HEXSZ);
> +
> +	oidclr(&oid);
> +	put_be32(oid.hash + hashsz - 4, counter++);
> +	return oid_to_hex_r(hex, &oid);
>  }

GCC 4.8 complains about this change in our CI job:

  $ make CC=gcc-4.8 builtin/fast-export.o
      CC builtin/fast-export.o
  builtin/fast-export.c: In function ‘generate_fake_oid’:
  builtin/fast-export.c:394:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
    put_be32(oid.hash + hashsz - 4, counter++);
    ^
  cc1: all warnings being treated as errors



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 20:30           ` Eric Sunshine
@ 2020-06-24 15:47             ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-24 15:47 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 04:30:23PM -0400, Eric Sunshine wrote:

> > I'm not sure what you'd write, then. You can't mention "mybranch"
> > anymore if it was anonymized. Are you suggesting to make the example:
> >
> >  git rev-list -- foo.c
> >
> > by itself?
> 
> Sorry, I meant to provide an example like this:
> 
>     For example, if you have a bug which reproduces with `git rev-list
>     sensitive -- secret.c`, you can run:
> 
>     $ git fast-export --anonymize --all \
>         --seed-anonymized=sensitive:foo \
>         --seed-anonymized=secret.c:bar.c \
>         >stream
> 
>     After importing the stream, you can then run `git rev-list foo --
>     bar.c` in the anonymized repository.

Thanks, that makes sense. I took this as-is for my reroll (modulo the
change of option name discussed elsewhere).

> Hmm, perhaps your original attempt can be extended slightly to state
> it more explicitly?
> 
>     Note that paths and refnames are split into tokens at slash
>     boundaries. The command above would anonymize `subdir/foo.c` as
>     something like `path123/secret.c`; you could then search for
>     `secret.c` in the anonymized repository to determine the final
>     pathname.
> 
>     To make referencing the final pathname simpler, you can seed
>     anonymization for each path component; so, if you also anonymize
>     `subdir` to `publicdir`, then the final pathname would be
>     `publicdir/secret.c`.

Thanks, I took this modulo some fixups to match the example above, and
to avoid the use of the word "seed" based on our other discussion.

> This makes me wonder if --seed-anonymized should do its own
> tokenization so that --seed-anonymized=subdir/foo:public/bar is
> automatically understood as anonymizing "subdir" to "public" _and_
> "foo" to "bar". But that potentially gets weird if you say:
> 
>     --seed-anonymized=a/b:q/p --seed-anonymized=a/c:y/z
> 
> in which case you've given conflicting replacements for "a". (I
> suppose it could issue a warning message in that case.)

Right, I think you get into weird corner cases. Another issue is that
not all items are tokenized (e.g., if your author name was foo/bar,
you'd want that replaced as a whole). Probably you could add both the
broken-down and full inputs. Yet another issue is that you can't add a
token with a ":" due to the syntax.

This is an infrequently-enough-used feature that I think it's worth
keeping things simple, even if they're a little less convenient to
invoke.

> Lack of a warning or error could be kind of bad if the person doesn't
> check the fast-export file before sending it out and only discovers
> later that:
> 
>     git fast-export --seed-anonymized=foo:bar
> 
> didn't perform _any_ anonymization at all.

Good point. I'd hope people would glance at the output before sending it
out, but given that it's a potential safety issue, it probably is worth
detecting this case. I'll add it to my re-roll.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] fast-export: allow seeding the anonymized mapping
  2020-06-23 20:35           ` Eric Sunshine
@ 2020-06-24 15:48             ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-24 15:48 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Tue, Jun 23, 2020 at 04:35:37PM -0400, Eric Sunshine wrote:

> > I wrestled with the name, and I agree "seed" is overly technical. And I
> > came up with many similar variations of "anonymize-to", but they all
> > seemed ambiguous (e.g., it could be "to" a file that we're storing the
> > data in).
> >
> > Perhaps "--anonymize-map" would be less technical?
> 
> That's not too bad. It is better than --seed-anonymized. I haven't
> come up with any name which improves upon it.

I went with that in my reroll, and avoided using the word "seed" at all
in the documentation (I did keep the name "anonymized_seeds" as the
internal variable name for the hashmap, since just calling it "map"
there is ambiguous with all of the other maps).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-24 11:43       ` SZEDER Gábor
@ 2020-06-24 15:54         ` Jeff King
  2020-06-25 15:49           ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-24 15:54 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Wed, Jun 24, 2020 at 01:43:13PM +0200, SZEDER Gábor wrote:

> >  	static uint32_t counter = 1; /* avoid null oid */
> >  	const unsigned hashsz = the_hash_algo->rawsz;
> > -	unsigned char *out = xcalloc(hashsz, 1);
> > -	put_be32(out + hashsz - 4, counter++);
> > -	return out;
> > +	struct object_id oid;
> > +	char *hex = xmallocz(GIT_MAX_HEXSZ);
> > +
> > +	oidclr(&oid);
> > +	put_be32(oid.hash + hashsz - 4, counter++);
> > +	return oid_to_hex_r(hex, &oid);
> >  }
> 
> GCC 4.8 complains about this change in our CI job:
> 
>   $ make CC=gcc-4.8 builtin/fast-export.o
>       CC builtin/fast-export.o
>   builtin/fast-export.c: In function ‘generate_fake_oid’:
>   builtin/fast-export.c:394:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
>     put_be32(oid.hash + hashsz - 4, counter++);
>     ^
>   cc1: all warnings being treated as errors

Interesting. The only change on this line is swapping out the local
"unsigned char *" for an object_id, which also stores an "unsigned
char" (though as an array). And while put_be32() takes a void pointer,
it's inlined and treats it the argument an "unsigned char *" internally.
So I'm not sure I see that any type punning is going on at all.

I'll see if I can appease the compiler, though.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str()
  2020-06-23 15:25     ` [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
@ 2020-06-24 19:58       ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2020-06-24 19:58 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine, Johannes Schindelin

Jeff King <peff@peff.net> writes:

> The anonymize_str() function takes a generator callback, but there's no
> way to pass extra context to it. Let's add the usual "void *data"
> parameter to the generator interface and pass it along.
>
> This is mildly annoying for existing callers, all of which pass NULL,
> but is necessary to avoid extra globals in some cases we'll add in a
> subsequent patch.

> While we're touching each of these callbacks, we can further observe
> that none of them use the existing orig/len parameters at all. This
> makes sense, since the point is for their output to have no discernable
> basis in the original (my original version had some notion that we might
> use a one-way function to obfuscate the names, but it was never
> implemented). So let's drop those extra parameters. If a caller really
> wants to do something with them, it can pass a struct through the new
> data parameter.

I guess it was giving a perfect proof that the anonymization is
good---you cannot leak the info in data you did not even look at
;-).

And we must keep passing the <orig,len> pair to the anonymize API,
of course, because that would be used as a look-up key for the
customized/seeded mapping, which makes sense.  But of course it is
not necessary to pass them to the lower-level "generate" callbacks.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-24 15:54         ` Jeff King
@ 2020-06-25 15:49           ` Jeff King
  2020-06-25 20:45             ` SZEDER Gábor
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-25 15:49 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Wed, Jun 24, 2020 at 11:54:20AM -0400, Jeff King wrote:

> > GCC 4.8 complains about this change in our CI job:
> > 
> >   $ make CC=gcc-4.8 builtin/fast-export.o
> >       CC builtin/fast-export.o
> >   builtin/fast-export.c: In function ‘generate_fake_oid’:
> >   builtin/fast-export.c:394:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
> >     put_be32(oid.hash + hashsz - 4, counter++);
> >     ^
> >   cc1: all warnings being treated as errors
> 
> Interesting. The only change on this line is swapping out the local
> "unsigned char *" for an object_id, which also stores an "unsigned
> char" (though as an array). And while put_be32() takes a void pointer,
> it's inlined and treats it the argument an "unsigned char *" internally.
> So I'm not sure I see that any type punning is going on at all.
> 
> I'll see if I can appease the compiler, though.

Hmm. Switching to a local array makes the warning go away:

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 4a3a4c93..eae2ec5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -387,12 +387,12 @@ static void *generate_fake_oid(const void *old, size_t *len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
-	struct object_id oid;
+	unsigned char out[GIT_MAX_RAWSZ];
 	char *hex = xmallocz(GIT_MAX_HEXSZ);
 
-	oidclr(&oid);
-	put_be32(oid.hash + hashsz - 4, counter++);
-	return oid_to_hex_r(hex, &oid);
+	hashclr(out);
+	put_be32(out + hashsz - 4, counter++);
+	return hash_to_hex_algop_r(hex, out, the_hash_algo);
 }
 
 static const char *anonymize_oid(const char *oid_hex)

at the cost of some uglier use of the_hash_algo and RAWSZ. But
curiously, even with the array, if I adjust the write only slightly to
write at the begining instead of the end (we don't really care where our
counter appears in the hash, since the point is to anonymize):

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index eae2ec5..1f52247 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -386,12 +386,11 @@ static void print_path(const char *path)
 static void *generate_fake_oid(const void *old, size_t *len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
-	const unsigned hashsz = the_hash_algo->rawsz;
 	unsigned char out[GIT_MAX_RAWSZ];
 	char *hex = xmallocz(GIT_MAX_HEXSZ);
 
 	hashclr(out);
-	put_be32(out + hashsz - 4, counter++);
+	put_be32(out, counter++);
 	return hash_to_hex_algop_r(hex, out, the_hash_algo);
 }
 

...then the warning comes back. I tried that because I was wondering
whether the same thing might make the warning go away on the object_id
version, but it doesn't.

So this really seems like a pointless false positive from the compiler,
and it's a rather old compiler (the warning no longer triggers in gcc 6
and up). Is it worth caring about? Ubuntu Trusty is out of standard
support but not fully EOL'd until 2022. Debian jessie has gcc 4.9
which triggers the warning, but will hit EOL in 5 days. If it were an
actual breakage I'd be more concerned, but keeping such an old compiler
-Werror clean doesn't seem that important.

I'd note also that none of the Actions CI jobs run with this compiler
version. If we _do_ want to care about it, it might be worth covering it
there.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 0/11] fast-export: allow seeding the anonymized mapping
  2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
                       ` (10 preceding siblings ...)
  2020-06-23 19:34     ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Junio C Hamano
@ 2020-06-25 19:48     ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo Jeff King
                         ` (11 more replies)
  11 siblings, 12 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Here's a re-roll of the "--seed-anonymize" feature, incorporating
changes from Eric's review.

Here's a summary of the changes. Patches 1-8 are unchanged:

  [01/11]: t9351: derive anonymized tree checks from original repo
  [02/11]: fast-export: use xmemdupz() for anonymizing oids
  [03/11]: fast-export: store anonymized oids as hex strings
  [04/11]: fast-export: tighten anonymize_mem() interface to handle only strings
  [05/11]: fast-export: stop storing lengths in anonymized hashmaps
  [06/11]: fast-export: use a flex array to store anonymized entries
  [07/11]: fast-export: move global "idents" anonymize hashmap into function
  [08/11]: fast-export: add a "data" callback parameter to anonymize_str()

  [09/11]: fast-export: allow seeding the anonymized mapping

    - the option is now called "--anonymize-map", and the word "seed"
      isn't used in any user-facing documentation

    - incorporated Eric's documentation suggestions

    - --anonymize-map without --anonymize now triggers an error

  [10/11]: fast-export: anonymize "master" refname

    - minor adjustments to handle change of option name

  [11/11]: fast-export: use local array to store anonymized oid

    - new in this iteration; this address the gcc-4 warning that Gábor
      mentioned. I prepared it on top since I think we'd eventually want
      to revert it once we decide that compiler is too old (and I'd be
      perfectly fine to declare that it's so now, and just never apply
      it at all).

Full range diff is below.

 1:  65dca0c684 =  1:  dd2668d7fe t9351: derive anonymized tree checks from original repo
 2:  bc79b92850 =  2:  46a07c62e1 fast-export: use xmemdupz() for anonymizing oids
 3:  f5b34b3730 =  3:  a7be6c4703 fast-export: store anonymized oids as hex strings
 4:  5982a5c764 =  4:  712e481e4e fast-export: tighten anonymize_mem() interface to handle only strings
 5:  d6b7a75a13 =  5:  1ad28f560a fast-export: stop storing lengths in anonymized hashmaps
 6:  f897a949b1 =  6:  b6456c0c7b fast-export: use a flex array to store anonymized entries
 7:  e5465fb8d8 =  7:  81d478c12b fast-export: move global "idents" anonymize hashmap into function
 8:  8dd74b95b7 =  8:  0d34f57a94 fast-export: add a "data" callback parameter to anonymize_str()
 9:  c7aec82ba5 !  9:  64e3afe46f fast-export: allow seeding the anonymized mapping
    @@ Commit message
         does work, but it's a bit awkward to use (you have to manually dig the
         items you care about out of the mapping).
     
    +    Helped-by: Eric Sunshine <sunshine@sunshineco.com>
         Signed-off-by: Jeff King <peff@peff.net>
     
      ## Documentation/git-fast-export.txt ##
     @@ Documentation/git-fast-export.txt: by keeping the marks the same across runs.
      	the shape of the history and stored tree.  See the section on
      	`ANONYMIZING` below.
      
    -+--seed-anonymized=<from>[:<to>]::
    ++--anonymize-map=<from>[:<to>]::
     +	Convert token `<from>` to `<to>` in the anonymized output. If
     +	`<to>` is omitted, map `<from>` to itself (i.e., do not
     +	anonymize it). See the section on `ANONYMIZING` below.
    @@ Documentation/git-fast-export.txt: collapse "User 0", "User 1", etc into "User X
     +paths, which becomes challenging after refnames and paths have been
     +anonymized. You can ask for a particular token to be left as-is or
     +mapped to a new value. For example, if you have a bug which reproduces
    -+with `git rev-list mybranch -- foo.c`, you can run:
    ++with `git rev-list sensitive -- secret.c`, you can run:
     +
     +---------------------------------------------------
     +$ git fast-export --anonymize --all \
    -+      --seed-anonymized=foo.c:secret.c \
    -+      --seed-anonymized=mybranch \
    ++      --anonymize-map=sensitive:foo \
    ++      --anonymize-map=secret.c:bar.c \
     +      >stream
     +---------------------------------------------------
     +
    -+After importing the stream, you can then run `git rev-list mybranch --
    -+secret.c` in the anonymized repository.
    ++After importing the stream, you can then run `git rev-list foo -- bar.c`
    ++in the anonymized repository.
     +
     +Note that paths and refnames are split into tokens at slash boundaries.
    -+The command above would anonymize `subdir/foo.c` as something like
    -+`path123/secret.c`.
    ++The command above would anonymize `subdir/secret.c` as something like
    ++`path123/bar.c`; you could then search for `bar.c` in the anonymized
    ++repository to determine the final pathname.
    ++
    ++To make referencing the final pathname simpler, you can map each path
    ++component; so if you also anonymize `subdir` to `publicdir`, then the
    ++final pathname would be `publicdir/bar.c`.
      
      LIMITATIONS
      -----------
    @@ builtin/fast-export.c: static void handle_deletes(void)
     +	return xstrdup(data);
     +}
     +
    -+static int parse_opt_seed_anonymized(const struct option *opt,
    -+				     const char *arg, int unset)
    ++static int parse_opt_anonymize_map(const struct option *opt,
    ++				   const char *arg, int unset)
     +{
     +	struct hashmap *map = opt->value;
     +	const char *delim, *value;
    @@ builtin/fast-export.c: static void handle_deletes(void)
     +	}
     +
     +	if (!keylen || !*value)
    -+		return error(_("--seed-anonymized token cannot be empty"));
    ++		return error(_("--anonymize-map token cannot be empty"));
     +
     +	anonymize_str(map, anonymize_seed, arg, keylen, (void *)value);
     +
    @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const ch
      		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
      			     N_("Apply refspec to exported refs")),
      		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
    -+		OPT_CALLBACK_F(0, "seed-anonymized", &anonymized_seeds, N_("from:to"),
    ++		OPT_CALLBACK_F(0, "anonymize-map", &anonymized_seeds, N_("from:to"),
     +			       N_("convert <from> to <to> in anonymized output"),
    -+			       PARSE_OPT_NONEG, parse_opt_seed_anonymized),
    ++			       PARSE_OPT_NONEG, parse_opt_anonymize_map),
      		OPT_BOOL(0, "reference-excluded-parents",
      			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
      		OPT_BOOL(0, "show-original-ids", &show_original_ids,
    +@@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const char *prefix)
    + 	if (argc > 1)
    + 		usage_with_options (fast_export_usage, options);
    + 
    ++	if (anonymized_seeds.cmpfn && !anonymize)
    ++		die(_("--anonymize-map without --anonymize does not make sense"));
    ++
    + 	if (refspecs_list.nr) {
    + 		int i;
    + 
     
      ## t/t9351-fast-export-anonymize.sh ##
     @@ t/t9351-fast-export-anonymize.sh: test_description='basic tests for fast-export --anonymize'
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'setup simple repo' '
      test_expect_success 'export anonymized stream' '
     -	git fast-export --anonymize --all >stream
     +	git fast-export --anonymize --all \
    -+		--seed-anonymized=retain-me \
    -+		--seed-anonymized=xyzzy:custom-name \
    ++		--anonymize-map=retain-me \
    ++		--anonymize-map=xyzzy:custom-name \
     +		>stream
      '
      
10:  59737b97de ! 10:  103f5bcd65 fast-export: anonymize "master" refname
    @@ Commit message
     
         Note that we have to adjust the test a bit, since it relied on using the
         name "master" in the anonymized repos. We could just use
    -    --seed-anonymized=master to keep the same output, but then we wouldn't
    +    --anonymize-map=master to keep the same output, but then we wouldn't
         know if it works because of our hard-coded master or because of the
    -    seeding.
    +    explicit map.
     
         So let's flip the test a bit, and confirm that we anonymize "master",
         but keep "other" in the output.
    @@ builtin/fast-export.c: static const char *anonymize_refname(const char *refname)
      ## t/t9351-fast-export-anonymize.sh ##
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'export anonymized stream' '
      	git fast-export --anonymize --all \
    - 		--seed-anonymized=retain-me \
    - 		--seed-anonymized=xyzzy:custom-name \
    -+		--seed-anonymized=other \
    + 		--anonymize-map=retain-me \
    + 		--anonymize-map=xyzzy:custom-name \
    ++		--anonymize-map=other \
      		>stream
      '
      
 -:  ---------- > 11:  f2640407c5 fast-export: use local array to store anonymized oid

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 02/11] fast-export: use xmemdupz() for anonymizing oids Jeff King
                         ` (10 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Our tests of the anonymized repo just hard-code the expected set of
objects in the root and subdirectory trees. This makes them brittle to
the test setup changing (e.g., adding new paths that need tested).

Let's look at the original repo to compute our expected set of objects.
Note that this isn't completely perfect (e.g., we still rely on there
being only one tree in the root), but it does simplify later patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t9351-fast-export-anonymize.sh | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 897dc50907..e772cf9930 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -71,22 +71,18 @@ test_expect_success 'repo has original shape and timestamps' '
 
 test_expect_success 'root tree has original shape' '
 	# the output entries are not necessarily in the same
-	# order, but we know at least that we will have one tree
-	# and one blob, so just check the sorted order
-	cat >expect <<-\EOF &&
-	blob
-	tree
-	EOF
+	# order, but we should at least have the same set of
+	# object types.
+	git -C .. ls-tree HEAD >orig-root &&
+	cut -d" " -f2 <orig-root | sort >expect &&
 	git ls-tree $other_branch >root &&
 	cut -d" " -f2 <root | sort >actual &&
 	test_cmp expect actual
 '
 
 test_expect_success 'paths in subdir ended up in one tree' '
-	cat >expect <<-\EOF &&
-	blob
-	blob
-	EOF
+	git -C .. ls-tree other:subdir >orig-subdir &&
+	cut -d" " -f2 <orig-subdir | sort >expect &&
 	tree=$(grep tree root | cut -f2) &&
 	git ls-tree $other_branch:$tree >tree &&
 	cut -d" " -f2 <tree >actual &&
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 02/11] fast-export: use xmemdupz() for anonymizing oids
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
  2020-06-25 19:48       ` [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 03/11] fast-export: store anonymized oids as hex strings Jeff King
                         ` (9 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Our anonymize_mem() function is careful to take a ptr/len pair to allow
storing binary tokens like object ids, as well as partial strings (e.g.,
just "foo" of "foo/bar"). But it duplicates the hash key using
xstrdup()! That means that:

  - for a partial string, we'd store all bytes up to the NUL, even
    though we'd never look at anything past "len". This didn't produce
    wrong behavior, but was wasteful.

  - for a binary oid that doesn't contain a zero byte, we'd copy garbage
    bytes off the end of the array (though as long as nothing complained
    about reading uninitialized bytes, further reads would be limited by
    "len", and we'd produce the correct results)

  - for a binary oid that does contain a zero byte, we'd copy _fewer_
    bytes than intended into the hashmap struct. When we later try to
    look up a value, we'd access uninitialized memory and potentially
    falsely claim that a particular oid is not present.

The most common reason to store an oid is an anonymized gitlink, but our
test case doesn't have any gitlinks at all. So let's add one whose oid
contains a NUL and is present at two different paths. ASan catches the
memory error, but even without it we can detect the bug because the oid
is not anonymized the same way for both paths.

And of course the fix is to copy the correct number of bytes. We don't
technically need the appended NUL from xmemdupz(), but it doesn't hurt
as an extra protection against anybody treating it like a string (plus a
future patch will push us more in that direction).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  2 +-
 t/t9351-fast-export-anonymize.sh | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 85868162ee..289395a131 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -162,7 +162,7 @@ static const void *anonymize_mem(struct hashmap *map,
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xstrdup(orig);
+		ret->orig = xmemdupz(orig, *len);
 		ret->orig_len = *len;
 		ret->anon = generate(orig, len);
 		ret->anon_len = *len;
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index e772cf9930..dc5d75cd19 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -10,6 +10,10 @@ test_expect_success 'setup simple repo' '
 	mkdir subdir &&
 	test_commit subdir/bar &&
 	test_commit subdir/xyzzy &&
+	fake_commit=$(echo $ZERO_OID | sed s/0/a/) &&
+	git update-index --add --cacheinfo 160000,$fake_commit,link1 &&
+	git update-index --add --cacheinfo 160000,$fake_commit,link2 &&
+	git commit -m "add gitlink" &&
 	git tag -m "annotated tag" mytag
 '
 
@@ -26,6 +30,12 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
+test_expect_success 'stream omits gitlink oids' '
+	# avoid relying on the whole oid to remain hash-agnostic; this is
+	# plenty to be unique within our test case
+	! grep a000000000000000000 stream
+'
+
 test_expect_success 'stream allows master as refname' '
 	grep master stream
 '
@@ -89,6 +99,11 @@ test_expect_success 'paths in subdir ended up in one tree' '
 	test_cmp expect actual
 '
 
+test_expect_success 'identical gitlinks got identical oid' '
+	awk "/commit/ { print \$3 }" <root | sort -u >commits &&
+	test_line_count = 1 commits
+'
+
 test_expect_success 'tag points to branch tip' '
 	git rev-parse $other_branch >expect &&
 	git for-each-ref --format="%(*objectname)" | grep . >actual &&
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 03/11] fast-export: store anonymized oids as hex strings
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
  2020-06-25 19:48       ` [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo Jeff King
  2020-06-25 19:48       ` [PATCH v2 02/11] fast-export: use xmemdupz() for anonymizing oids Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 04/11] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
                         ` (8 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

When fast-export stores anonymized oids, it does so as binary strings.
And while the anonymous mapping storage is binary-clean (at least as of
the previous commit), this will become awkward when we start exposing
more of it to the user. In particular, if we allow a method for
retaining token "foo", then users may want to specify a hex oid as such
a token.

Let's just switch to storing the hex strings. The difference in memory
usage is negligible (especially considering how infrequently we'd
generally store an oid compared to, say, path components).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 289395a131..4a3a4c933e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -387,16 +387,19 @@ static void *generate_fake_oid(const void *old, size_t *len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
-	unsigned char *out = xcalloc(hashsz, 1);
-	put_be32(out + hashsz - 4, counter++);
-	return out;
+	struct object_id oid;
+	char *hex = xmallocz(GIT_MAX_HEXSZ);
+
+	oidclr(&oid);
+	put_be32(oid.hash + hashsz - 4, counter++);
+	return oid_to_hex_r(hex, &oid);
 }
 
-static const struct object_id *anonymize_oid(const struct object_id *oid)
+static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
-	size_t len = the_hash_algo->rawsz;
-	return anonymize_mem(&objs, generate_fake_oid, oid, &len);
+	size_t len = strlen(oid_hex);
+	return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -455,9 +458,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
-				       oid_to_hex(anonymize ?
-						  anonymize_oid(&spec->oid) :
-						  &spec->oid));
+				       anonymize ?
+				       anonymize_oid(oid_to_hex(&spec->oid)) :
+				       oid_to_hex(&spec->oid));
 			else {
 				struct object *object = lookup_object(the_repository,
 								      &spec->oid);
@@ -712,9 +715,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		if (mark)
 			printf(":%d\n", mark);
 		else
-			printf("%s\n", oid_to_hex(anonymize ?
-						  anonymize_oid(&obj->oid) :
-						  &obj->oid));
+			printf("%s\n",
+			       anonymize ?
+			       anonymize_oid(oid_to_hex(&obj->oid)) :
+			       oid_to_hex(&obj->oid));
 		i++;
 	}
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 04/11] fast-export: tighten anonymize_mem() interface to handle only strings
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (2 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 03/11] fast-export: store anonymized oids as hex strings Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 05/11] fast-export: stop storing lengths in anonymized hashmaps Jeff King
                         ` (7 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

While the anonymize_mem() interface _can_ store arbitrary byte
sequences, none of the callers uses this feature (as of the previous
commit). We'd like to keep it that way, as we'll be exposing the
string-like nature of the anonymization routines to the user. So let's
tighten up the interface a bit:

  - don't treat "len" as an out-parameter from anonymize_mem(); this
    ensures callers treat the pointer result as a NUL-terminated string

  - likewise, don't treat "len" as an out-parameter from generator
    functions

  - swap out "void *" for "char *" as appropriate to signal that we
    don't handle arbitrary memory

  - rename the function to anonymize_str()

This will also open up some optimization opportunities in a future
patch.

Note that we can't drop the "len" parameter entirely. Some callers do
pass in partial strings (e.g., "foo/bar", len=3) to avoid copying, and
we need to handle those still.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 53 +++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 4a3a4c933e..d8ea067630 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -145,31 +145,30 @@ static int anonymized_entry_cmp(const void *unused_cmp_data,
  * the same anonymized string with another. The actual generation
  * is farmed out to the generate function.
  */
-static const void *anonymize_mem(struct hashmap *map,
-				 void *(*generate)(const void *, size_t *),
-				 const void *orig, size_t *len)
+static const char *anonymize_str(struct hashmap *map,
+				 char *(*generate)(const char *, size_t),
+				 const char *orig, size_t len)
 {
 	struct anonymized_entry key, *ret;
 
 	if (!map->cmpfn)
 		hashmap_init(map, anonymized_entry_cmp, NULL, 0);
 
-	hashmap_entry_init(&key.hash, memhash(orig, *len));
+	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
-	key.orig_len = *len;
+	key.orig_len = len;
 	ret = hashmap_get_entry(map, &key, hash, NULL);
 
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xmemdupz(orig, *len);
-		ret->orig_len = *len;
+		ret->orig = xmemdupz(orig, len);
+		ret->orig_len = len;
 		ret->anon = generate(orig, len);
-		ret->anon_len = *len;
+		ret->anon_len = strlen(ret->anon);
 		hashmap_put(map, &ret->hash);
 	}
 
-	*len = ret->anon_len;
 	return ret->anon;
 }
 
@@ -181,13 +180,13 @@ static const void *anonymize_mem(struct hashmap *map,
  */
 static void anonymize_path(struct strbuf *out, const char *path,
 			   struct hashmap *map,
-			   void *(*generate)(const void *, size_t *))
+			   char *(*generate)(const char *, size_t))
 {
 	while (*path) {
 		const char *end_of_component = strchrnul(path, '/');
 		size_t len = end_of_component - path;
-		const char *c = anonymize_mem(map, generate, path, &len);
-		strbuf_add(out, c, len);
+		const char *c = anonymize_str(map, generate, path, len);
+		strbuf_addstr(out, c);
 		path = end_of_component;
 		if (*path)
 			strbuf_addch(out, *path++);
@@ -361,12 +360,12 @@ static void print_path_1(const char *path)
 		printf("%s", path);
 }
 
-static void *anonymize_path_component(const void *path, size_t *len)
+static char *anonymize_path_component(const char *path, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "path%d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static void print_path(const char *path)
@@ -383,7 +382,7 @@ static void print_path(const char *path)
 	}
 }
 
-static void *generate_fake_oid(const void *old, size_t *len)
+static char *generate_fake_oid(const char *old, size_t len)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
@@ -399,7 +398,7 @@ static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
 	size_t len = strlen(oid_hex);
-	return anonymize_mem(&objs, generate_fake_oid, oid_hex, &len);
+	return anonymize_str(&objs, generate_fake_oid, oid_hex, len);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -496,12 +495,12 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
-static void *anonymize_ref_component(const void *old, size_t *len)
+static char *anonymize_ref_component(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "ref%d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static const char *anonymize_refname(const char *refname)
@@ -550,13 +549,13 @@ static char *anonymize_commit_message(const char *old)
 }
 
 static struct hashmap idents;
-static void *anonymize_ident(const void *old, size_t *len)
+static char *anonymize_ident(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
 	counter++;
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 /*
@@ -591,9 +590,9 @@ static void anonymize_ident_line(const char **beg, const char **end)
 		size_t len;
 
 		len = split.mail_end - split.name_begin;
-		ident = anonymize_mem(&idents, anonymize_ident,
-				      split.name_begin, &len);
-		strbuf_add(out, ident, len);
+		ident = anonymize_str(&idents, anonymize_ident,
+				      split.name_begin, len);
+		strbuf_addstr(out, ident);
 		strbuf_addch(out, ' ');
 		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
 	} else {
@@ -733,12 +732,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	show_progress();
 }
 
-static void *anonymize_tag(const void *old, size_t *len)
+static char *anonymize_tag(const char *old, size_t len)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
 	strbuf_addf(&out, "tag message %d", counter++);
-	return strbuf_detach(&out, len);
+	return strbuf_detach(&out, NULL);
 }
 
 static void handle_tail(struct object_array *commits, struct rev_info *revs,
@@ -808,8 +807,8 @@ static void handle_tag(const char *name, struct tag *tag)
 		name = anonymize_refname(name);
 		if (message) {
 			static struct hashmap tags;
-			message = anonymize_mem(&tags, anonymize_tag,
-						message, &message_size);
+			message = anonymize_str(&tags, anonymize_tag,
+						message, message_size);
 		}
 	}
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 05/11] fast-export: stop storing lengths in anonymized hashmaps
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (3 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 04/11] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 06/11] fast-export: use a flex array to store anonymized entries Jeff King
                         ` (6 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Now that the anonymize_str() interface is restricted to NUL-terminated
strings, there's no need for us to keep track of the length of each
entry in the hashmap. This simplifies the code and saves a bit of
memory.

Note that we do still need to compare the stored results to partial
strings passed in by the callers. We can do that by using hashmap's
keydata feature to get the ptr/len pair into the comparison function,
and then using strncmp().

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index d8ea067630..5df2ada47d 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -121,23 +121,32 @@ static int has_unshown_parent(struct commit *commit)
 struct anonymized_entry {
 	struct hashmap_entry hash;
 	const char *orig;
-	size_t orig_len;
 	const char *anon;
-	size_t anon_len;
+};
+
+struct anonymized_entry_key {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
 };
 
 static int anonymized_entry_cmp(const void *unused_cmp_data,
 				const struct hashmap_entry *eptr,
 				const struct hashmap_entry *entry_or_key,
-				const void *unused_keydata)
+				const void *keydata)
 {
 	const struct anonymized_entry *a, *b;
 
 	a = container_of(eptr, const struct anonymized_entry, hash);
-	b = container_of(entry_or_key, const struct anonymized_entry, hash);
+	if (keydata) {
+		const struct anonymized_entry_key *key = keydata;
+		int equal = !strncmp(a->orig, key->orig, key->orig_len) &&
+			    !a->orig[key->orig_len];
+		return !equal;
+	}
 
-	return a->orig_len != b->orig_len ||
-		memcmp(a->orig, b->orig, a->orig_len);
+	b = container_of(entry_or_key, const struct anonymized_entry, hash);
+	return strcmp(a->orig, b->orig);
 }
 
 /*
@@ -149,23 +158,22 @@ static const char *anonymize_str(struct hashmap *map,
 				 char *(*generate)(const char *, size_t),
 				 const char *orig, size_t len)
 {
-	struct anonymized_entry key, *ret;
+	struct anonymized_entry_key key;
+	struct anonymized_entry *ret;
 
 	if (!map->cmpfn)
 		hashmap_init(map, anonymized_entry_cmp, NULL, 0);
 
 	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
 	key.orig_len = len;
-	ret = hashmap_get_entry(map, &key, hash, NULL);
+	ret = hashmap_get_entry(map, &key, hash, &key);
 
 	if (!ret) {
 		ret = xmalloc(sizeof(*ret));
 		hashmap_entry_init(&ret->hash, key.hash.hash);
 		ret->orig = xmemdupz(orig, len);
-		ret->orig_len = len;
 		ret->anon = generate(orig, len);
-		ret->anon_len = strlen(ret->anon);
 		hashmap_put(map, &ret->hash);
 	}
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 06/11] fast-export: use a flex array to store anonymized entries
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (4 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 05/11] fast-export: stop storing lengths in anonymized hashmaps Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 07/11] fast-export: move global "idents" anonymize hashmap into function Jeff King
                         ` (5 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Now that we're using a separate keydata struct for hash lookups, we have
more flexibility in how we allocate anonymized_entry structs. Let's push
the "orig" key into a flex member within the struct. That should save us
a few bytes of memory per entry (a pointer plus any malloc overhead),
and may make lookups a little faster (since it's one less pointer to
chase in the comparison function).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 5df2ada47d..99d4a2b404 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -120,8 +120,8 @@ static int has_unshown_parent(struct commit *commit)
 
 struct anonymized_entry {
 	struct hashmap_entry hash;
-	const char *orig;
 	const char *anon;
+	const char orig[FLEX_ARRAY];
 };
 
 struct anonymized_entry_key {
@@ -170,9 +170,8 @@ static const char *anonymize_str(struct hashmap *map,
 	ret = hashmap_get_entry(map, &key, hash, &key);
 
 	if (!ret) {
-		ret = xmalloc(sizeof(*ret));
+		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->orig = xmemdupz(orig, len);
 		ret->anon = generate(orig, len);
 		hashmap_put(map, &ret->hash);
 	}
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 07/11] fast-export: move global "idents" anonymize hashmap into function
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (5 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 06/11] fast-export: use a flex array to store anonymized entries Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 08/11] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
                         ` (4 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

All of the other anonymization functions keep their static mappings
inside the function to avoid polluting the global namespace. Let's do
the same for "idents", as nobody needs it outside of
anonymize_ident_line().

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 99d4a2b404..16a1563e49 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -555,7 +555,6 @@ static char *anonymize_commit_message(const char *old)
 	return xstrfmt("subject %d\n\nbody\n", counter++);
 }
 
-static struct hashmap idents;
 static char *anonymize_ident(const char *old, size_t len)
 {
 	static int counter;
@@ -572,6 +571,7 @@ static char *anonymize_ident(const char *old, size_t len)
  */
 static void anonymize_ident_line(const char **beg, const char **end)
 {
+	static struct hashmap idents;
 	static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
 	static unsigned which_buffer;
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 08/11] fast-export: add a "data" callback parameter to anonymize_str()
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (6 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 07/11] fast-export: move global "idents" anonymize hashmap into function Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 09/11] fast-export: allow seeding the anonymized mapping Jeff King
                         ` (3 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

The anonymize_str() function takes a generator callback, but there's no
way to pass extra context to it. Let's add the usual "void *data"
parameter to the generator interface and pass it along.

This is mildly annoying for existing callers, all of which pass NULL,
but is necessary to avoid extra globals in some cases we'll add in a
subsequent patch.

While we're touching each of these callbacks, we can further observe
that none of them use the existing orig/len parameters at all. This
makes sense, since the point is for their output to have no discernable
basis in the original (my original version had some notion that we might
use a one-way function to obfuscate the names, but it was never
implemented). So let's drop those extra parameters. If a caller really
wants to do something with them, it can pass a struct through the new
data parameter.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 16a1563e49..1cbca5b4b4 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -155,8 +155,9 @@ static int anonymized_entry_cmp(const void *unused_cmp_data,
  * is farmed out to the generate function.
  */
 static const char *anonymize_str(struct hashmap *map,
-				 char *(*generate)(const char *, size_t),
-				 const char *orig, size_t len)
+				 char *(*generate)(void *),
+				 const char *orig, size_t len,
+				 void *data)
 {
 	struct anonymized_entry_key key;
 	struct anonymized_entry *ret;
@@ -172,7 +173,7 @@ static const char *anonymize_str(struct hashmap *map,
 	if (!ret) {
 		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
-		ret->anon = generate(orig, len);
+		ret->anon = generate(data);
 		hashmap_put(map, &ret->hash);
 	}
 
@@ -187,12 +188,12 @@ static const char *anonymize_str(struct hashmap *map,
  */
 static void anonymize_path(struct strbuf *out, const char *path,
 			   struct hashmap *map,
-			   char *(*generate)(const char *, size_t))
+			   char *(*generate)(void *))
 {
 	while (*path) {
 		const char *end_of_component = strchrnul(path, '/');
 		size_t len = end_of_component - path;
-		const char *c = anonymize_str(map, generate, path, len);
+		const char *c = anonymize_str(map, generate, path, len, NULL);
 		strbuf_addstr(out, c);
 		path = end_of_component;
 		if (*path)
@@ -367,7 +368,7 @@ static void print_path_1(const char *path)
 		printf("%s", path);
 }
 
-static char *anonymize_path_component(const char *path, size_t len)
+static char *anonymize_path_component(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -389,7 +390,7 @@ static void print_path(const char *path)
 	}
 }
 
-static char *generate_fake_oid(const char *old, size_t len)
+static char *generate_fake_oid(void *data)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
@@ -405,7 +406,7 @@ static const char *anonymize_oid(const char *oid_hex)
 {
 	static struct hashmap objs;
 	size_t len = strlen(oid_hex);
-	return anonymize_str(&objs, generate_fake_oid, oid_hex, len);
+	return anonymize_str(&objs, generate_fake_oid, oid_hex, len, NULL);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -502,7 +503,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
-static char *anonymize_ref_component(const char *old, size_t len)
+static char *anonymize_ref_component(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -555,7 +556,7 @@ static char *anonymize_commit_message(const char *old)
 	return xstrfmt("subject %d\n\nbody\n", counter++);
 }
 
-static char *anonymize_ident(const char *old, size_t len)
+static char *anonymize_ident(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -598,7 +599,7 @@ static void anonymize_ident_line(const char **beg, const char **end)
 
 		len = split.mail_end - split.name_begin;
 		ident = anonymize_str(&idents, anonymize_ident,
-				      split.name_begin, len);
+				      split.name_begin, len, NULL);
 		strbuf_addstr(out, ident);
 		strbuf_addch(out, ' ');
 		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
@@ -739,7 +740,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	show_progress();
 }
 
-static char *anonymize_tag(const char *old, size_t len)
+static char *anonymize_tag(void *data)
 {
 	static int counter;
 	struct strbuf out = STRBUF_INIT;
@@ -815,7 +816,7 @@ static void handle_tag(const char *name, struct tag *tag)
 		if (message) {
 			static struct hashmap tags;
 			message = anonymize_str(&tags, anonymize_tag,
-						message, message_size);
+						message, message_size, NULL);
 		}
 	}
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 09/11] fast-export: allow seeding the anonymized mapping
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (7 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 08/11] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 10/11] fast-export: anonymize "master" refname Jeff King
                         ` (2 subsequent siblings)
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

After you anonymize a repository, it can be hard to find which commits
correspond between the original and the result, and thus hard to
reproduce commands that triggered bugs in the original.

Let's make it possible to seed the anonymization map. This lets users
either:

  - mark names to be retained as-is, if they don't consider them secret
    (in which case their original commands would just work)

  - map names to new values, which lets them adapt the reproduction
    recipe to the new names without revealing the originals

The implementation is fairly straight-forward. We already store each
anonymized token in a hashmap (so that the same token appearing twice is
converted to the same result). We can just introduce a new "seed"
hashmap which is consulted first.

This does make a few more promises to the user about how we'll anonymize
things (e.g., token-splitting pathnames). But it's unlikely that we'd
want to change those rules, even if the actual anonymization of a single
token changes. And it makes things much easier for the user, who can
unblind only a directory name without having to specify each path within
it.

One alternative to this approach would be to anonymize as we see fit,
and then dump the whole refname and pathname mappings to a file. This
does work, but it's a bit awkward to use (you have to manually dig the
items you care about out of the mapping).

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 29 ++++++++++++++++++
 builtin/fast-export.c             | 50 ++++++++++++++++++++++++++++++-
 t/t9351-fast-export-anonymize.sh  | 11 ++++++-
 3 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index e8950de3ba..1978dbdc6a 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,6 +119,11 @@ by keeping the marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--anonymize-map=<from>[:<to>]::
+	Convert token `<from>` to `<to>` in the anonymized output. If
+	`<to>` is omitted, map `<from>` to itself (i.e., do not
+	anonymize it). See the section on `ANONYMIZING` below.
+
 --reference-excluded-parents::
 	By default, running a command such as `git fast-export
 	master~5..master` will not include the commit master{tilde}5
@@ -238,6 +243,30 @@ collapse "User 0", "User 1", etc into "User X"). This produces a much
 smaller output, and it is usually easy to quickly confirm that there is
 no private data in the stream.
 
+Reproducing some bugs may require referencing particular commits or
+paths, which becomes challenging after refnames and paths have been
+anonymized. You can ask for a particular token to be left as-is or
+mapped to a new value. For example, if you have a bug which reproduces
+with `git rev-list sensitive -- secret.c`, you can run:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all \
+      --anonymize-map=sensitive:foo \
+      --anonymize-map=secret.c:bar.c \
+      >stream
+---------------------------------------------------
+
+After importing the stream, you can then run `git rev-list foo -- bar.c`
+in the anonymized repository.
+
+Note that paths and refnames are split into tokens at slash boundaries.
+The command above would anonymize `subdir/secret.c` as something like
+`path123/bar.c`; you could then search for `bar.c` in the anonymized
+repository to determine the final pathname.
+
+To make referencing the final pathname simpler, you can map each path
+component; so if you also anonymize `subdir` to `publicdir`, then the
+final pathname would be `publicdir/bar.c`.
 
 LIMITATIONS
 -----------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 1cbca5b4b4..b0b09bca30 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -45,6 +45,7 @@ static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
+static struct hashmap anonymized_seeds;
 static struct revision_sources revision_sources;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
@@ -168,8 +169,18 @@ static const char *anonymize_str(struct hashmap *map,
 	hashmap_entry_init(&key.hash, memhash(orig, len));
 	key.orig = orig;
 	key.orig_len = len;
-	ret = hashmap_get_entry(map, &key, hash, &key);
 
+	/* First check if it's a token the user configured manually... */
+	if (anonymized_seeds.cmpfn)
+		ret = hashmap_get_entry(&anonymized_seeds, &key, hash, &key);
+	else
+		ret = NULL;
+
+	/* ...otherwise check if we've already seen it in this context... */
+	if (!ret)
+		ret = hashmap_get_entry(map, &key, hash, &key);
+
+	/* ...and finally generate a new mapping if necessary */
 	if (!ret) {
 		FLEX_ALLOC_MEM(ret, orig, orig, len);
 		hashmap_entry_init(&ret->hash, key.hash.hash);
@@ -1147,6 +1158,37 @@ static void handle_deletes(void)
 	}
 }
 
+static char *anonymize_seed(void *data)
+{
+	return xstrdup(data);
+}
+
+static int parse_opt_anonymize_map(const struct option *opt,
+				   const char *arg, int unset)
+{
+	struct hashmap *map = opt->value;
+	const char *delim, *value;
+	size_t keylen;
+
+	BUG_ON_OPT_NEG(unset);
+
+	delim = strchr(arg, ':');
+	if (delim) {
+		keylen = delim - arg;
+		value = delim + 1;
+	} else {
+		keylen = strlen(arg);
+		value = arg;
+	}
+
+	if (!keylen || !*value)
+		return error(_("--anonymize-map token cannot be empty"));
+
+	anonymize_str(map, anonymize_seed, arg, keylen, (void *)value);
+
+	return 0;
+}
+
 int cmd_fast_export(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
@@ -1188,6 +1230,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_CALLBACK_F(0, "anonymize-map", &anonymized_seeds, N_("from:to"),
+			       N_("convert <from> to <to> in anonymized output"),
+			       PARSE_OPT_NONEG, parse_opt_anonymize_map),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
@@ -1215,6 +1260,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 	if (argc > 1)
 		usage_with_options (fast_export_usage, options);
 
+	if (anonymized_seeds.cmpfn && !anonymize)
+		die(_("--anonymize-map without --anonymize does not make sense"));
+
 	if (refspecs_list.nr) {
 		int i;
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index dc5d75cd19..5a21c71568 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -6,6 +6,7 @@ test_description='basic tests for fast-export --anonymize'
 test_expect_success 'setup simple repo' '
 	test_commit base &&
 	test_commit foo &&
+	test_commit retain-me &&
 	git checkout -b other HEAD^ &&
 	mkdir subdir &&
 	test_commit subdir/bar &&
@@ -18,7 +19,10 @@ test_expect_success 'setup simple repo' '
 '
 
 test_expect_success 'export anonymized stream' '
-	git fast-export --anonymize --all >stream
+	git fast-export --anonymize --all \
+		--anonymize-map=retain-me \
+		--anonymize-map=xyzzy:custom-name \
+		>stream
 '
 
 # this also covers commit messages
@@ -30,6 +34,11 @@ test_expect_success 'stream omits path names' '
 	! grep xyzzy stream
 '
 
+test_expect_success 'stream contains user-specified names' '
+	grep retain-me stream &&
+	grep custom-name stream
+'
+
 test_expect_success 'stream omits gitlink oids' '
 	# avoid relying on the whole oid to remain hash-agnostic; this is
 	# plenty to be unique within our test case
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 10/11] fast-export: anonymize "master" refname
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (8 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 09/11] fast-export: allow seeding the anonymized mapping Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 19:48       ` [PATCH v2 11/11] fast-export: use local array to store anonymized oid Jeff King
  2020-06-25 21:22       ` [PATCH v2 0/11] fast-export: allow seeding the anonymized mapping Junio C Hamano
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Running "fast-export --anonymize" will leave "refs/heads/master"
untouched in the output, for two reasons:

  - it helped to have some known reference point between the original
    and anonymized repository

  - since it's historically the default branch name, it doesn't leak any
    information

Now that we can ask fast-export to retain particular tokens, we have a
much better tool for the first one (because it works for any ref, not
just master).

For the second, the notion of "default branch name" is likely to become
configurable soon, at which point the name _does_ leak information.
Let's drop this special case in preparation.

Note that we have to adjust the test a bit, since it relied on using the
name "master" in the anonymized repos. We could just use
--anonymize-map=master to keep the same output, but then we wouldn't
know if it works because of our hard-coded master or because of the
explicit map.

So let's flip the test a bit, and confirm that we anonymize "master",
but keep "other" in the output.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c            |  7 -------
 t/t9351-fast-export-anonymize.sh | 12 +++++++-----
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index b0b09bca30..c6ecf404d7 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -538,13 +538,6 @@ static const char *anonymize_refname(const char *refname)
 	static struct strbuf anon = STRBUF_INIT;
 	int i;
 
-	/*
-	 * We also leave "master" as a special case, since it does not reveal
-	 * anything interesting.
-	 */
-	if (!strcmp(refname, "refs/heads/master"))
-		return refname;
-
 	strbuf_reset(&anon);
 	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
 		if (skip_prefix(refname, prefixes[i], &refname)) {
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index 5a21c71568..5ac2c3b5ee 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -22,6 +22,7 @@ test_expect_success 'export anonymized stream' '
 	git fast-export --anonymize --all \
 		--anonymize-map=retain-me \
 		--anonymize-map=xyzzy:custom-name \
+		--anonymize-map=other \
 		>stream
 '
 
@@ -45,12 +46,12 @@ test_expect_success 'stream omits gitlink oids' '
 	! grep a000000000000000000 stream
 '
 
-test_expect_success 'stream allows master as refname' '
-	grep master stream
+test_expect_success 'stream retains other as refname' '
+	grep other stream
 '
 
 test_expect_success 'stream omits other refnames' '
-	! grep other stream &&
+	! grep master stream &&
 	! grep mytag stream
 '
 
@@ -76,15 +77,16 @@ test_expect_success 'import stream to new repository' '
 test_expect_success 'result has two branches' '
 	git for-each-ref --format="%(refname)" refs/heads >branches &&
 	test_line_count = 2 branches &&
-	other_branch=$(grep -v refs/heads/master branches)
+	other_branch=refs/heads/other &&
+	main_branch=$(grep -v $other_branch branches)
 '
 
 test_expect_success 'repo has original shape and timestamps' '
 	shape () {
 		git log --format="%m %ct" --left-right --boundary "$@"
 	} &&
 	(cd .. && shape master...other) >expect &&
-	shape master...$other_branch >actual &&
+	shape $main_branch...$other_branch >actual &&
 	test_cmp expect actual
 '
 
-- 
2.27.0.593.gb3082a2aaf


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 11/11] fast-export: use local array to store anonymized oid
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (9 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 10/11] fast-export: anonymize "master" refname Jeff King
@ 2020-06-25 19:48       ` Jeff King
  2020-06-25 21:22       ` [PATCH v2 0/11] fast-export: allow seeding the anonymized mapping Junio C Hamano
  11 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-25 19:48 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Johannes Schindelin, SZEDER Gábor

Some older versions of gcc complain about this line:

  builtin/fast-export.c:412:2: error: dereferencing type-punned pointer
       will break strict-aliasing rules [-Werror=strict-aliasing]
    put_be32(oid.hash + hashsz - 4, counter++);
    ^

This seems to be a false positive, as there's no type-punning at all
here. oid.hash is an array of unsigned char; when we pass it to a
function it decays to a pointer to unsigned char. We do take a void
pointer in put_be32(), but it's immediately aliased with another pointer
to unsigned char (and clearly the compiler is looking inside the inlined
put_be32(), since the warning doesn't happen with -O0).

This happens on gcc 4.8 and 4.9, but not later versions (I tested gcc 6,
7, 8, and 9).

We can work around it by using a local array instead of an object_id
struct. This is a little more intimate with the details of object_id,
but for whatever reason doesn't seem to trigger the compiler warning.
We can revert this patch once we decide that those gcc versions are too
old to care about for a warning like this (gcc 4.8 is the default
compiler for Ubuntu Trusty, which is out-of-support but not fully
end-of-life'd until April 2022).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index c6ecf404d7..9f37895d4c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -405,12 +405,12 @@ static char *generate_fake_oid(void *data)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
-	struct object_id oid;
+	unsigned char out[GIT_MAX_RAWSZ];
 	char *hex = xmallocz(GIT_MAX_HEXSZ);
 
-	oidclr(&oid);
-	put_be32(oid.hash + hashsz - 4, counter++);
-	return oid_to_hex_r(hex, &oid);
+	hashclr(out);
+	put_be32(out + hashsz - 4, counter++);
+	return hash_to_hex_algop_r(hex, out, the_hash_algo);
 }
 
 static const char *anonymize_oid(const char *oid_hex)
-- 
2.27.0.593.gb3082a2aaf

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-25 15:49           ` Jeff King
@ 2020-06-25 20:45             ` SZEDER Gábor
  2020-06-25 21:15               ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: SZEDER Gábor @ 2020-06-25 20:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Thu, Jun 25, 2020 at 11:49:48AM -0400, Jeff King wrote:
> On Wed, Jun 24, 2020 at 11:54:20AM -0400, Jeff King wrote:
> 
> > > GCC 4.8 complains about this change in our CI job:
> > > 
> > >   $ make CC=gcc-4.8 builtin/fast-export.o
> > >       CC builtin/fast-export.o
> > >   builtin/fast-export.c: In function ‘generate_fake_oid’:
> > >   builtin/fast-export.c:394:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
> > >     put_be32(oid.hash + hashsz - 4, counter++);
> > >     ^
> > >   cc1: all warnings being treated as errors


> So this really seems like a pointless false positive from the compiler,
> and it's a rather old compiler (the warning no longer triggers in gcc 6
> and up). Is it worth caring about? Ubuntu Trusty is out of standard
> support but not fully EOL'd until 2022. Debian jessie has gcc 4.9
> which triggers the warning, but will hit EOL in 5 days. If it were an
> actual breakage I'd be more concerned, but keeping such an old compiler
> -Werror clean doesn't seem that important.
> 
> I'd note also that none of the Actions CI jobs run with this compiler
> version. If we _do_ want to care about it, it might be worth covering it
> there.

C99 style 'for' loop initial declarations are still frowned upon in
Git's codebase, and as far as we know it GCC 4.8 is the the most
recent compiler that can reasonably detect it; see fb9d7431cf
(travis-ci: build with GCC 4.8 as well, 2019-07-18).


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-25 20:45             ` SZEDER Gábor
@ 2020-06-25 21:15               ` Jeff King
  2020-06-29 13:17                 ` Johannes Schindelin
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2020-06-25 21:15 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Eric Sunshine, Junio C Hamano, Johannes Schindelin

On Thu, Jun 25, 2020 at 10:45:32PM +0200, SZEDER Gábor wrote:

> > So this really seems like a pointless false positive from the compiler,
> > and it's a rather old compiler (the warning no longer triggers in gcc 6
> > and up). Is it worth caring about? Ubuntu Trusty is out of standard
> > support but not fully EOL'd until 2022. Debian jessie has gcc 4.9
> > which triggers the warning, but will hit EOL in 5 days. If it were an
> > actual breakage I'd be more concerned, but keeping such an old compiler
> > -Werror clean doesn't seem that important.
> > 
> > I'd note also that none of the Actions CI jobs run with this compiler
> > version. If we _do_ want to care about it, it might be worth covering it
> > there.
> 
> C99 style 'for' loop initial declarations are still frowned upon in
> Git's codebase, and as far as we know it GCC 4.8 is the the most
> recent compiler that can reasonably detect it; see fb9d7431cf
> (travis-ci: build with GCC 4.8 as well, 2019-07-18).

TBH, that does not seem like that compelling a reason to me to keep it
around. If no compiler is complaining of C99 for-loop declarations, then
maybe we should consider loosening our style. Or are we trying to be
kind of some unknown set of platform-specific compilers that we can't
feasibly hit in our CI?

I also note that fb9d7431cf mentions that -std=c90 doesn't work, because
there are other spots where we violate the standard (looks like "inline"
is a big one). But gcc with -std=gnu89 seems to compile just fine for
me, and does notice for-loop declarations. That's obviously totally
unportable, but it would be sufficient for a CI test.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 0/11] fast-export: allow seeding the anonymized mapping
  2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
                         ` (10 preceding siblings ...)
  2020-06-25 19:48       ` [PATCH v2 11/11] fast-export: use local array to store anonymized oid Jeff King
@ 2020-06-25 21:22       ` Junio C Hamano
  11 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2020-06-25 21:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine, Johannes Schindelin, SZEDER Gábor

Jeff King <peff@peff.net> writes:

>   [09/11]: fast-export: allow seeding the anonymized mapping
>
>     - the option is now called "--anonymize-map", and the word "seed"
>       isn't used in any user-facing documentation
>
>     - incorporated Eric's documentation suggestions
>
>     - --anonymize-map without --anonymize now triggers an error

The changes compared to the v1 here looked sensible.

>   [10/11]: fast-export: anonymize "master" refname
>
>     - minor adjustments to handle change of option name
>
>   [11/11]: fast-export: use local array to store anonymized oid
>
>     - new in this iteration; this address the gcc-4 warning that Gábor
>       mentioned. I prepared it on top since I think we'd eventually want
>       to revert it once we decide that compiler is too old (and I'd be
>       perfectly fine to declare that it's so now, and just never apply
>       it at all).

Sounds like a sensible way.

Let's merge it to 'next' soonish.

Thanks, all.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-25 21:15               ` Jeff King
@ 2020-06-29 13:17                 ` Johannes Schindelin
  2020-06-30 19:35                   ` Jeff King
  0 siblings, 1 reply; 64+ messages in thread
From: Johannes Schindelin @ 2020-06-29 13:17 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, git, Eric Sunshine, Junio C Hamano


[-- Attachment #1: Type: text/plain, Size: 1899 bytes --]

Hi Peff & Gábor,

On Thu, 25 Jun 2020, Jeff King wrote:

> On Thu, Jun 25, 2020 at 10:45:32PM +0200, SZEDER Gábor wrote:
>
> > > So this really seems like a pointless false positive from the compiler,
> > > and it's a rather old compiler (the warning no longer triggers in gcc 6
> > > and up). Is it worth caring about? Ubuntu Trusty is out of standard
> > > support but not fully EOL'd until 2022. Debian jessie has gcc 4.9
> > > which triggers the warning, but will hit EOL in 5 days. If it were an
> > > actual breakage I'd be more concerned, but keeping such an old compiler
> > > -Werror clean doesn't seem that important.
> > >
> > > I'd note also that none of the Actions CI jobs run with this compiler
> > > version. If we _do_ want to care about it, it might be worth covering it
> > > there.
> >
> > C99 style 'for' loop initial declarations are still frowned upon in
> > Git's codebase, and as far as we know it GCC 4.8 is the the most
> > recent compiler that can reasonably detect it; see fb9d7431cf
> > (travis-ci: build with GCC 4.8 as well, 2019-07-18).
>
> TBH, that does not seem like that compelling a reason to me to keep it
> around. If no compiler is complaining of C99 for-loop declarations, then
> maybe we should consider loosening our style. Or are we trying to be
> kind of some unknown set of platform-specific compilers that we can't
> feasibly hit in our CI?

FWIW _iff_ we decide to loosen our style, I would like to propose doing it
in one place first, and keep it that way for two or three major versions.

Traditionally, people stuck on platforms such as IRIX or HP/UX with
proprietary C compilers (and remember: I once was one of those people)
often lack the luxury of upgrading frequently. And if it turns out that we
want to revert the style-loosening, it is much easier to do it in one
place than in many.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] fast-export: store anonymized oids as hex strings
  2020-06-29 13:17                 ` Johannes Schindelin
@ 2020-06-30 19:35                   ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2020-06-30 19:35 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: SZEDER Gábor, git, Eric Sunshine, Junio C Hamano

On Mon, Jun 29, 2020 at 03:17:00PM +0200, Johannes Schindelin wrote:

> > TBH, that does not seem like that compelling a reason to me to keep it
> > around. If no compiler is complaining of C99 for-loop declarations, then
> > maybe we should consider loosening our style. Or are we trying to be
> > kind of some unknown set of platform-specific compilers that we can't
> > feasibly hit in our CI?
> 
> FWIW _iff_ we decide to loosen our style, I would like to propose doing it
> in one place first, and keep it that way for two or three major versions.
> 
> Traditionally, people stuck on platforms such as IRIX or HP/UX with
> proprietary C compilers (and remember: I once was one of those people)
> often lack the luxury of upgrading frequently. And if it turns out that we
> want to revert the style-loosening, it is much easier to do it in one
> place than in many.

Yeah, I definitely agree that would be the way to do it. I admit I don't
even _really_ care that much about allowing the feature. More that it
might not be worth trying to crack down on it so aggressively, and
just letting it get picked up by review (or if it slips through, letting
the poor souls stuck on HP/UX complain).

(And I say "worth" because now the use of gcc 4.8 as the checking tool
has demonstrably cost a bunch of human time, so it comes with a cost).

-Peff


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, back to index

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 13:23 [PATCH 0/3] fast-export: allow dumping anonymization mappings Jeff King
2020-06-19 13:25 ` [PATCH 1/3] fast-export: allow dumping the refname mapping Jeff King
2020-06-19 15:51   ` Eric Sunshine
2020-06-19 16:01     ` Jeff King
2020-06-19 16:18       ` Eric Sunshine
2020-06-19 17:45         ` Jeff King
2020-06-19 18:00           ` Eric Sunshine
2020-06-22 21:30             ` Jeff King
2020-06-19 19:20         ` Junio C Hamano
2020-06-22 21:32           ` Jeff King
2020-06-19 13:26 ` [PATCH 2/3] fast-export: anonymize "master" refname Jeff King
2020-06-19 13:29 ` [PATCH 3/3] fast-export: allow dumping the path mapping Jeff King
2020-06-19 16:00   ` Eric Sunshine
2020-06-19 19:24   ` Junio C Hamano
2020-06-22 21:38     ` Jeff King
2020-06-19 13:51 ` [PATCH 0/3] fast-export: allow dumping anonymization mappings Johannes Schindelin
2020-06-22 16:35   ` Junio C Hamano
2020-06-22 21:47 ` [PATCH v2 0/4] " Jeff King
2020-06-22 21:47   ` [PATCH v2 1/4] fast-export: allow dumping the refname mapping Jeff King
2020-06-22 21:48   ` [PATCH v2 2/4] fast-export: anonymize "master" refname Jeff King
2020-06-22 21:48   ` [PATCH v2 3/4] fast-export: refactor path printing to not rely on stdout Jeff King
2020-06-22 21:48   ` [PATCH v2 4/4] fast-export: allow dumping the path mapping Jeff King
2020-06-23 15:24   ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Jeff King
2020-06-23 15:24     ` [PATCH 01/10] t9351: derive anonymized tree checks from original repo Jeff King
2020-06-23 15:24     ` [PATCH 02/10] fast-export: use xmemdupz() for anonymizing oids Jeff King
2020-06-23 15:24     ` [PATCH 03/10] fast-export: store anonymized oids as hex strings Jeff King
2020-06-24 11:43       ` SZEDER Gábor
2020-06-24 15:54         ` Jeff King
2020-06-25 15:49           ` Jeff King
2020-06-25 20:45             ` SZEDER Gábor
2020-06-25 21:15               ` Jeff King
2020-06-29 13:17                 ` Johannes Schindelin
2020-06-30 19:35                   ` Jeff King
2020-06-23 15:24     ` [PATCH 04/10] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
2020-06-23 15:24     ` [PATCH 05/10] fast-export: stop storing lengths in anonymized hashmaps Jeff King
2020-06-23 15:24     ` [PATCH 06/10] fast-export: use a flex array to store anonymized entries Jeff King
2020-06-23 15:25     ` [PATCH 07/10] fast-export: move global "idents" anonymize hashmap into function Jeff King
2020-06-23 15:25     ` [PATCH 08/10] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
2020-06-24 19:58       ` Junio C Hamano
2020-06-23 15:25     ` [PATCH 09/10] fast-export: allow seeding the anonymized mapping Jeff King
2020-06-23 17:16       ` Eric Sunshine
2020-06-23 18:30         ` Jeff King
2020-06-23 20:30           ` Eric Sunshine
2020-06-24 15:47             ` Jeff King
2020-06-23 18:11       ` Eric Sunshine
2020-06-23 18:35         ` Jeff King
2020-06-23 20:35           ` Eric Sunshine
2020-06-24 15:48             ` Jeff King
2020-06-23 15:25     ` [PATCH 10/10] fast-export: anonymize "master" refname Jeff King
2020-06-23 19:34     ` [alternative 0/10] fast-export: allow seeding the anonymized mapping Junio C Hamano
2020-06-23 19:44       ` Jeff King
2020-06-25 19:48     ` [PATCH v2 0/11] " Jeff King
2020-06-25 19:48       ` [PATCH v2 01/11] t9351: derive anonymized tree checks from original repo Jeff King
2020-06-25 19:48       ` [PATCH v2 02/11] fast-export: use xmemdupz() for anonymizing oids Jeff King
2020-06-25 19:48       ` [PATCH v2 03/11] fast-export: store anonymized oids as hex strings Jeff King
2020-06-25 19:48       ` [PATCH v2 04/11] fast-export: tighten anonymize_mem() interface to handle only strings Jeff King
2020-06-25 19:48       ` [PATCH v2 05/11] fast-export: stop storing lengths in anonymized hashmaps Jeff King
2020-06-25 19:48       ` [PATCH v2 06/11] fast-export: use a flex array to store anonymized entries Jeff King
2020-06-25 19:48       ` [PATCH v2 07/11] fast-export: move global "idents" anonymize hashmap into function Jeff King
2020-06-25 19:48       ` [PATCH v2 08/11] fast-export: add a "data" callback parameter to anonymize_str() Jeff King
2020-06-25 19:48       ` [PATCH v2 09/11] fast-export: allow seeding the anonymized mapping Jeff King
2020-06-25 19:48       ` [PATCH v2 10/11] fast-export: anonymize "master" refname Jeff King
2020-06-25 19:48       ` [PATCH v2 11/11] fast-export: use local array to store anonymized oid Jeff King
2020-06-25 21:22       ` [PATCH v2 0/11] fast-export: allow seeding the anonymized mapping Junio C Hamano

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git