From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Eric Sunshine <sunshine@sunshineco.com>,
	Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: [PATCH v2 0/4] fast-export: allow dumping anonymization mappings
Date: Mon, 22 Jun 2020 17:47:45 -0400
Message-ID: <20200622214745.GA3302779@coredump.intra.peff.net>
In-Reply-To: <20200619132304.GA2540657@coredump.intra.peff.net>

On Fri, Jun 19, 2020 at 09:23:04AM -0400, Jeff King wrote:

> This series gives an alternate way to achieve the same effect, but much
> better in that it works for _any_ ref (so if you are trying to reproduce
> the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
> easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
> reproduced in the anonymized repo.

Here's a v2 which I think addresses all of the comments. I have to admit
that after writing my last email to Junio, I am wondering whether it
would be sufficient and simpler to let the user specify a static mapping
of tokens (that could just be applied anywhere).

I'll take a look at that, but since I worked up this version, here it is
in the meantime.

The interesting changes are:

  - path output is now quoted, making it unambiguous. The intent is for
    humans to look at it, but it's not much extra work to make it
    machine readable, too.

  - the path dumping was in the wrong spot. It was happening in the
    generic function that's used for "path-like" things, including
    refnames. So the path mapping dump had extra cruft in it.

  - got rid of the maybe_dump_anon() helper

  - tests now avoid hard-coding expected counts

  - the path-dump test now checks the expected count
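
To illustrate why the quoting makes the path dump unambiguous, here is a
minimal sketch (not part of the series) that pulls the original path out
of one mapping line. The helper name orig_field and the sample
anonymized names are made up for the example, and it assumes the quoted
form contains no backslash escapes such as \", which a real parser of
C-style quoting would have to handle.

```shell
#!/bin/sh
# Hypothetical helper: print the original-path field of one line of the
# path mapping ("<original> <anonymized>").
# Assumption: a quoted original contains no escaped characters.
orig_field () {
	case "$1" in
	'"'*)
		# quoted: the field is whatever sits between the first
		# pair of double quotes, spaces included
		rest=${1#\"}
		orig=${rest%%\"*}
		;;
	*)
		# unquoted: the field ends at the first space
		orig=${1%% *}
		;;
	esac
	printf '%s\n' "$orig"
}

# sample lines; the anonymized halves are invented for illustration
orig_field '"subdir/this needs quoting" path0/path1'
# prints: subdir/this needs quoting
orig_field 'foo path2'
# prints: foo
```

Without the quotes, the first line could not be split reliably, since
the original path itself contains spaces.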

  [1/4]: fast-export: allow dumping the refname mapping
  [2/4]: fast-export: anonymize "master" refname
  [3/4]: fast-export: refactor path printing to not rely on stdout
  [4/4]: fast-export: allow dumping the path mapping

 Documentation/git-fast-export.txt | 34 +++++++++++++++
 builtin/fast-export.c             | 69 +++++++++++++++++++++++++------
 t/t9351-fast-export-anonymize.sh  | 44 ++++++++++++++++----
 3 files changed, 125 insertions(+), 22 deletions(-)

Range-diff from v1:

1:  82a17ae976 ! 1:  7ba5582d66 fast-export: allow dumping the refname mapping
    @@ builtin/fast-export.c: static int has_unshown_parent(struct commit *commit)
     +	kh_put_strset(seen->set, xstrdup(str), &hashret);
     +	return 0;
     +}
    -+
    -+static void maybe_dump_anon(FILE *out, struct seen_set *seen,
    -+			    const char *orig, const char *anon)
    -+{
    -+	if (!out)
    -+		return;
    -+	if (!check_and_mark_seen(seen, orig))
    -+		fprintf(out, "%s %s\n", orig, anon);
    -+}
     +
      struct anonymized_entry {
      	struct hashmap_entry hash;
    @@ builtin/fast-export.c: static const char *anonymize_refname(const char *refname)
      	}
      
      	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
    -+	maybe_dump_anon(anonymized_refnames_handle, &seen,
    ++
    ++	if (anonymized_refnames_handle &&
    ++	    !check_and_mark_seen(&seen, full_refname))
    ++		fprintf(anonymized_refnames_handle, "%s %s\n",
     +			full_refname, anon.buf);
    ++
      	return anon.buf;
      }
      
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits tag message'
     +	# we make no guarantees of the exact anonymized names,
     +	# so just check that we have the right number and
     +	# that a sample line looks sane.
    ++	expected_count=$(git for-each-ref | wc -l) &&
     +	# Note that master is not anonymized, and so not included
     +	# in the mapping.
    -+	test_line_count = 6 refs.out &&
    ++	expected_count=$((expected_count - 1)) &&
    ++	test_line_count = $expected_count refs.out &&
     +	grep "^refs/heads/other refs/heads/" refs.out
     +'
     +
2:  be56b375cc ! 2:  d88f7c83a5 fast-export: anonymize "master" refname
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names'
      	! grep mytag stream
      '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
    - 	# we make no guarantees of the exact anonymized names,
      	# so just check that we have the right number and
      	# that a sample line looks sane.
    + 	expected_count=$(git for-each-ref | wc -l) &&
     -	# Note that master is not anonymized, and so not included
     -	# in the mapping.
    --	test_line_count = 6 refs.out &&
    -+	test_line_count = 7 refs.out &&
    +-	expected_count=$((expected_count - 1)) &&
    + 	test_line_count = $expected_count refs.out &&
      	grep "^refs/heads/other refs/heads/" refs.out
      '
    - 
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'import stream to new repository' '
      test_expect_success 'result has two branches' '
      	git for-each-ref --format="%(refname)" refs/heads >branches &&
-:  ---------- > 3:  164f1e1eab fast-export: refactor path printing to not rely on stdout
3:  a4e9f1f2ac ! 4:  b0aa59f07e fast-export: allow dumping the path mapping
    @@ Commit message
     
         We recently taught fast-export to dump the refname mapping. Let's do the
         same thing for paths, which can reuse most of the same infrastructure.
    -    Note that the output format isn't unambiguous here (because paths could
    -    contain spaces). That's OK because this is meant to be examined by a
    -    human.
     
         We could also just introduce a "dump mapping" file that shows every
         mapping we make. But it would be a bit more awkward to work with, as the
    @@ Documentation/git-fast-export.txt: by keeping the marks the same across runs.
     +	Output the mapping of real paths to anonymized paths to <file>.
     +	The output will contain one line per path that appears in the
     +	output stream, with the original path, a space, and its
    -+	anonymized counterpart. See the section on `ANONYMIZING` below.
    ++	anonymized counterpart. Paths may be quoted if they contain a
    ++	space, or unusual characters; see `core.quotePath` in
    ++	linkgit:git-config(1). See also `ANONYMIZING` below.
     +
      --reference-excluded-parents::
      	By default, running a command such as `git fast-export
    @@ builtin/fast-export.c: static struct string_list tag_refs = STRING_LIST_INIT_NOD
      static struct revision_sources revision_sources;
      
      static int parse_opt_signed_tag_mode(const struct option *opt,
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 			   struct hashmap *map,
    - 			   void *(*generate)(const void *, size_t *))
    - {
    -+	static struct seen_set seen;
    -+	const char *full_path = path;
    +@@ builtin/fast-export.c: static void print_path(const char *path)
    + 		print_path_1(stdout, path);
    + 	else {
    + 		static struct hashmap paths;
    ++		static struct seen_set seen;
    + 		static struct strbuf anon = STRBUF_INIT;
    + 
    + 		anonymize_path(&anon, path, &paths, anonymize_path_component);
    ++		if (anonymized_paths_handle &&
    ++		    !check_and_mark_seen(&seen, path)) {
    ++			print_path_1(anonymized_paths_handle, path);
    ++			fputc(' ', anonymized_paths_handle);
    ++			print_path_1(anonymized_paths_handle, anon.buf);
    ++			fputc('\n', anonymized_paths_handle);
    ++		}
     +
    - 	while (*path) {
    - 		const char *end_of_component = strchrnul(path, '/');
    - 		size_t len = end_of_component - path;
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 		if (*path)
    - 			strbuf_addch(out, *path++);
    + 		print_path_1(stdout, anon.buf);
    + 		strbuf_reset(&anon);
      	}
    -+
    -+	maybe_dump_anon(anonymized_paths_handle, &seen, full_path, out->buf);
    - }
    - 
    - static inline void *mark_to_ptr(uint32_t mark)
     @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const char *prefix)
      	     *import_filename = NULL,
      	     *import_filename_if_exists = NULL;
    @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const ch
      		printf("feature done\n");
     
      ## t/t9351-fast-export-anonymize.sh ##
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'setup simple repo' '
    + 	git checkout -b other HEAD^ &&
    + 	mkdir subdir &&
    + 	test_commit subdir/bar &&
    +-	test_commit subdir/xyzzy &&
    ++	test_commit quoting "subdir/this needs quoting" &&
    + 	git tag -m "annotated tag" mytag
    + '
    + 
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names' '
    + 	! grep foo stream &&
    + 	! grep subdir stream &&
    + 	! grep bar stream &&
    +-	! grep xyzzy stream
    ++	! grep quoting stream
    + '
    + 
    + test_expect_success 'stream omits refnames' '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
      	grep "^refs/heads/other refs/heads/" refs.out
      '
      
     +test_expect_success 'path mapping can be dumped' '
     +	git fast-export --anonymize --all \
     +		--dump-anonymized-paths=paths.out >/dev/null &&
    -+	# do not assume a particular anonymization scheme or order;
    -+	# just sanity check that a sample line looks sensible.
    -+	grep "^foo " paths.out
    ++	# as above, avoid depending on the exact scheme,
    ++	# but check that we have the right number of mappings,
    ++	# and spot-check one sample.
    ++	expected_count=$(
    ++		git rev-list --objects --all |
    ++		git cat-file --batch-check="%(objecttype) %(rest)" |
    ++		sed -ne "s/^blob //p" |
    ++		sort -u |
    ++		wc -l
    ++	) &&
    ++	test_line_count = $expected_count paths.out &&
    ++	grep "^\"subdir/this needs quoting\" " paths.out
     +'
     +
      # NOTE: we chdir to the new, anonymized repository
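
The counting pipeline in the new path-dump test can be tried without a
repository. The sketch below feeds it canned input shaped like the
"%(objecttype) %(rest)" output used above; the sample lines and the
function name count_unique_blob_paths are invented for the example.

```shell
#!/bin/sh
# Same filter as in the test: keep only blob lines, strip the type
# prefix, collapse duplicate paths, and count what is left.
count_unique_blob_paths () {
	sed -ne "s/^blob //p" |
	sort -u |
	wc -l
}

# Invented sample input: commits and trees are dropped by the sed
# filter, and base.t appears twice (as it would when two commits carry
# different versions of the same file).
printf '%s\n' \
	'commit ' \
	'tree ' \
	'blob base.t' \
	'tree subdir' \
	'blob subdir/bar.t' \
	'blob base.t' |
count_unique_blob_paths
```

The "sort -u" step matters because the mapping has one line per path,
not one per blob, so a path touched by several commits must be counted
only once.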
