git.vger.kernel.org archive mirror
* git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump
@ 2020-10-14  9:22 Dipl. Ing. Sergey Brester
  2020-10-15  1:26 ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Dipl. Ing. Sergey Brester @ 2020-10-14  9:22 UTC (permalink / raw)
  To: git

Steps to reproduce the issue:

  1. export from fossil and import the dump to the git:
```
fossil export --git --import-marks .git/.fossil2git-fssl --export-marks .git/.fossil2git-fssl.tmp ^
  | git fast-import --import-marks=.git/.fossil2git-git --export-marks=.git/.fossil2git-git.tmp
```
  during the import, git-fast-import.exe keeps growing in memory (more
than I have physically, e.g. I noticed over 20GB), so finally it crashes
with:
```
fatal: Out of memory, malloc failed (tried to allocate 2097152 bytes)
fast-import: dumping crash report to .git/fast_import_crash_1800
```
  the crash report contains:
```
fast-import crash report:
  fast-import process: 1800
  parent process : 1
  at 2020-10-13 18:55:19 +0000

fatal: Out of memory, malloc failed (tried to allocate 2097152 bytes)

Most Recent Commands Before Crash
---------------------------------

Active Branch LRU
-----------------
  active_branches = 0 cur, 5 max

  pos clock name
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Inactive Branches
-----------------

Marks
-----
  exported to .git/.fossil2git-git.tmp

-------------------
END OF CRASH REPORT
```

  2. if I do the export first (redirected to a file), fossil creates a
22MB export dump file;
  if I then import it via git, it crashes in the same way.
```
fossil export --git --import-marks .git/.fossil2git-fssl --export-marks .git/.fossil2git-fssl.tmp > tmp-dump-out-of-mem.txt

dir tmp-dump-out-of-mem.txt
13.10.2020 20:37 22.916.280 tmp-dump-out-of-mem.txt

type tmp-dump-out-of-mem.txt | git fast-import --import-marks=.git/.fossil2git-git --export-marks=.git/.fossil2git-git.tmp

fatal: Out of memory, malloc failed (tried to allocate 2097152 bytes)
fast-import: dumping crash report to .git/fast_import_crash_1800
```

I did not see any issues with (even much larger) imports, before I 
upgraded git to 2.28.0 (from 2.25.1, I guess).

[System Info]
git version:
git version 2.28.0.windows.1
cpu: x86_64
built from commit: 77982caf269b7ee713a76da2bcf260c34d3bf7a7
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
uname: Windows 10.0 18363
compiler info: gnuc: 10.2
libc info: no libc information available
$SHELL (typically, interactive shell): <unset>

An attempt to repeat this with 2.27.0 (portable) changes nothing (it
crashed with the same issue).

But trying that with 2.14.4 (MinGit-2.14.4.windows.7-64-bit) works well:
```
C:\SoftDev\Git-2.14\mingw64\bin\git-fast-import.exe statistics:
---------------------------------------------------------------------
Alloc'd objects: 130000
Total objects: 591 ( 80 duplicates )
  blobs : 224 ( 0 duplicates 151 deltas of 224 attempts)
  trees : 260 ( 80 duplicates 191 deltas of 260 attempts)
  commits: 107 ( 0 duplicates 0 deltas of 0 attempts)
  tags : 0 ( 0 duplicates 0 deltas of 0 attempts)
Total branches: 201 ( 46 loads )
  marks: 1048576 ( 129437 unique )
  atoms: 1503
Memory total: 10439 KiB
  pools: 4346 KiB
  objects: 6093 KiB
---------------------------------------------------------------------
pack_report: getpagesize() = 65536
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit = 35184372088832
pack_report: pack_used_ctr = 131659
pack_report: pack_mmap_calls = 119
pack_report: pack_open_windows = 66 / 66
pack_report: pack_mapped = 488903419 / 488903419
---------------------------------------------------------------------
```

The crash with newer versions is pretty well reproducible, so I have 
frozen this state to be able to test it later (or check whether it gets 
fixed).
Don't hesitate to ping me if you need some data or tests.

-- 

Regards,
Sergey Brester

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump
  2020-10-14  9:22 git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump Dipl. Ing. Sergey Brester
@ 2020-10-15  1:26 ` Jeff King
  2020-10-15 11:50   ` Dipl. Ing. Sergey Brester
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2020-10-15  1:26 UTC (permalink / raw)
  To: serg.brester; +Cc: git

On Wed, Oct 14, 2020 at 11:22:03AM +0200, Dipl. Ing. Sergey Brester wrote:

> I did not see any issues with (even much larger) imports, before I upgraded
> git to 2.28.0 (from 2.25.1, I guess).

I wasn't able to reproduce, but it sounds like there's something special
about this repo (since the dump is not that big, and you said much
larger dumps have succeeded).

Is it possible to make either the fossil repo, or the fast-import input
file available?

If not, since you said it works with 2.14.4, is it possible to bisect in
Git to see when it stopped working? You'll need to be able to build Git
from source.

-Peff


* Re: git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump
  2020-10-15  1:26 ` Jeff King
@ 2020-10-15 11:50   ` Dipl. Ing. Sergey Brester
  2020-10-15 15:38     ` [PATCH] fast-import: fix over-allocation of marks storage Jeff King
  2020-10-15 15:52     ` git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump René Scharfe
  0 siblings, 2 replies; 20+ messages in thread
From: Dipl. Ing. Sergey Brester @ 2020-10-15 11:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hi Jeff,

well, I don't know how you were trying to reproduce it.

My first attempt with a git repository (cloned from
https://github.com/git/git.git) showed it to me immediately.
Here you go (I used git bash here):

```
# clone or copy git repository (we'll use it for export and import):
git clone https://github.com/git/git.git
cd git

# make 1st fast-export in order to generate marks (we don't need the dump, only the marks):
git fast-export --reencode=yes --export-marks=.tmp.exp-1.tmp e83c5163316f89bfbde7d9ab23ca2e25604af290..dc04167d378fb29d30e1647ff6ff51dd182bc9a3 > /dev/null

# make 2nd fast-export in order to generate a partial export dump (file ".tmp.dump" will be ca. 87MB):
git fast-export --reencode=yes --import-marks=.tmp.exp-1.tmp --export-marks=.tmp.exp-2.tmp 61addb841f2a6d74a1737a01e03df1f773e04944..master > .tmp.dump

# now try to import this dump, using the first marks as import marks (we have all revs in the git repo):
git fast-import --import-marks=.tmp.exp-1.tmp --export-marks=.tmp.imp.tmp < .tmp.dump

```

And watch how git-fast-import eats your whole memory, and enjoy the
crash :)

```
fatal: Out of memory, malloc failed (tried to allocate 2097152 bytes)
fast-import: dumping crash report to .git/fast_import_crash_6684
```

Regards,
Sergey

15.10.2020 03:26, Jeff King wrote:

> On Wed, Oct 14, 2020 at 11:22:03AM +0200, Dipl. Ing. Sergey Brester 
> wrote:
> 
>> I did not see any issues with (even much larger) imports, before I 
>> upgraded git to 2.28.0 (from 2.25.1, I guess).
> 
> I wasn't able to reproduce, but it sounds like there's something 
> special
> about this repo (since the dump is not that big, and you said much
> larger dumps have succeeded).
> 
> Is it possible to make either the fossil repo, or the fast-import input
> file available?
> 
> If not, since you said it works with 2.14.4, is it possible to bisect 
> in
> Git to see when it stopped working? You'll need to be able to build Git
> from source.
> 
> -Peff


* [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 11:50   ` Dipl. Ing. Sergey Brester
@ 2020-10-15 15:38     ` Jeff King
  2020-10-15 17:29       ` Junio C Hamano
  2020-10-15 17:57       ` René Scharfe
  2020-10-15 15:52     ` git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump René Scharfe
  1 sibling, 2 replies; 20+ messages in thread
From: Jeff King @ 2020-10-15 15:38 UTC (permalink / raw)
  To: Dipl. Ing. Sergey Brester; +Cc: brian m. carlson, git

On Thu, Oct 15, 2020 at 01:50:30PM +0200, Dipl. Ing. Sergey Brester wrote:

> well, I don't know how you were trying to reproduce it.
> 
> My first attempt with a git-repository (cloned from
> https://github.com/git/git.git) showed that immediately to me.
> Here you go (I used git bash here):
> [...]

Thanks for this recipe. The key thing I was missing was having a
reasonably large number of marks to be imported.

The problem bisects to ddddf8d7e2 (fast-import: permit reading multiple
marks files, 2020-02-22), which is in v2.27.0. The fix is below. Since
we're entering the -rc2 period for v2.29 today and this isn't a
regression since v2.28, it probably won't make it into v2.29. But it's a
pretty serious bug (I'm actually surprised it took this long for anyone
to notice, as mark importing of any decent size is basically broken),
so I hope it will make it onto the maint branch soon after release.

-- >8 --
Subject: [PATCH] fast-import: fix over-allocation of marks storage

Fast-import stores its marks in a trie-like structure made of mark_set
structs. Each struct has a fixed size (1024). If our id number is too
large to fit in the struct, then we allocate a new struct which shifts
the id number by 10 bits. Our original struct becomes a child node
of this new layer, and the new struct becomes the top level of the trie.

This scheme was broken by ddddf8d7e2 (fast-import: permit reading
multiple marks files, 2020-02-22). Before then, we had a top-level
"marks" pointer, and the push-down worked by assigning the new top-level
struct to "marks". But after that commit, insert_mark() takes a pointer
to the mark_set, rather than using the global "marks". It continued to
assign to the global "marks" variable during the push down, which was
wrong for two reasons:

  - we added a call in option_rewrite_submodules() which uses a separate
    mark set; pushing down on "marks" is outright wrong here. We'd
    corrupt the "marks" set, and we'd fail to correctly store any
    submodule mappings with an id over 1024.

  - the other callers passed "marks", but the push-down was still wrong.
    In read_mark_file(), we take the pointer to the mark_set as a
    parameter. So even though insert_mark() was updating the global
    "marks", the local pointer we had in read_mark_file() was not
    updated. As a result, we'd add a new level when needed, but then the
    next call to insert_mark() wouldn't see it! It would then allocate a
    new layer, which would also not be seen, and so on. Lookups for the
    lost layers obviously wouldn't work, but before we even hit any
    lookup stage, we'd generally run out of memory and die.

Our tests didn't notice either of these cases because they didn't have
enough marks to trigger the push-down behavior. The new tests in t9304
cover both cases (and fail without this patch).

We can solve the problem by having insert_mark() take a pointer-to-pointer
of the top-level of the set. Then our push down can assign to it in a
way that the caller actually sees. Note the subtle reordering in
option_rewrite_submodules(). Our call to read_mark_file() may modify our
top-level set pointer, so we have to wait until after it returns to
assign its value into the string_list.

Reported-by: Sergey Brester <serg.brester@sebres.de>
Signed-off-by: Jeff King <peff@peff.net>
---
Two additional notes:

  - we could rename the global to "marks_toplevel" or something to make
    sure we got all references to it. But it makes the lookup code much
    uglier (it has to use the new name, and otherwise doesn't need
    touched by this patch). I actually did that temporarily to make sure
    there weren't any other lingering references, but it was too ugly to
    keep.

  - there's another global in insert_mark(), which is marks_set_count.
    We increment it once for each mark. We use the same counter whether
    we're adding a real mark, or a submodule-rewrite mark. Since it's
    not used for anything except reporting statistics at the end of the
    program, I think it's fine (it's not clear whether somebody would
    want the set of actual marks, or to know how often we had to call
    into the mark-insertion code).

 builtin/fast-import.c        | 31 ++++++++++++----------
 t/t9304-fast-import-marks.sh | 51 ++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+), 14 deletions(-)
 create mode 100755 t/t9304-fast-import-marks.sh

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 1bf50a73dc..70d7d25eed 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -150,7 +150,7 @@ struct recent_command {
 	char *buf;
 };
 
-typedef void (*mark_set_inserter_t)(struct mark_set *s, struct object_id *oid, uintmax_t mark);
+typedef void (*mark_set_inserter_t)(struct mark_set **s, struct object_id *oid, uintmax_t mark);
 typedef void (*each_mark_fn_t)(uintmax_t mark, void *obj, void *cbp);
 
 /* Configured limits on output */
@@ -526,13 +526,15 @@ static unsigned int hc_str(const char *s, size_t len)
 	return r;
 }
 
-static void insert_mark(struct mark_set *s, uintmax_t idnum, struct object_entry *oe)
+static void insert_mark(struct mark_set **top, uintmax_t idnum, struct object_entry *oe)
 {
+	struct mark_set *s = *top;
+
 	while ((idnum >> s->shift) >= 1024) {
 		s = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
-		s->shift = marks->shift + 10;
-		s->data.sets[0] = marks;
-		marks = s;
+		s->shift = (*top)->shift + 10;
+		s->data.sets[0] = *top;
+		*top = s;
 	}
 	while (s->shift) {
 		uintmax_t i = idnum >> s->shift;
@@ -944,7 +946,7 @@ static int store_object(
 
 	e = insert_object(&oid);
 	if (mark)
-		insert_mark(marks, mark, e);
+		insert_mark(&marks, mark, e);
 	if (e->idx.offset) {
 		duplicate_count_by_type[type]++;
 		return 1;
@@ -1142,7 +1144,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 	e = insert_object(&oid);
 
 	if (mark)
-		insert_mark(marks, mark, e);
+		insert_mark(&marks, mark, e);
 
 	if (e->idx.offset) {
 		duplicate_count_by_type[OBJ_BLOB]++;
@@ -1717,7 +1719,7 @@ static void dump_marks(void)
 	}
 }
 
-static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
+static void insert_object_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
 {
 	struct object_entry *e;
 	e = find_object(oid);
@@ -1734,12 +1736,12 @@ static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintm
 	insert_mark(s, mark, e);
 }
 
-static void insert_oid_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
+static void insert_oid_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
 {
 	insert_mark(s, mark, xmemdupz(oid, sizeof(*oid)));
 }
 
-static void read_mark_file(struct mark_set *s, FILE *f, mark_set_inserter_t inserter)
+static void read_mark_file(struct mark_set **s, FILE *f, mark_set_inserter_t inserter)
 {
 	char line[512];
 	while (fgets(line, sizeof(line), f)) {
@@ -1772,7 +1774,7 @@ static void read_marks(void)
 		goto done; /* Marks file does not exist */
 	else
 		die_errno("cannot read '%s'", import_marks_file);
-	read_mark_file(marks, f, insert_object_entry);
+	read_mark_file(&marks, f, insert_object_entry);
 	fclose(f);
 done:
 	import_marks_file_done = 1;
@@ -3228,7 +3230,7 @@ static void parse_alias(void)
 		die(_("Expected 'to' command, got %s"), command_buf.buf);
 	e = find_object(&b.oid);
 	assert(e);
-	insert_mark(marks, next_mark, e);
+	insert_mark(&marks, next_mark, e);
 }
 
 static char* make_fast_import_path(const char *path)
@@ -3321,13 +3323,14 @@ static void option_rewrite_submodules(const char *arg, struct string_list *list)
 	*f = '\0';
 	f++;
 	ms = xcalloc(1, sizeof(*ms));
-	string_list_insert(list, s)->util = ms;
 
 	fp = fopen(f, "r");
 	if (!fp)
 		die_errno("cannot read '%s'", f);
-	read_mark_file(ms, fp, insert_oid_entry);
+	read_mark_file(&ms, fp, insert_oid_entry);
 	fclose(fp);
+
+	string_list_insert(list, s)->util = ms;
 }
 
 static int parse_one_option(const char *option)
diff --git a/t/t9304-fast-import-marks.sh b/t/t9304-fast-import-marks.sh
new file mode 100755
index 0000000000..d4359dba21
--- /dev/null
+++ b/t/t9304-fast-import-marks.sh
@@ -0,0 +1,51 @@
+#!/bin/sh
+
+test_description='test exotic situations with marks'
+. ./test-lib.sh
+
+test_expect_success 'setup dump of basic history' '
+	test_commit one &&
+	git fast-export --export-marks=marks HEAD >dump
+'
+
+test_expect_success 'setup large marks file' '
+	# normally a marks file would have a lot of useful, unique
+	# marks. But for our purposes, just having a lot of nonsense
+	# ones is fine. Start at 1024 to avoid clashing with marks
+	# legitimately used in our tiny dump.
+	blob=$(git rev-parse HEAD:one.t) &&
+	for i in $(test_seq 1024 16384)
+	do
+		echo ":$i $blob"
+	done >>marks
+'
+
+test_expect_success 'import with large marks file' '
+	git fast-import --import-marks=marks <dump
+'
+
+test_expect_success 'setup dump with submodule' '
+	git submodule add "$PWD" sub &&
+	git commit -m "add submodule" &&
+	git fast-export HEAD >dump
+'
+
+test_expect_success 'setup submodule mapping with large id' '
+	old=$(git rev-parse HEAD:sub) &&
+	new=$(echo $old | sed s/./a/g) &&
+	echo ":12345 $old" >from &&
+	echo ":12345 $new" >to
+'
+
+test_expect_success 'import with submodule mapping' '
+	git init dst &&
+	git -C dst fast-import \
+		--rewrite-submodules-from=sub:../from \
+		--rewrite-submodules-to=sub:../to \
+		<dump &&
+	git -C dst rev-parse HEAD:sub >actual &&
+	echo "$new" >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.29.0.rc1.562.g7bd9bc8902



* Re: git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump
  2020-10-15 11:50   ` Dipl. Ing. Sergey Brester
  2020-10-15 15:38     ` [PATCH] fast-import: fix over-allocation of marks storage Jeff King
@ 2020-10-15 15:52     ` René Scharfe
  2020-10-15 16:19       ` Dipl. Ing. Sergey Brester
  1 sibling, 1 reply; 20+ messages in thread
From: René Scharfe @ 2020-10-15 15:52 UTC (permalink / raw)
  To: Dipl. Ing. Sergey Brester, Jeff King; +Cc: git, brian m. carlson

Am 15.10.20 um 13:50 schrieb Dipl. Ing. Sergey Brester:
> ```
> # clone or copy git repository (we'll use it for export and import):
> git clone https://github.com/git/git.git
> cd git
>
> # make 1st fast-export in order to generate marks (don't need the dump, only marks are needed):
> git fast-export --reencode=yes --export-marks=.tmp.exp-1.tmp e83c5163316f89bfbde7d9ab23ca2e25604af290..dc04167d378fb29d30e1647ff6ff51dd182bc9a3 > /dev/null
>
> # make 2nd fast-export in order to generate partial export dump (file ".tmp.dump" will be ca. 87MB):
> git fast-export --reencode=yes --import-marks=.tmp.exp-1.tmp --export-marks=.tmp.exp-2.tmp 61addb841f2a6d74a1737a01e03df1f773e04944..master > .tmp.dump

In case someone else is wondering about the meaning of those hashes,
here are their reference strings:

e83c516331 (Initial revision of "git", the information manager from hell, 2005-04-07)
dc04167d37 (Fourth batch, 2020-08-04)
61addb841f (Merge branch 'jk/strvec' into next, 2020-08-04)

> # now try to import this dump, using first marks as import marks (we have all revs in git-repo):
> git fast-import --import-marks=.tmp.exp-1.tmp --export-marks=.tmp.imp.tmp < .tmp.dump
>
> ```

So you use the marks generated by the first export to import the second
export.  I wonder if that's relevant to trigger the memory allocation
issue.

> And see how git-fast-import eating your whole memory and enjoy the crash :)
>
> ```
> fatal: Out of memory, malloc failed (tried to allocate 2097152 bytes)
> fast-import: dumping crash report to .git/fast_import_crash_6684
> ```

I can reproduce this on Debian -- but don't get a crash report, just a
notice from the OOM killer.  It bisects to ddddf8d7e2 (fast-import:
permit reading multiple marks files, 2020-02-22).

René


* Re: git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump
  2020-10-15 15:52     ` git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump René Scharfe
@ 2020-10-15 16:19       ` Dipl. Ing. Sergey Brester
  0 siblings, 0 replies; 20+ messages in thread
From: Dipl. Ing. Sergey Brester @ 2020-10-15 16:19 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jeff King, git, brian m. carlson

15.10.2020 17:52, René Scharfe wrote:

> In case someone else is wondering about the meaning of those hashes,
> here are their reference strings:
> e83c516331 (Initial revision of "git", the information manager from 
> hell, 2005-04-07)
> dc04167d37 (Fourth batch, 2020-08-04)
> 61addb841f (Merge branch 'jk/strvec' into next, 2020-08-04)
> 

These are just revisions of the master branch; the process was split
into two parts to generate a large marks file with a small partial dump
that can be imported over the first marks.
So it quasi-simulates the normal situation (large repo, small partial
export dump) for which import/export with marks is basically used.
It doesn't matter here; it is just helpful to export/import a single
branch only (to bypass signed tags, avoid installing gpg keys, etc).

> So you use the marks generated by the first export to import the second
> export.

It also doesn't matter, because you could import the whole first dump
(several gigabytes, I guess) in order to generate new import marks...
which would then be the same after all :) just because it is the same
repository used for export (all the revisions are already there,
moreover with the same hash values, let alone the internal marks index
and the export sequence).

> I wonder if that's relevant to trigger the memory allocation issue.

No, it is not relevant.
Also note my initial report, where a real situation is affected (with
import marks generated previously by git fast-import).

> I can reproduce this on Debian -- but don't get a crash report, just a
> notice from the OOM killer. It bisects to ddddf8d7e2 (fast-import:
> permit reading multiple marks files, 2020-02-22).

Sure, the OOM killer kills this process (before it is able to crash,
e.g. create a dump on a failed malloc)...
If you disable the OOM killer (or try to run it on Windows), you'll see
the crash, and the dump is there. :)

Regards,
Sergey.


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 15:38     ` [PATCH] fast-import: fix over-allocation of marks storage Jeff King
@ 2020-10-15 17:29       ` Junio C Hamano
  2020-10-15 17:34         ` Junio C Hamano
  2020-10-15 17:57       ` René Scharfe
  1 sibling, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2020-10-15 17:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

Jeff King <peff@peff.net> writes:

> Subject: [PATCH] fast-import: fix over-allocation of marks storage
>
> Fast-import stores its marks in a trie-like structure made of mark_set
> structs. Each struct has a fixed size (1024). If our id number is too
> large to fit in the struct, then we allocate a new struct which shifts
> the id number by 10 bits. Our original struct becomes a child node
> of this new layer, and the new struct becomes the top level of the trie.
>
> This scheme was broken by ddddf8d7e2 (fast-import: permit reading
> multiple marks files, 2020-02-22). Before then, we had a top-level
> "marks" pointer, and the push-down worked by assigning the new top-level
> struct to "marks". But after that commit, insert_mark() takes a pointer
> to the mark_set, rather than using the global "marks". It continued to
> assign to the global "marks" variable during the push down, which was
> wrong for two reasons:
>
>   - we added a call in option_rewrite_submodules() which uses a separate
>     mark set; pushing down on "marks" is outright wrong here. We'd
>     corrupt the "marks" set, and we'd fail to correctly store any
>     submodule mappings with an id over 1024.
>
>   - the other callers passed "marks", but the push-down was still wrong.
>     In read_mark_file(), we take the pointer to the mark_set as a
>     parameter. So even though insert_mark() was updating the global
>     "marks", the local pointer we had in read_mark_file() was not
>     updated. As a result, we'd add a new level when needed, but then the
>     next call to insert_mark() wouldn't see it! It would then allocate a
>     new layer, which would also not be seen, and so on. Lookups for the
>     lost layers obviously wouldn't work, but before we even hit any
>     lookup stage, we'd generally run out of memory and die.
>
> Our tests didn't notice either of these cases because they didn't have
> enough marks to trigger the push-down behavior. The new tests in t9304
> cover both cases (and fail without this patch).
>
> We can solve the problem by having insert_mark() take a pointer-to-pointer
> of the top-level of the set. Then our push down can assign to it in a
> way that the caller actually sees. Note the subtle reordering in
> option_rewrite_submodules(). Our call to read_mark_file() may modify our
> top-level set pointer, so we have to wait until after it returns to
> assign its value into the string_list.

Nice.

Why does this vaguely look familiar, I wonder.  I could swear that we
saw a breakage due to a similar pattern that attempted to convert a
global to an arg passed down the callchain, but did not do so fully.

Anyway, the diagnoses above look correct and cleanly described.

Will queue.  Thanks.

>
> Reported-by: Sergey Brester <serg.brester@sebres.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Two additional notes:
>
>   - we could rename the global to "marks_toplevel" or something to make
>     sure we got all references to it. But it makes the lookup code much
>     uglier (it has to use the new name, and otherwise doesn't need
>     touched by this patch). I actually did that temporarily to make sure
>     there weren't any other lingering references, but it was too ugly to
>     keep.
>
>   - there's another global in insert_mark(), which is marks_set_count.
>     We increment it once for each mark. We use the same counter whether
>     we're adding a real mark, or a submodule-rewrite mark. Since it's
>     not used for anything except reporting statistics at the end of the
>     program, I think it's fine (it's not clear whether somebody would
>     want the set of actual marks, or to know how often we had to call
>     into the mark-insertion code).
>
>  builtin/fast-import.c        | 31 ++++++++++++----------
>  t/t9304-fast-import-marks.sh | 51 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+), 14 deletions(-)
>  create mode 100755 t/t9304-fast-import-marks.sh
>
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 1bf50a73dc..70d7d25eed 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -150,7 +150,7 @@ struct recent_command {
>  	char *buf;
>  };
>  
> -typedef void (*mark_set_inserter_t)(struct mark_set *s, struct object_id *oid, uintmax_t mark);
> +typedef void (*mark_set_inserter_t)(struct mark_set **s, struct object_id *oid, uintmax_t mark);
>  typedef void (*each_mark_fn_t)(uintmax_t mark, void *obj, void *cbp);
>  
>  /* Configured limits on output */
> @@ -526,13 +526,15 @@ static unsigned int hc_str(const char *s, size_t len)
>  	return r;
>  }
>  
> -static void insert_mark(struct mark_set *s, uintmax_t idnum, struct object_entry *oe)
> +static void insert_mark(struct mark_set **top, uintmax_t idnum, struct object_entry *oe)
>  {
> +	struct mark_set *s = *top;
> +
>  	while ((idnum >> s->shift) >= 1024) {
>  		s = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
> -		s->shift = marks->shift + 10;
> -		s->data.sets[0] = marks;
> -		marks = s;
> +		s->shift = (*top)->shift + 10;
> +		s->data.sets[0] = *top;
> +		*top = s;
>  	}
>  	while (s->shift) {
>  		uintmax_t i = idnum >> s->shift;
> @@ -944,7 +946,7 @@ static int store_object(
>  
>  	e = insert_object(&oid);
>  	if (mark)
> -		insert_mark(marks, mark, e);
> +		insert_mark(&marks, mark, e);
>  	if (e->idx.offset) {
>  		duplicate_count_by_type[type]++;
>  		return 1;
> @@ -1142,7 +1144,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
>  	e = insert_object(&oid);
>  
>  	if (mark)
> -		insert_mark(marks, mark, e);
> +		insert_mark(&marks, mark, e);
>  
>  	if (e->idx.offset) {
>  		duplicate_count_by_type[OBJ_BLOB]++;
> @@ -1717,7 +1719,7 @@ static void dump_marks(void)
>  	}
>  }
>  
> -static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
> +static void insert_object_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
>  {
>  	struct object_entry *e;
>  	e = find_object(oid);
> @@ -1734,12 +1736,12 @@ static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintm
>  	insert_mark(s, mark, e);
>  }
>  
> -static void insert_oid_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
> +static void insert_oid_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
>  {
>  	insert_mark(s, mark, xmemdupz(oid, sizeof(*oid)));
>  }
>  
> -static void read_mark_file(struct mark_set *s, FILE *f, mark_set_inserter_t inserter)
> +static void read_mark_file(struct mark_set **s, FILE *f, mark_set_inserter_t inserter)
>  {
>  	char line[512];
>  	while (fgets(line, sizeof(line), f)) {
> @@ -1772,7 +1774,7 @@ static void read_marks(void)
>  		goto done; /* Marks file does not exist */
>  	else
>  		die_errno("cannot read '%s'", import_marks_file);
> -	read_mark_file(marks, f, insert_object_entry);
> +	read_mark_file(&marks, f, insert_object_entry);
>  	fclose(f);
>  done:
>  	import_marks_file_done = 1;
> @@ -3228,7 +3230,7 @@ static void parse_alias(void)
>  		die(_("Expected 'to' command, got %s"), command_buf.buf);
>  	e = find_object(&b.oid);
>  	assert(e);
> -	insert_mark(marks, next_mark, e);
> +	insert_mark(&marks, next_mark, e);
>  }
>  
>  static char* make_fast_import_path(const char *path)
> @@ -3321,13 +3323,14 @@ static void option_rewrite_submodules(const char *arg, struct string_list *list)
>  	*f = '\0';
>  	f++;
>  	ms = xcalloc(1, sizeof(*ms));
> -	string_list_insert(list, s)->util = ms;
>  
>  	fp = fopen(f, "r");
>  	if (!fp)
>  		die_errno("cannot read '%s'", f);
> -	read_mark_file(ms, fp, insert_oid_entry);
> +	read_mark_file(&ms, fp, insert_oid_entry);
>  	fclose(fp);
> +
> +	string_list_insert(list, s)->util = ms;
>  }
>  
>  static int parse_one_option(const char *option)
> diff --git a/t/t9304-fast-import-marks.sh b/t/t9304-fast-import-marks.sh
> new file mode 100755
> index 0000000000..d4359dba21
> --- /dev/null
> +++ b/t/t9304-fast-import-marks.sh
> @@ -0,0 +1,51 @@
> +#!/bin/sh
> +
> +test_description='test exotic situations with marks'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup dump of basic history' '
> +	test_commit one &&
> +	git fast-export --export-marks=marks HEAD >dump
> +'
> +
> +test_expect_success 'setup large marks file' '
> +	# normally a marks file would have a lot of useful, unique
> +	# marks. But for our purposes, just having a lot of nonsense
> +	# ones is fine. Start at 1024 to avoid clashing with marks
> +	# legitimately used in our tiny dump.
> +	blob=$(git rev-parse HEAD:one.t) &&
> +	for i in $(test_seq 1024 16384)
> +	do
> +		echo ":$i $blob"
> +	done >>marks
> +'
> +
> +test_expect_success 'import with large marks file' '
> +	git fast-import --import-marks=marks <dump
> +'
> +
> +test_expect_success 'setup dump with submodule' '
> +	git submodule add "$PWD" sub &&
> +	git commit -m "add submodule" &&
> +	git fast-export HEAD >dump
> +'
> +
> +test_expect_success 'setup submodule mapping with large id' '
> +	old=$(git rev-parse HEAD:sub) &&
> +	new=$(echo $old | sed s/./a/g) &&
> +	echo ":12345 $old" >from &&
> +	echo ":12345 $new" >to
> +'
> +
> +test_expect_success 'import with submodule mapping' '
> +	git init dst &&
> +	git -C dst fast-import \
> +		--rewrite-submodules-from=sub:../from \
> +		--rewrite-submodules-to=sub:../to \
> +		<dump &&
> +	git -C dst rev-parse HEAD:sub >actual &&
> +	echo "$new" >expect &&
> +	test_cmp expect actual
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 17:29       ` Junio C Hamano
@ 2020-10-15 17:34         ` Junio C Hamano
  2020-10-15 18:09           ` Dipl. Ing. Sergey Brester
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2020-10-15 17:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

Junio C Hamano <gitster@pobox.com> writes:

> Why does this vaguely look familiar, I wonder.  I could swear that we
> saw a breakage due to a similar pattern that attempts to convert a
> global to an arg that is passed down to the callchain but not doing
> so fully.

Are we revisiting this thread?

https://lore.kernel.org/git/xmqqtuzlld1d.fsf@gitster.c.googlers.com/

I wonder what happened to the thread in the end ...

> Anyway, the diagnoses above look correct and cleanly described.
>
> Will queue.  Thanks.


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 15:38     ` [PATCH] fast-import: fix over-allocation of marks storage Jeff King
  2020-10-15 17:29       ` Junio C Hamano
@ 2020-10-15 17:57       ` René Scharfe
  1 sibling, 0 replies; 20+ messages in thread
From: René Scharfe @ 2020-10-15 17:57 UTC (permalink / raw)
  To: Jeff King, Dipl. Ing. Sergey Brester; +Cc: brian m. carlson, git

Am 15.10.20 um 17:38 schrieb Jeff King:
> On Thu, Oct 15, 2020 at 01:50:30PM +0200, Dipl. Ing. Sergey Brester wrote:
>
>> well, I don't know how you were trying to reproduce it.
>>
>> My first attempt with a git-repository (cloned from
>> https://github.com/git/git.git) showed that immediately to me.
>> Here you go (I used git bash here):
>> [...]
>
> Thanks for this recipe. The key thing I was missing was having a
> reasonably large number of marks to be imported.
>
> The problem bisects to ddddf8d7e2 (fast-import: permit reading multiple
> marks files, 2020-02-22), which is in v2.27.0. The fix is below. Since
> we're entering the -rc2 period for v2.29 today and this isn't a
> regression since v2.28, it probably won't make it into v2.29. But it's a
> pretty serious bug (I'm actually surprised it took this long for anyone
> to notice, as mark importing of any decent size is basically broken),
> so I hope it will make it onto the maint branch soon after release.
>
> -- >8 --
> Subject: [PATCH] fast-import: fix over-allocation of marks storage
>
> Fast-import stores its marks in a trie-like structure made of mark_set
> structs. Each struct has a fixed size (1024). If our id number is too
> large to fit in the struct, then we allocate a new struct which shifts
> the id number by 10 bits. Our original struct becomes a child node
> of this new layer, and the new struct becomes the top level of the trie.
>
> This scheme was broken by ddddf8d7e2 (fast-import: permit reading
> multiple marks files, 2020-02-22). Before then, we had a top-level
> "marks" pointer, and the push-down worked by assigning the new top-level
> struct to "marks". But after that commit, insert_mark() takes a pointer
> to the mark_set, rather than using the global "marks". It continued to
> assign to the global "marks" variable during the push down, which was
> wrong for two reasons:
>
>   - we added a call in option_rewrite_submodules() which uses a separate
>     mark set; pushing down on "marks" is outright wrong here. We'd
>     corrupt the "marks" set, and we'd fail to correctly store any
>     submodule mappings with an id over 1024.
>
>   - the other callers passed "marks", but the push-down was still wrong.
>     In read_mark_file(), we take the pointer to the mark_set as a
>     parameter. So even though insert_mark() was updating the global
>     "marks", the local pointer we had in read_mark_file() was not
>     updated. As a result, we'd add a new level when needed, but then the
>     next call to insert_mark() wouldn't see it! It would then allocate a
>     new layer, which would also not be seen, and so on. Lookups for the
>     lost layers obviously wouldn't work, but before we even hit any
>     lookup stage, we'd generally run out of memory and die.
>
> Our tests didn't notice either of these cases because they didn't have
> enough marks to trigger the push-down behavior. The new tests in t9304
> cover both cases (and fail without this patch).
>
> We can solve the problem by having insert_mark() take a pointer-to-pointer
> of the top-level of the set. Then our push down can assign to it in a
> way that the caller actually sees. Note the subtle reordering in
> option_rewrite_submodules(). Our call to read_mark_file() may modify our
> top-level set pointer, so we have to wait until after it returns to
> assign its value into the string_list.
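The push-down scheme and its fix can be sketched in isolation. The following is a simplified model (a hypothetical `new_set` helper and plain `void *` payloads, not fast-import's actual code) showing why the new root must be published through a pointer-to-pointer:

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified model of fast-import's mark_set trie: each level indexes
 * 10 bits of the mark id; nodes with shift == 0 are leaves. */
struct mark_set {
	unsigned shift;
	void *slots[1024];
};

static struct mark_set *new_set(unsigned shift)
{
	struct mark_set *s = calloc(1, sizeof(*s));
	s->shift = shift;
	return s;
}

static void insert_mark(struct mark_set **top, uintmax_t idnum, void *oe)
{
	struct mark_set *s = *top;

	/* Push down: add levels until the id fits. Assigning through *top
	 * is the crucial part -- writing the new root to a global (or to a
	 * local copy the caller never sees) is exactly the bug the patch
	 * fixes. */
	while ((idnum >> s->shift) >= 1024) {
		s = new_set((*top)->shift + 10);
		s->slots[0] = *top;
		*top = s;
	}
	/* Descend, creating intermediate levels as needed. */
	while (s->shift) {
		uintmax_t i = (idnum >> s->shift) & 1023;
		if (!s->slots[i])
			s->slots[i] = new_set(s->shift - 10);
		s = s->slots[i];
	}
	s->slots[idnum & 1023] = oe;
}

static void *find_mark(struct mark_set *s, uintmax_t idnum)
{
	while (s && s->shift)
		s = s->slots[(idnum >> s->shift) & 1023];
	return s ? s->slots[idnum & 1023] : NULL;
}
```

With `struct mark_set *marks = new_set(0)`, a call like `insert_mark(&marks, 1500, obj)` grows the trie and updates `marks` itself, so later inserts and lookups see the new root instead of allocating fresh, orphaned layers.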

Looks good to me, and FWIW it fixes Sergey's test case for me as well.
(I wish I'd seen it before I hit send on my other reply..)

>
> Reported-by: Sergey Brester <serg.brester@sebres.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Two additional notes:
>
>   - we could rename the global to "marks_toplevel" or something to make
>     sure we got all references to it. But it makes the lookup code much
>     uglier (it has to use the new name, and otherwise doesn't need to
>     be touched by this patch). I actually did that temporarily to make sure
>     there weren't any other lingering references, but it was too ugly to
>     keep.
>
>   - there's another global in insert_mark(), which is marks_set_count.
>     We increment it once for each mark. We use the same counter whether
>     we're adding a real mark, or a submodule-rewrite mark. Since it's
>     not used for anything except reporting statistics at the end of the
>     program, I think it's fine (it's not clear whether somebody would
>     want the set of actual marks, or to know how often we had to call
>     into the mark-insertion code).

Semi-related: We should turn the struct object_entry pointers in
insert_mark() and find_mark() into void pointers, since they are
used "generically".  Perhaps struct mark_set would benefit from a
type member to implement our own type checks.
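The type-tag idea could be sketched like this (purely illustrative; `tagged_slot`, `mark_kind`, and `slot_get` are hypothetical names, not proposed fast-import API):

```c
#include <assert.h>

/* What a given void pointer in the marks trie actually points at. */
enum mark_kind { MARK_NONE, MARK_OBJECT_ENTRY, MARK_OID };

struct tagged_slot {
	enum mark_kind kind;
	void *ptr;
};

/* Checked accessor: the caller states what it expects, and a mismatch
 * trips an assertion at the access site instead of silently
 * reinterpreting memory as the wrong type. */
static void *slot_get(struct tagged_slot *s, enum mark_kind expect)
{
	assert(s->kind == expect);
	return s->ptr;
}
```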

>
>  builtin/fast-import.c        | 31 ++++++++++++----------
>  t/t9304-fast-import-marks.sh | 51 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+), 14 deletions(-)
>  create mode 100755 t/t9304-fast-import-marks.sh
>
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 1bf50a73dc..70d7d25eed 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -150,7 +150,7 @@ struct recent_command {
>  	char *buf;
>  };
>
> -typedef void (*mark_set_inserter_t)(struct mark_set *s, struct object_id *oid, uintmax_t mark);
> +typedef void (*mark_set_inserter_t)(struct mark_set **s, struct object_id *oid, uintmax_t mark);
>  typedef void (*each_mark_fn_t)(uintmax_t mark, void *obj, void *cbp);
>
>  /* Configured limits on output */
> @@ -526,13 +526,15 @@ static unsigned int hc_str(const char *s, size_t len)
>  	return r;
>  }
>
> -static void insert_mark(struct mark_set *s, uintmax_t idnum, struct object_entry *oe)
> +static void insert_mark(struct mark_set **top, uintmax_t idnum, struct object_entry *oe)
>  {
> +	struct mark_set *s = *top;
> +
>  	while ((idnum >> s->shift) >= 1024) {
>  		s = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
> -		s->shift = marks->shift + 10;
> -		s->data.sets[0] = marks;
> -		marks = s;
> +		s->shift = (*top)->shift + 10;
> +		s->data.sets[0] = *top;
> +		*top = s;
>  	}
>  	while (s->shift) {
>  		uintmax_t i = idnum >> s->shift;
> @@ -944,7 +946,7 @@ static int store_object(
>
>  	e = insert_object(&oid);
>  	if (mark)
> -		insert_mark(marks, mark, e);
> +		insert_mark(&marks, mark, e);
>  	if (e->idx.offset) {
>  		duplicate_count_by_type[type]++;
>  		return 1;
> @@ -1142,7 +1144,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
>  	e = insert_object(&oid);
>
>  	if (mark)
> -		insert_mark(marks, mark, e);
> +		insert_mark(&marks, mark, e);
>
>  	if (e->idx.offset) {
>  		duplicate_count_by_type[OBJ_BLOB]++;
> @@ -1717,7 +1719,7 @@ static void dump_marks(void)
>  	}
>  }
>
> -static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
> +static void insert_object_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
>  {
>  	struct object_entry *e;
>  	e = find_object(oid);
> @@ -1734,12 +1736,12 @@ static void insert_object_entry(struct mark_set *s, struct object_id *oid, uintm
>  	insert_mark(s, mark, e);
>  }
>
> -static void insert_oid_entry(struct mark_set *s, struct object_id *oid, uintmax_t mark)
> +static void insert_oid_entry(struct mark_set **s, struct object_id *oid, uintmax_t mark)
>  {
>  	insert_mark(s, mark, xmemdupz(oid, sizeof(*oid)));
>  }
>
> -static void read_mark_file(struct mark_set *s, FILE *f, mark_set_inserter_t inserter)
> +static void read_mark_file(struct mark_set **s, FILE *f, mark_set_inserter_t inserter)
>  {
>  	char line[512];
>  	while (fgets(line, sizeof(line), f)) {
> @@ -1772,7 +1774,7 @@ static void read_marks(void)
>  		goto done; /* Marks file does not exist */
>  	else
>  		die_errno("cannot read '%s'", import_marks_file);
> -	read_mark_file(marks, f, insert_object_entry);
> +	read_mark_file(&marks, f, insert_object_entry);
>  	fclose(f);
>  done:
>  	import_marks_file_done = 1;
> @@ -3228,7 +3230,7 @@ static void parse_alias(void)
>  		die(_("Expected 'to' command, got %s"), command_buf.buf);
>  	e = find_object(&b.oid);
>  	assert(e);
> -	insert_mark(marks, next_mark, e);
> +	insert_mark(&marks, next_mark, e);
>  }
>
>  static char* make_fast_import_path(const char *path)
> @@ -3321,13 +3323,14 @@ static void option_rewrite_submodules(const char *arg, struct string_list *list)
>  	*f = '\0';
>  	f++;
>  	ms = xcalloc(1, sizeof(*ms));
> -	string_list_insert(list, s)->util = ms;
>
>  	fp = fopen(f, "r");
>  	if (!fp)
>  		die_errno("cannot read '%s'", f);
> -	read_mark_file(ms, fp, insert_oid_entry);
> +	read_mark_file(&ms, fp, insert_oid_entry);
>  	fclose(fp);
> +
> +	string_list_insert(list, s)->util = ms;
>  }
>
>  static int parse_one_option(const char *option)
> diff --git a/t/t9304-fast-import-marks.sh b/t/t9304-fast-import-marks.sh
> new file mode 100755
> index 0000000000..d4359dba21
> --- /dev/null
> +++ b/t/t9304-fast-import-marks.sh
> @@ -0,0 +1,51 @@
> +#!/bin/sh
> +
> +test_description='test exotic situations with marks'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup dump of basic history' '
> +	test_commit one &&
> +	git fast-export --export-marks=marks HEAD >dump
> +'
> +
> +test_expect_success 'setup large marks file' '
> +	# normally a marks file would have a lot of useful, unique
> +	# marks. But for our purposes, just having a lot of nonsense
> +	# ones is fine. Start at 1024 to avoid clashing with marks
> +	# legitimately used in our tiny dump.
> +	blob=$(git rev-parse HEAD:one.t) &&
> +	for i in $(test_seq 1024 16384)
> +	do
> +		echo ":$i $blob"
> +	done >>marks
> +'
> +
> +test_expect_success 'import with large marks file' '
> +	git fast-import --import-marks=marks <dump
> +'
> +
> +test_expect_success 'setup dump with submodule' '
> +	git submodule add "$PWD" sub &&
> +	git commit -m "add submodule" &&
> +	git fast-export HEAD >dump
> +'
> +
> +test_expect_success 'setup submodule mapping with large id' '
> +	old=$(git rev-parse HEAD:sub) &&
> +	new=$(echo $old | sed s/./a/g) &&
> +	echo ":12345 $old" >from &&
> +	echo ":12345 $new" >to
> +'
> +
> +test_expect_success 'import with submodule mapping' '
> +	git init dst &&
> +	git -C dst fast-import \
> +		--rewrite-submodules-from=sub:../from \
> +		--rewrite-submodules-to=sub:../to \
> +		<dump &&
> +	git -C dst rev-parse HEAD:sub >actual &&
> +	echo "$new" >expect &&
> +	test_cmp expect actual
> +'
> +
> +test_done
>



* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 17:34         ` Junio C Hamano
@ 2020-10-15 18:09           ` Dipl. Ing. Sergey Brester
  2020-10-15 18:35             ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Dipl. Ing. Sergey Brester @ 2020-10-15 18:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, brian m. carlson, git

Nice.

Brian's patch does indeed look very similar.

Maybe this is a sign to finally introduce a real issue tracker? :)
No offence, but I have always wondered how a team manages to keep all
of its issue-related work in the form of a mailing list without
running into such "embarrassments", especially on a project and
community this large.

Regards,
Sergey

15.10.2020 19:34, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Why does this vaguely look familiar, I wonder. I could swear that we saw 
>> a breakage due to a similar pattern that attempts to convert a global 
>> to an arg that is passed down to the callchain but not doing so fully.
> 
> Are we revisiting this thread?
> 
> https://lore.kernel.org/git/xmqqtuzlld1d.fsf@gitster.c.googlers.com/ 
> [1]
> 
> I wonder what happened to the thread in the end ...
> 
>> Anyway, the diagnoses above look correct and cleanly described. Will 
>> queue. Thanks.


Links:
------
[1] https://lore.kernel.org/git/xmqqtuzlld1d.fsf@gitster.c.googlers.com/


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:09           ` Dipl. Ing. Sergey Brester
@ 2020-10-15 18:35             ` Junio C Hamano
  2020-10-15 18:58               ` Jeff King
                                 ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Junio C Hamano @ 2020-10-15 18:35 UTC (permalink / raw)
  To: Dipl. Ing. Sergey Brester; +Cc: Jeff King, brian m. carlson, git

"Dipl. Ing. Sergey Brester" <serg.brester@sebres.de> writes:

> Maybe this is a sign to finally introduce a real issue tracker? :)
> No offence, but I have always wondered how a team manages to keep all
> of its issue-related work in the form of a mailing list without
> running into such "embarrassments", especially on a project and
> community this large.

I do not know if an issue-tracker would have helped, though.  The
issue was discovered and discussed there the day before:

  https://lore.kernel.org/git/xmqqimg5o5fq.fsf@gitster.c.googlers.com/

and then was discussed in another thread the next day.  Somehow the
discussion petered out without producing the final patch to be
applied.  For this particular case, what we need is a functioning
patch tracker *and* people who pay attention to patches in the "came
very close to conclusion but no final patch in the tree" state.  We
need people who can volunteer their eyeballs and attention to nudge,
prod and help patches to perfection, and that won't be me.

By the way, now I know why it looked familiar---the fix largely was
my code.  And the diff between Brian's from June and Peff's in this
thread is indeed quite small (shown below), which actually worries
me.  Was there something in the old attempt that was incomplete that
made us wait for the final finishing touches?  If so, is the current
round missing the same thing?  Or perhaps the test was what was
missing in the old attempt, in which case it's perfect (in the
attached diff, I excluded the t/ directory as the old fix didn't have
tests).

Thanks.

diff --git w/builtin/fast-import.c c/builtin/fast-import.c
index 71289b21e3..70d7d25eed 100644
--- w/builtin/fast-import.c
+++ c/builtin/fast-import.c
@@ -526,15 +526,15 @@ static unsigned int hc_str(const char *s, size_t len)
 	return r;
 }
 
-static void insert_mark(struct mark_set **sp, uintmax_t idnum, struct object_entry *oe)
+static void insert_mark(struct mark_set **top, uintmax_t idnum, struct object_entry *oe)
 {
-	struct mark_set *s = *sp;
+	struct mark_set *s = *top;
 
 	while ((idnum >> s->shift) >= 1024) {
 		s = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
-		s->shift = (*sp)->shift + 10;
-		s->data.sets[0] = (*sp);
-		(*sp) = s;
+		s->shift = (*top)->shift + 10;
+		s->data.sets[0] = *top;
+		*top = s;
 	}
 	while (s->shift) {
 		uintmax_t i = idnum >> s->shift;
@@ -3323,13 +3323,14 @@ static void option_rewrite_submodules(const char *arg, struct string_list *list)
 	*f = '\0';
 	f++;
 	ms = xcalloc(1, sizeof(*ms));
-	string_list_insert(list, s)->util = ms;
 
 	fp = fopen(f, "r");
 	if (!fp)
 		die_errno("cannot read '%s'", f);
 	read_mark_file(&ms, fp, insert_oid_entry);
 	fclose(fp);
+
+	string_list_insert(list, s)->util = ms;
 }
 
 static int parse_one_option(const char *option)


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:35             ` Junio C Hamano
@ 2020-10-15 18:58               ` Jeff King
  2020-10-15 19:13                 ` Junio C Hamano
  2020-10-16  2:37                 ` brian m. carlson
  2020-10-15 19:05               ` Jeff King
  2020-10-15 19:17               ` Dipl. Ing. Sergey Brester
  2 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2020-10-15 18:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

On Thu, Oct 15, 2020 at 11:35:28AM -0700, Junio C Hamano wrote:

> "Dipl. Ing. Sergey Brester" <serg.brester@sebres.de> writes:
> 
> > Maybe this is a sign to finally introduce a real issue tracker? :)
> > No offence, but I have always wondered how a team manages to keep all
> > of its issue-related work in the form of a mailing list without
> > running into such "embarrassments", especially on a project and
> > community this large.
> 
> I do not know if an issue-tracker would have helped, though.  The
> issue was discovered and discussed there the day before:
> 
>   https://lore.kernel.org/git/xmqqimg5o5fq.fsf@gitster.c.googlers.com/

Doh, and I was so proud of myself for diagnosing and fixing it. ;)

I hadn't read either of the threads you linked before today (I found
them in my "catch up on list reading" queue, though likely I would have
declared bankruptcy before reading them anyway).

At least that explains my surprise that the issue was not reported
earlier. It was. :)

IMHO an issue tracker wouldn't really change things here. The original
can be found in the first page of results of:

  https://lore.kernel.org/git/?q=fast-import+leak

(though if you add "-cooking -announce" there is even less noise). I
don't know that searching an issue tracker would do much better.

> By the way, now I know why it looked familiar---the fix largely was
> my code.  And the diff between Brian's from June and Peff's in this
> thread is indeed quite small (shown below), which actually worries
> me.  Was there something in the old attempt that was incomplete that
> made us wait for the final finishing touches?  If so, is the current
> round missing the same thing?  Or perhaps the test was what was
> missing in the old attempt, in which case it's perfect (in the
> attached diff, I excluded the t/ directory as the old fix didn't have
> tests).

Looking over the thread, I don't see any problems pointed out (though
as your diff below shows, the original patch missed the re-ordering
required for the submodule mapping call).

So I'd prefer my patch because of that fix and because of the tests.

-Peff


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:35             ` Junio C Hamano
  2020-10-15 18:58               ` Jeff King
@ 2020-10-15 19:05               ` Jeff King
  2020-10-15 19:06                 ` Jeff King
  2020-10-16  3:18                 ` brian m. carlson
  2020-10-15 19:17               ` Dipl. Ing. Sergey Brester
  2 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2020-10-15 19:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

On Thu, Oct 15, 2020 at 11:35:28AM -0700, Junio C Hamano wrote:

> For this particular case, what we need is a functioning
> patch tracker *and* people who pay attention to patches in the "came
> very close to conclusion but no final patch in the tree" state.  We
> need people who can volunteer their eyeballs and attention to nudge,
> prod and help patches to perfection, and that won't be me.

Usually I'd expect this to be the responsibility of the patch submitter
to make sure their stuff gets merged (and if not, to figure out why).

Personally I make a branch for almost every patch/series I submit, no
matter how trivial[1]. And then part of my daily ritual is seeing which
ones have been merged, and dropping them. You can use git-cherry for
that, though it's not 100% perfect (sometimes patches are munged as they
are applied). I use a combination of that and aggressively rebasing
patches forward (and eventually they rebase down into nothing when
they've been fully merged).

GitHub PRs can also serve as an open bookmark if people use them, but
they're similarly not good at auto-closing because of our patch workflow
(I don't know if GGG has any intelligence there, but it would presumably
be subject to the same git-cherry issues).

-Peff


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 19:05               ` Jeff King
@ 2020-10-15 19:06                 ` Jeff King
  2020-10-16  3:18                 ` brian m. carlson
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff King @ 2020-10-15 19:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

On Thu, Oct 15, 2020 at 03:05:32PM -0400, Jeff King wrote:

> Personally I make a branch for almost every patch/series I submit, no
> matter how trivial[1]. And then part of my daily ritual is seeing which

I forgot my footnote. It wasn't very interesting, but to avoid
confusion, it was:

  I very occasionally will not bother to create a branch, but my mental
  checklist for that is always: will I be sad if this patch gets lost in
  the discussion and never applied? And the answer is almost always
  "yes", so I create a branch.

-Peff


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:58               ` Jeff King
@ 2020-10-15 19:13                 ` Junio C Hamano
  2020-10-16  2:37                 ` brian m. carlson
  1 sibling, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2020-10-15 19:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Dipl. Ing. Sergey Brester, brian m. carlson, git

Jeff King <peff@peff.net> writes:

> Looking over the thread, I don't see any problems pointed out (though
> as your diff below shows, the original patch missed the re-ordering
> required for the submodule mapping call).
>
> So I'd prefer my patch because of that fix and because of the tests.

Oh, no question about that.  I would take a fresh fix that was done
on a more recent codebase than the one that was 4 months old and had
to be rebased.  Having a test too is a huge plus.

Thanks.  And by the way, thanks for the "-cooking -announce" tip ;-)



* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:35             ` Junio C Hamano
  2020-10-15 18:58               ` Jeff King
  2020-10-15 19:05               ` Jeff King
@ 2020-10-15 19:17               ` Dipl. Ing. Sergey Brester
  2020-10-15 20:15                 ` Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Dipl. Ing. Sergey Brester @ 2020-10-15 19:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, brian m. carlson, git


Well, a bug tracker would have some clear benefits:

1. it provides planning and statistics mechanisms such as a roadmap,
milestones, etc.;

2. issues get a priority (so one can pick out high-priority bugs);

3. issues can be labeled, or characterized with other metrics, to
signal their seriousness or to make them stand out on the roadmap;

4. issues are bound to the participating contributors (reporter,
developers, testers, etc.), so there are reports such as "open issues
assigned to me", which also helps to organize (and remember) one's
work;

5. the transparency with which an issue, or a list of issues, is
presented does not compare with a thread on a mailing list at all.

I could go on, but I am unsure these arguments will be heard or
welcome (I know it is your project, and everyone organizes their
workplace as they see fit). I was just wondering how this could have
worked, even for years. In my opinion, the advantages of a bug
tracker are obvious to everyone who has used one.

And automatic export/import between a tracker and the mailing list
(to retain the list for the more conservative, old-school
participants) is mostly trivial.

Thanks,
Serg.

15.10.2020 20:35, Junio C Hamano wrote:

> "Dipl. Ing. Sergey Brester" <serg.brester@sebres.de> writes:
> 
>> Maybe this is a sign to finally introduce a real issue tracker? :)
>> No offence, but I have always wondered how a team manages to keep all
>> of its issue-related work in the form of a mailing list without
>> running into such "embarrassments", especially on a project and
>> community this large.
> 
> I do not know if an issue-tracker would have helped, though.


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 19:17               ` Dipl. Ing. Sergey Brester
@ 2020-10-15 20:15                 ` Junio C Hamano
  0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2020-10-15 20:15 UTC (permalink / raw)
  To: Dipl. Ing. Sergey Brester; +Cc: Jeff King, brian m. carlson, git

"Dipl. Ing. Sergey Brester" <serg.brester@sebres.de> writes:

> Well, a bug tracker would have some clear benefits:
>
> 1. it provides planning and statistics mechanisms such as a roadmap,
> milestones, etc.;
>
> 2. issues get a priority (so one can pick out high-priority bugs);
>
> 3. issues can be labeled, or characterized with other metrics, to
> signal their seriousness or to make them stand out on the roadmap;
>
> 4. issues are bound to the participating contributors (reporter,
> developers, testers, etc.), so there are reports such as "open issues
> assigned to me", which also helps to organize (and remember) one's
> work;
>
> 5. the transparency with which an issue, or a list of issues, is
> presented does not compare with a thread on a mailing list at all.

What is curious about the above list is that these benefits depend
heavily on people actually curating the tracker.  No issue tracker
automatically makes random issues "prioritized", and no issue tracker
automatically closes a stale or irrelevant issue.  Without curation,
it tends to become a pile of useless reports.

> I could go on, but I am unsure these arguments will be heard or
> welcome ...

I actually think people already _know_ the benefit of having a well
curated tracker, and suspect that what needs to be stressed is *not*
the "arguments" for having one.  It is just that those who bring it
up on this list never seem to be interested in maintaining the
tracker in a useful state, and assume that issue curation magically
happens for free; that is probably why it never materializes.

It would be great if you are volunteering to be one of the issue
managers, of course.  FWIW, I think somebody (jrnieder?) already
runs a tracker for us.  I do not know how much community
participation the instance gets to keep it relevant.


* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 18:58               ` Jeff King
  2020-10-15 19:13                 ` Junio C Hamano
@ 2020-10-16  2:37                 ` brian m. carlson
  1 sibling, 0 replies; 20+ messages in thread
From: brian m. carlson @ 2020-10-16  2:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Dipl. Ing. Sergey Brester, git


On 2020-10-15 at 18:58:53, Jeff King wrote:
> On Thu, Oct 15, 2020 at 11:35:28AM -0700, Junio C Hamano wrote:
> > I do not know if an issue-tracker would have helped, though.  The
> > issue was discovered and discussed there the day before:
> > 
> >   https://lore.kernel.org/git/xmqqimg5o5fq.fsf@gitster.c.googlers.com/
> 
> Doh, and I was so proud of myself for diagnosing and fixing it. ;)

Well, you did write a great commit message and patch with functional
tests.

> Looking over the thread, I don't see any problems pointed out (though
> as your diff below shows, the original patch missed the re-ordering
> required for the submodule mapping call).
> 
> So I'd prefer my patch because of that fix and because of the tests.

Yeah, I'm fine with taking your patch as well.  Thanks for the tests,
which I think help us avoid future regressions here.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US



* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-15 19:05               ` Jeff King
  2020-10-15 19:06                 ` Jeff King
@ 2020-10-16  3:18                 ` brian m. carlson
  2020-10-16 20:25                   ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: brian m. carlson @ 2020-10-16  3:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Dipl. Ing. Sergey Brester, git


On 2020-10-15 at 19:05:32, Jeff King wrote:
> On Thu, Oct 15, 2020 at 11:35:28AM -0700, Junio C Hamano wrote:
> 
> > For this particular case, what we need is a functioning
> > patch tracker *and* people who pay attention to patches in the "came
> > very close to conclusion but no final patch in the tree" state.  We
> > need people who can volunteer their eyeballs and attention to nudge,
> > prod and help patches to perfection, and that won't be me.
> 
> Usually I'd expect this to be the responsibility of the patch submitter
> to make sure their stuff gets merged (and if not, to figure out why).

Normally I try to keep up with what's cooking emails, but I remember the
original bug report came in on a day when I had some other event and I
probably got distracted with whatever else I was doing later and forgot
about keeping up with the patch.

It would be very convenient if we did have a functioning patch tracker
which could be looked up by user, because then it'd be easier to monitor
one's own series.

> Personally I make a branch for almost every patch/series I submit, no
> matter how trivial[1]. And then part of my daily ritual is seeing which
> ones have been merged, and dropping them. You can use git-cherry for
> that, though it's not 100% perfect (sometimes patches are munged as they
> are applied). I use a combination of that and aggressively rebasing
> patches forward (and eventually they rebase down into nothing when
> they've been fully merged).

I'm really terrible at deleting data, so I have (exactly) 256 branches
in my local repository.  Some of them are merged, and some are not.
This would be a viable approach if I were better about deleting old
series (and completing and sending in the prototypes I've built), or if
I sent in series that were smaller so rebasing them were not so big of a
time commitment.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] fast-import: fix over-allocation of marks storage
  2020-10-16  3:18                 ` brian m. carlson
@ 2020-10-16 20:25                   ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2020-10-16 20:25 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, Dipl. Ing. Sergey Brester, git

On Fri, Oct 16, 2020 at 03:18:34AM +0000, brian m. carlson wrote:

> > Personally I make a branch for almost every patch/series I submit, no
> > matter how trivial[1]. And then part of my daily ritual is seeing which
> > ones have been merged, and dropping them. You can use git-cherry for
> > that, though it's not 100% perfect (sometimes patches are munged as they
> > are applied). I use a combination of that and aggressively rebasing
> > patches forward (and eventually they rebase down into nothing when
> > they've been fully merged).
> 
> I'm really terrible at deleting data, so I have (exactly) 256 branches
> in my local repository.  Some of them are merged, and some are not.
> This would be a viable approach if I were better about deleting old
> series (and completing and sending in the prototypes I've built), or if
> I sent in series that were smaller so rebasing them were not so big of a
> time commitment.

That's pretty good. I only have about 80. :)

I hesitate to point anybody at my mass of it-works-for-me scripts for
fear of making their lives worse, but certainly you (or anybody) are
welcome to adopt them if you want to do the aggressive-rebase thing. You
can see them at:

  https://github.com/peff/git/tree/meta

I generally do:

  git clone --single-branch -b meta https://github.com/peff/git Meta

in my git.git checkout. And then:

  Meta/merged

shows what has been merged to master or next. Running:

  Meta/rebase

will try to rebase everything on its upstream (I almost always point
branch upstreams to origin/master, though the script also handles the
case that branch X's upstream is a local branch Y). That catches cases
where the patch-ids changed enough that git-cherry can't recognize them.

Ideally that shrinks merged series down to nothing through a combination
of automatic and manual skipping.  Of course that will often yield
conflicts if later already-merged patches in the series touched the same
lines of code. You can either "git rebase --skip" past such commits, or
if you realize the whole giant series is merged, just "rebase --abort"
and delete the branch manually.
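
As a minimal sketch of the git-cherry check described above (this is
not the actual Meta/merged script, and it assumes every topic's
upstream is origin/master): "git cherry" prefixes commits whose
patch-id already exists upstream with "-" and still-unmerged ones with
"+", so a branch with no "+" lines is a deletion candidate.

```shell
# List each local branch and whether all of its patches are upstream.
# Assumption: the comparison base is origin/master for every branch.
report_merged() {
    git for-each-ref --format='%(refname:short)' refs/heads/ |
    while read -r branch; do
        # Count commits git-cherry cannot find upstream by patch-id.
        unmerged=$(git cherry origin/master "$branch" | grep -c '^+' || true)
        if [ "$unmerged" -eq 0 ]; then
            echo "$branch: fully merged (candidate for deletion)"
        else
            echo "$branch: $unmerged patch(es) still unmerged"
        fi
    done
}
```

Run "report_merged" inside a clone; as noted above, it will miss
patches whose content was munged enough on application to change the
patch-id, which is what the rebase step catches.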

Perhaps either those scripts or at least the techniques within them
might help with your cleanup.

Sort-of orthogonal, I also use:

  Meta/private

to create a branch "private" that is a merge of all of my topics on top
of next (minus ones that have "-wip" in the name). That's what I use for
my daily-driver version of Git.

That actually helps somewhat with the rebase process because rerere
learns about conflicts early during this step, and then later the rebase
can reuse the results. It's not perfect, though: it merges in
alphabetical order, but the order in which topics graduate might be
different, meaning we see a conflict in the reverse order. Junio's Meta
scripts take a specific order from a topics file, but he has the luxury
of knowing that the same topics file is what will be used to actually
graduate the topics for real.
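
A rough sketch of that merge-everything idea (this is not the actual
Meta/private script; the branch names and the exclusion list are
assumptions) could look like:

```shell
# Rebuild a branch "private" from "next" and merge every local topic
# onto it in alphabetical order, skipping next, master, private itself,
# and anything whose name ends in "-wip".
build_private() {
    git checkout -qB private next
    git for-each-ref --format='%(refname:short)' refs/heads/ |
    grep -v -e '^next$' -e '^master$' -e '^private$' -e '-wip$' |
    while read -r topic; do
        # A conflict stops here so it can be resolved by hand (and,
        # with rerere enabled, remembered for the later real rebase).
        git merge -q --no-edit "$topic" || { echo "conflict merging $topic"; exit 1; }
    done
}
```

The alphabetical order is exactly the limitation mentioned above: it
need not match the order in which topics later graduate.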

Anyway. I hope that might perhaps help a little, but don't feel
compelled to jump down the rabbit hole if you're happy with your own
workflows. ;)

-Peff


end of thread, other threads:[~2020-10-16 20:25 UTC | newest]

Thread overview: 20+ messages
-- links below jump to the message on this page --
2020-10-14  9:22 git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump Dipl. Ing. Sergey Brester
2020-10-15  1:26 ` Jeff King
2020-10-15 11:50   ` Dipl. Ing. Sergey Brester
2020-10-15 15:38     ` [PATCH] fast-import: fix over-allocation of marks storage Jeff King
2020-10-15 17:29       ` Junio C Hamano
2020-10-15 17:34         ` Junio C Hamano
2020-10-15 18:09           ` Dipl. Ing. Sergey Brester
2020-10-15 18:35             ` Junio C Hamano
2020-10-15 18:58               ` Jeff King
2020-10-15 19:13                 ` Junio C Hamano
2020-10-16  2:37                 ` brian m. carlson
2020-10-15 19:05               ` Jeff King
2020-10-15 19:06                 ` Jeff King
2020-10-16  3:18                 ` brian m. carlson
2020-10-16 20:25                   ` Jeff King
2020-10-15 19:17               ` Dipl. Ing. Sergey Brester
2020-10-15 20:15                 ` Junio C Hamano
2020-10-15 17:57       ` René Scharfe
2020-10-15 15:52     ` git fast-import leaks memory drastically, so crashes with out of memory by attempt to import 22MB export dump René Scharfe
2020-10-15 16:19       ` Dipl. Ing. Sergey Brester
