All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] fast-export: speed improvements
@ 2013-05-03  4:31 Felipe Contreras
  2013-05-03  4:31 ` [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly Felipe Contreras
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03  4:31 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Antoine Pelisse, Johannes Schindelin, Felipe Contreras

Hi,

Parsing the marks of an import of the emacs repository moves fast-export to a
crawl. It takes 14 minutes in my setup, after these patches, it takes 1 second.

The important patches are #2 and #3, the rest are niceities.

Felipe Contreras (4):
  fast-{import,export}: use get_sha1_hex() directly
  fast-export: improve speed by skipping blobs
  fast-export: don't parse all the commits
  fast-import: only store commit objects

 builtin/fast-export.c | 17 ++++++++++-------
 fast-import.c         | 19 ++++++++++++-------
 2 files changed, 22 insertions(+), 14 deletions(-)

-- 
1.8.3.rc0.401.g45bba44

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly
  2013-05-03  4:31 [PATCH 0/4] fast-export: speed improvements Felipe Contreras
@ 2013-05-03  4:31 ` Felipe Contreras
  2013-05-03 21:50   ` Junio C Hamano
  2013-05-03  4:31 ` [PATCH 2/4] fast-export: improve speed by skipping blobs Felipe Contreras
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03  4:31 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Antoine Pelisse, Johannes Schindelin, Felipe Contreras

There's no point in calling get_sha1() if we know they are SHA-1s.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
---
 builtin/fast-export.c |  2 +-
 fast-import.c         | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index d60d675..a4dee14 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -621,7 +621,7 @@ static void import_marks(char *input_file)
 
 		mark = strtoumax(line + 1, &mark_end, 10);
 		if (!mark || mark_end == line + 1
-			|| *mark_end != ' ' || get_sha1(mark_end + 1, sha1))
+			|| *mark_end != ' ' || get_sha1_hex(mark_end + 1, sha1))
 			die("corrupt mark line: %s", line);
 
 		if (last_idnum < mark)
diff --git a/fast-import.c b/fast-import.c
index 5f539d7..e02f212 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1822,7 +1822,7 @@ static void read_marks(void)
 		*end = 0;
 		mark = strtoumax(line + 1, &end, 10);
 		if (!mark || end == line + 1
-			|| *end != ' ' || get_sha1(end + 1, sha1))
+			|| *end != ' ' || get_sha1_hex(end + 1, sha1))
 			die("corrupt mark line: %s", line);
 		e = find_object(sha1);
 		if (!e) {
@@ -2490,7 +2490,7 @@ static void note_change_n(struct branch *b, unsigned char *old_fanout)
 		if (commit_oe->type != OBJ_COMMIT)
 			die("Mark :%" PRIuMAX " not a commit", commit_mark);
 		hashcpy(commit_sha1, commit_oe->idx.sha1);
-	} else if (!get_sha1(p, commit_sha1)) {
+	} else if (!get_sha1_hex(p, commit_sha1)) {
 		unsigned long size;
 		char *buf = read_object_with_reference(commit_sha1,
 			commit_type, &size, commit_sha1);
@@ -2604,7 +2604,7 @@ static int parse_from(struct branch *b)
 			free(buf);
 		} else
 			parse_from_existing(b);
-	} else if (!get_sha1(from, b->sha1))
+	} else if (!get_sha1_hex(from, b->sha1))
 		parse_from_existing(b);
 	else
 		die("Invalid ref name or SHA1 expression: %s", from);
@@ -2632,7 +2632,7 @@ static struct hash_list *parse_merge(unsigned int *count)
 			if (oe->type != OBJ_COMMIT)
 				die("Mark :%" PRIuMAX " not a commit", idnum);
 			hashcpy(n->sha1, oe->idx.sha1);
-		} else if (!get_sha1(from, n->sha1)) {
+		} else if (!get_sha1_hex(from, n->sha1)) {
 			unsigned long size;
 			char *buf = read_object_with_reference(n->sha1,
 				commit_type, &size, n->sha1);
@@ -2792,7 +2792,7 @@ static void parse_new_tag(void)
 		oe = find_mark(from_mark);
 		type = oe->type;
 		hashcpy(sha1, oe->idx.sha1);
-	} else if (!get_sha1(from, sha1)) {
+	} else if (!get_sha1_hex(from, sha1)) {
 		struct object_entry *oe = find_object(sha1);
 		if (!oe) {
 			type = sha1_object_info(sha1, NULL);
-- 
1.8.3.rc0.401.g45bba44

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/4] fast-export: improve speed by skipping blobs
  2013-05-03  4:31 [PATCH 0/4] fast-export: speed improvements Felipe Contreras
  2013-05-03  4:31 ` [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly Felipe Contreras
@ 2013-05-03  4:31 ` Felipe Contreras
  2013-05-03 21:51   ` Junio C Hamano
  2013-05-03  4:31 ` [PATCH 3/4] fast-export: don't parse all the commits Felipe Contreras
  2013-05-03  4:31 ` [PATCH 4/4] fast-import: only store commit objects Felipe Contreras
  3 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03  4:31 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Antoine Pelisse, Johannes Schindelin, Felipe Contreras

We don't care about blobs, or any object other than commits, but in
order to find the type of object, we are parsing the whole thing, which
is slow, specially in big repositories with lots of big files.

There's no need for that, we can query the object information with
sha1_object_info();

Before this, loading the objects of a fresh emacs import, with 260598
blobs took 14 minutes, after this patch, it takes 3 seconds.

This is the way fast-import does it. Also die if the object is not
found (like fast-import).

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
---
 builtin/fast-export.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index a4dee14..a5b8da8 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -613,6 +613,7 @@ static void import_marks(char *input_file)
 		char *line_end, *mark_end;
 		unsigned char sha1[20];
 		struct object *object;
+		enum object_type type;
 
 		line_end = strchr(line, '\n');
 		if (line[0] != ':' || !line_end)
@@ -627,17 +628,19 @@ static void import_marks(char *input_file)
 		if (last_idnum < mark)
 			last_idnum = mark;
 
-		object = parse_object(sha1);
-		if (!object)
+		type = sha1_object_info(sha1, NULL);
+		if (type < 0)
+			die("object not found: %s", sha1_to_hex(sha1));
+
+		if (type != OBJ_COMMIT)
+			/* only commits */
 			continue;
 
+		object = parse_object(sha1);
+
 		if (object->flags & SHOWN)
 			error("Object %s already has a mark", sha1_to_hex(sha1));
 
-		if (object->type != OBJ_COMMIT)
-			/* only commits */
-			continue;
-
 		mark_object(object, mark);
 
 		object->flags |= SHOWN;
-- 
1.8.3.rc0.401.g45bba44

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/4] fast-export: don't parse all the commits
  2013-05-03  4:31 [PATCH 0/4] fast-export: speed improvements Felipe Contreras
  2013-05-03  4:31 ` [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly Felipe Contreras
  2013-05-03  4:31 ` [PATCH 2/4] fast-export: improve speed by skipping blobs Felipe Contreras
@ 2013-05-03  4:31 ` Felipe Contreras
  2013-05-03 21:54   ` Junio C Hamano
  2013-05-03  4:31 ` [PATCH 4/4] fast-import: only store commit objects Felipe Contreras
  3 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03  4:31 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Antoine Pelisse, Johannes Schindelin, Felipe Contreras

We don't need the parsed objects at this point, merely the information
that they have marks.

Seems to be three times faster in my setup with lots of objects.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index a5b8da8..3c5a701 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -636,7 +636,7 @@ static void import_marks(char *input_file)
 			/* only commits */
 			continue;
 
-		object = parse_object(sha1);
+		object = lookup_unknown_object(sha1);
 
 		if (object->flags & SHOWN)
 			error("Object %s already has a mark", sha1_to_hex(sha1));
-- 
1.8.3.rc0.401.g45bba44

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/4] fast-import: only store commit objects
  2013-05-03  4:31 [PATCH 0/4] fast-export: speed improvements Felipe Contreras
                   ` (2 preceding siblings ...)
  2013-05-03  4:31 ` [PATCH 3/4] fast-export: don't parse all the commits Felipe Contreras
@ 2013-05-03  4:31 ` Felipe Contreras
  2013-05-03 17:56   ` Thomas Rast
  3 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03  4:31 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Antoine Pelisse, Johannes Schindelin, Felipe Contreras

There's no point in storing blob, they would increase the time of
loading the marks, and the vast majority of them will never be used
again.

This also makes fast-export and fast-import marks compatible.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
---
 fast-import.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index e02f212..c583975 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1743,9 +1743,14 @@ static void dump_marks_helper(FILE *f,
 		}
 	} else {
 		for (k = 0; k < 1024; k++) {
-			if (m->data.marked[k])
+			if (m->data.marked[k]) {
+				struct object_entry *e;
+				e = m->data.marked[k];
+				if (e->type != OBJ_COMMIT)
+					continue;
 				fprintf(f, ":%" PRIuMAX " %s\n", base + k,
-					sha1_to_hex(m->data.marked[k]->idx.sha1));
+					sha1_to_hex(e->idx.sha1));
+			}
 		}
 	}
 }
-- 
1.8.3.rc0.401.g45bba44

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03  4:31 ` [PATCH 4/4] fast-import: only store commit objects Felipe Contreras
@ 2013-05-03 17:56   ` Thomas Rast
  2013-05-03 18:23     ` Felipe Contreras
  2013-05-03 22:08     ` Junio C Hamano
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Rast @ 2013-05-03 17:56 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

> There's no point in storing blob, they would increase the time of
> loading the marks, and the vast majority of them will never be used
> again.
>
> This also makes fast-export and fast-import marks compatible.
[...]
> -			if (m->data.marked[k])
> +			if (m->data.marked[k]) {
> +				struct object_entry *e;
> +				e = m->data.marked[k];
> +				if (e->type != OBJ_COMMIT)
> +					continue;
>  				fprintf(f, ":%" PRIuMAX " %s\n", base + k,
> -					sha1_to_hex(m->data.marked[k]->idx.sha1));
> +					sha1_to_hex(e->idx.sha1));
> +			}

IIUC, you are unconditionally storing only marks to commit objects.

Are you allowed to do that at this point?  I notice that
git-fast-export(1) says

   --export-marks=<file>
       Dumps the internal marks table to <file> when complete. Marks are
       written one per line as :markid SHA-1. Only marks for revisions
       are dumped[...]

But git-fast-import(1) says nothing of the sort; I would even claim that

   --export-marks=<file>
       Dumps the internal marks table to <file> when complete.

means that the *full* marks table is dumped.

How do we know that this doesn't break any users of fast-import?  Your
comment isn't very reassuring:

> the vast majority of them will never be used again

So what's with the minority?

In any case, if this does go in, please update the documentation to
match, probably by copying the sentence from git-fast-export(1).

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 17:56   ` Thomas Rast
@ 2013-05-03 18:23     ` Felipe Contreras
  2013-05-06 10:28       ` Michael Haggerty
  2013-05-03 22:08     ` Junio C Hamano
  1 sibling, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03 18:23 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
> Felipe Contreras <felipe.contreras@gmail.com> writes:

> How do we know that this doesn't break any users of fast-import?  Your
> comment isn't very reassuring:
>
>> the vast majority of them will never be used again
>
> So what's with the minority?

Actually I don't think there's any minority. If the client program
doesn't store blobs, the blob marks are not used anyway. So there's no
change.

However, there's a chance the client programs do rely on blob objects,
in which case things would break. I would like to analyze some client
programs of fast-import out there, to see what would be the impact.

But I think this has to be done either way, the only question is how.
Having a warning would take a lot of effort, and it might be for
nothing. I think it might make sense to knowingly make the potentially
dangerous change, and see what breaks. Most likely nothing will, but
if something does, we revert immediately.

Otherwise we would be stuck in this non-ideal state forever.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly
  2013-05-03  4:31 ` [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly Felipe Contreras
@ 2013-05-03 21:50   ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2013-05-03 21:50 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: git, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

> There's no point in calling get_sha1() if we know they are SHA-1s.

If we know they _have to be_ 40-hex object names, calling get_sha1()
is not just pointless but outright wrong and these calls have to be
get_sha1_hex().

Looks like a good change to me.

>
> Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
> ---
>  builtin/fast-export.c |  2 +-
>  fast-import.c         | 10 +++++-----
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index d60d675..a4dee14 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -621,7 +621,7 @@ static void import_marks(char *input_file)
>  
>  		mark = strtoumax(line + 1, &mark_end, 10);
>  		if (!mark || mark_end == line + 1
> -			|| *mark_end != ' ' || get_sha1(mark_end + 1, sha1))
> +			|| *mark_end != ' ' || get_sha1_hex(mark_end + 1, sha1))
>  			die("corrupt mark line: %s", line);
>  
>  		if (last_idnum < mark)
> diff --git a/fast-import.c b/fast-import.c
> index 5f539d7..e02f212 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1822,7 +1822,7 @@ static void read_marks(void)
>  		*end = 0;
>  		mark = strtoumax(line + 1, &end, 10);
>  		if (!mark || end == line + 1
> -			|| *end != ' ' || get_sha1(end + 1, sha1))
> +			|| *end != ' ' || get_sha1_hex(end + 1, sha1))
>  			die("corrupt mark line: %s", line);
>  		e = find_object(sha1);
>  		if (!e) {
> @@ -2490,7 +2490,7 @@ static void note_change_n(struct branch *b, unsigned char *old_fanout)
>  		if (commit_oe->type != OBJ_COMMIT)
>  			die("Mark :%" PRIuMAX " not a commit", commit_mark);
>  		hashcpy(commit_sha1, commit_oe->idx.sha1);
> -	} else if (!get_sha1(p, commit_sha1)) {
> +	} else if (!get_sha1_hex(p, commit_sha1)) {
>  		unsigned long size;
>  		char *buf = read_object_with_reference(commit_sha1,
>  			commit_type, &size, commit_sha1);
> @@ -2604,7 +2604,7 @@ static int parse_from(struct branch *b)
>  			free(buf);
>  		} else
>  			parse_from_existing(b);
> -	} else if (!get_sha1(from, b->sha1))
> +	} else if (!get_sha1_hex(from, b->sha1))
>  		parse_from_existing(b);
>  	else
>  		die("Invalid ref name or SHA1 expression: %s", from);
> @@ -2632,7 +2632,7 @@ static struct hash_list *parse_merge(unsigned int *count)
>  			if (oe->type != OBJ_COMMIT)
>  				die("Mark :%" PRIuMAX " not a commit", idnum);
>  			hashcpy(n->sha1, oe->idx.sha1);
> -		} else if (!get_sha1(from, n->sha1)) {
> +		} else if (!get_sha1_hex(from, n->sha1)) {
>  			unsigned long size;
>  			char *buf = read_object_with_reference(n->sha1,
>  				commit_type, &size, n->sha1);
> @@ -2792,7 +2792,7 @@ static void parse_new_tag(void)
>  		oe = find_mark(from_mark);
>  		type = oe->type;
>  		hashcpy(sha1, oe->idx.sha1);
> -	} else if (!get_sha1(from, sha1)) {
> +	} else if (!get_sha1_hex(from, sha1)) {
>  		struct object_entry *oe = find_object(sha1);
>  		if (!oe) {
>  			type = sha1_object_info(sha1, NULL);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] fast-export: improve speed by skipping blobs
  2013-05-03  4:31 ` [PATCH 2/4] fast-export: improve speed by skipping blobs Felipe Contreras
@ 2013-05-03 21:51   ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2013-05-03 21:51 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: git, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

> We don't care about blobs, or any object other than commits, but in
> order to find the type of object, we are parsing the whole thing, which
> is slow, specially in big repositories with lots of big files.
>
> There's no need for that, we can query the object information with
> sha1_object_info();
>
> Before this, loading the objects of a fresh emacs import, with 260598
> blobs took 14 minutes, after this patch, it takes 3 seconds.

OK.

>
> This is the way fast-import does it. Also die if the object is not
> found (like fast-import).
>
> Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
> ---


>  builtin/fast-export.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index a4dee14..a5b8da8 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -613,6 +613,7 @@ static void import_marks(char *input_file)
>  		char *line_end, *mark_end;
>  		unsigned char sha1[20];
>  		struct object *object;
> +		enum object_type type;
>  
>  		line_end = strchr(line, '\n');
>  		if (line[0] != ':' || !line_end)
> @@ -627,17 +628,19 @@ static void import_marks(char *input_file)
>  		if (last_idnum < mark)
>  			last_idnum = mark;
>  
> -		object = parse_object(sha1);
> -		if (!object)
> +		type = sha1_object_info(sha1, NULL);
> +		if (type < 0)
> +			die("object not found: %s", sha1_to_hex(sha1));
> +
> +		if (type != OBJ_COMMIT)
> +			/* only commits */
>  			continue;
>  
> +		object = parse_object(sha1);
> +
>  		if (object->flags & SHOWN)
>  			error("Object %s already has a mark", sha1_to_hex(sha1));
>  
> -		if (object->type != OBJ_COMMIT)
> -			/* only commits */
> -			continue;
> -
>  		mark_object(object, mark);
>  
>  		object->flags |= SHOWN;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] fast-export: don't parse all the commits
  2013-05-03  4:31 ` [PATCH 3/4] fast-export: don't parse all the commits Felipe Contreras
@ 2013-05-03 21:54   ` Junio C Hamano
  2013-05-04  0:06     ` Felipe Contreras
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2013-05-03 21:54 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: git, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

> We don't need the parsed objects at this point, merely the information
> that they have marks.
>
> Seems to be three times faster in my setup with lots of objects.
>
> Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
> ---
>  builtin/fast-export.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index a5b8da8..3c5a701 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -636,7 +636,7 @@ static void import_marks(char *input_file)
>  			/* only commits */
>  			continue;
>  
> -		object = parse_object(sha1);
> +		object = lookup_unknown_object(sha1);

This updates the parse_object() moved by the previous patch. At this
point in the codeflow, unlike the original, we already _know_ the
object must be a commit; wouldn't an equivalent of:

	object = &(lookup_commit(sha1)->object)

be more correct here?

>  
>  		if (object->flags & SHOWN)
>  			error("Object %s already has a mark", sha1_to_hex(sha1));

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 17:56   ` Thomas Rast
  2013-05-03 18:23     ` Felipe Contreras
@ 2013-05-03 22:08     ` Junio C Hamano
  2013-05-03 22:19       ` Felipe Contreras
  1 sibling, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2013-05-03 22:08 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Felipe Contreras, git, Antoine Pelisse, Johannes Schindelin

Thomas Rast <trast@inf.ethz.ch> writes:

> IIUC, you are unconditionally storing only marks to commit objects.
>
> Are you allowed to do that at this point?  I notice that
> git-fast-export(1) says
>
>    --export-marks=<file>
>        Dumps the internal marks table to <file> when complete. Marks are
>        written one per line as :markid SHA-1. Only marks for revisions
>        are dumped[...]
>
> But git-fast-import(1) says nothing of the sort; I would even claim that
>
>    --export-marks=<file>
>        Dumps the internal marks table to <file> when complete.
>
> means that the *full* marks table is dumped.
>
> How do we know that this doesn't break any users of fast-import?  Your
> comment isn't very reassuring:
>
>> the vast majority of them will never be used again
>
> So what's with the minority?
>
> In any case, if this does go in, please update the documentation to
> match, probably by copying the sentence from git-fast-export(1).

A safe and sane approach may be to teach these an option to tell
them to omit non-commits or to emit all kinds, and make remote-bzr
use that to exclude non-commits.  If the defaults is matched to the
current behaviour, nobody will get hurt, and Felipe's Emacs import,
knowing that it does not need marks to blobs, can take advantage of
the new feature.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 22:08     ` Junio C Hamano
@ 2013-05-03 22:19       ` Felipe Contreras
  2013-05-03 23:45         ` Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-03 22:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On Fri, May 3, 2013 at 5:08 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Thomas Rast <trast@inf.ethz.ch> writes:
>
>> IIUC, you are unconditionally storing only marks to commit objects.
>>
>> Are you allowed to do that at this point?  I notice that
>> git-fast-export(1) says
>>
>>    --export-marks=<file>
>>        Dumps the internal marks table to <file> when complete. Marks are
>>        written one per line as :markid SHA-1. Only marks for revisions
>>        are dumped[...]
>>
>> But git-fast-import(1) says nothing of the sort; I would even claim that
>>
>>    --export-marks=<file>
>>        Dumps the internal marks table to <file> when complete.
>>
>> means that the *full* marks table is dumped.
>>
>> How do we know that this doesn't break any users of fast-import?  Your
>> comment isn't very reassuring:
>>
>>> the vast majority of them will never be used again
>>
>> So what's with the minority?
>>
>> In any case, if this does go in, please update the documentation to
>> match, probably by copying the sentence from git-fast-export(1).
>
> A safe and sane approach may be to teach these an option to tell
> them to omit non-commits or to emit all kinds, and make remote-bzr
> use that to exclude non-commits.

This has nothing to do with remote-bzr, or any remote helper. These
objects are not useful, not even to 'git fast-export'.

> If the defaults is matched to the
> current behaviour, nobody will get hurt

Changing nothing always ensures that nobody will get hurt, but that
doesn't improve anything either.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 22:19       ` Felipe Contreras
@ 2013-05-03 23:45         ` Junio C Hamano
  2013-05-04  0:01           ` Felipe Contreras
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2013-05-03 23:45 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

>> A safe and sane approach may be to teach these an option to tell
>> them to omit non-commits or to emit all kinds, and make remote-bzr
>> use that to exclude non-commits.
>
> This has nothing to do with remote-bzr, or any remote helper. These
> objects are not useful, not even to 'git fast-export'.
>
>> If the defaults is matched to the
>> current behaviour, nobody will get hurt
>
> Changing nothing always ensures that nobody will get hurt, but that
> doesn't improve anything either.

I do not quite follow.  Allowing existing code to depend on old
behaviour, while letting new code that knows it does not need
anything but commits, will get the best of both worlds. The new code
will definitely improve (otherwise you won't be writing these
patches), and the old code will not get hurt. Where is this "doesn't
improve anything" coming from?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 23:45         ` Junio C Hamano
@ 2013-05-04  0:01           ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-04  0:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On Fri, May 3, 2013 at 6:45 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Felipe Contreras <felipe.contreras@gmail.com> writes:
>
>>> A safe and sane approach may be to teach these an option to tell
>>> them to omit non-commits or to emit all kinds, and make remote-bzr
>>> use that to exclude non-commits.
>>
>> This has nothing to do with remote-bzr, or any remote helper. These
>> objects are not useful, not even to 'git fast-export'.
>>
>>> If the defaults is matched to the
>>> current behaviour, nobody will get hurt
>>
>> Changing nothing always ensures that nobody will get hurt, but that
>> doesn't improve anything either.
>
> I do not quite follow.  Allowing existing code to depend on old
> behaviour, while letting new code that knows it does not need
> anything but commits, will get the best of both worlds. The new code
> will definitely improve (otherwise you won't be writing these
> patches), and the old code will not get hurt. Where is this "doesn't
> improve anything" coming from?

So let's suppose we add a new 'feature no-blob-store' to 'git
fast-import', and some clients of 'git fast-import' enable it, and
benefit from it.

How does that benefit the rest of fast-import clients? You are
worrying about clients that most likely don't exist, and don't let
existing real clients benefit for free. This is like premature
optimization.

But whatever, let's keep writing and discarding these blobs, at least
the previous patches do make such operations fast.

You can drop this patch, because I'm not going to write code for
clients that don't exist. And it seems clear that even if I
investigate client apps of 'git fast-import' and I'm unable to find a
single one that utilizes blobs, you still wouldn't apply this patch.
So there's no point in investigating what are the *actual* users, if
all we are every going to do is worry about hypothetical ones.

My main interest is the previous patches, if anybody has any issues
with the way this patch is handled, they can either work on the patch
itself, or start a new thread in which I will not participate, I will
rather concentrate on issues that affect *real* users, and have real
fixes reachable today, and the previous patches in this series fit
that bill.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] fast-export: don't parse all the commits
  2013-05-03 21:54   ` Junio C Hamano
@ 2013-05-04  0:06     ` Felipe Contreras
  2013-05-04 19:22       ` Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-04  0:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Antoine Pelisse, Johannes Schindelin

On Fri, May 3, 2013 at 4:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Felipe Contreras <felipe.contreras@gmail.com> writes:
>
>> We don't need the parsed objects at this point, merely the information
>> that they have marks.
>>
>> Seems to be three times faster in my setup with lots of objects.
>>
>> Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
>> ---
>>  builtin/fast-export.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
>> index a5b8da8..3c5a701 100644
>> --- a/builtin/fast-export.c
>> +++ b/builtin/fast-export.c
>> @@ -636,7 +636,7 @@ static void import_marks(char *input_file)
>>                       /* only commits */
>>                       continue;
>>
>> -             object = parse_object(sha1);
>> +             object = lookup_unknown_object(sha1);
>
> This updates the parse_object() moved by the previous patch. At this
> point in the codeflow, unlike the original, we already _know_ the
> object must be a commit; wouldn't an equivalent of:
>
>         object = &(lookup_commit(sha1)->object)
>
> be more correct here?

Maybe, if we want to run some extra code we don't care about.

The only actual difference is that object->type will be OBJ_COMMIT,
but a) this is not going to be used anywhere, and b) we can set that
ourselves.

In fact, my original code was:

	object = lookup_object(sha1);
	if (!object)
		object = create_object(sha1, OBJ_COMMIT, alloc_object_node());

But I figured there's no need for those extra lines of code.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] fast-export: don't parse all the commits
  2013-05-04  0:06     ` Felipe Contreras
@ 2013-05-04 19:22       ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2013-05-04 19:22 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: git, Antoine Pelisse, Johannes Schindelin

Felipe Contreras <felipe.contreras@gmail.com> writes:

>> This updates the parse_object() moved by the previous patch. At this
>> point in the codeflow, unlike the original, we already _know_ the
>> object must be a commit; wouldn't an equivalent of:
>>
>>         object = &(lookup_commit(sha1)->object)
>>
>> be more correct here?
>
> Maybe, if we want to run some extra code we don't care about.
>
> The only actual difference is that object->type will be OBJ_COMMIT,
> ...

The reason to prefer lookup_commit() over lookup_unknown_object() is
primarily that we allocate the storage for known type, not union,
out of the allocation pool (alloc.c) for the known type.  Also we
mark the result with the type so that we can catch mistakes in data
when later code somehow expects an object with the same object name
to be of different type; while the latter is a good safety measure,
I do not think it matters that much in this codepath. The former is
more important in general.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-03 18:23     ` Felipe Contreras
@ 2013-05-06 10:28       ` Michael Haggerty
  2013-05-06 10:32         ` Thomas Rast
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Haggerty @ 2013-05-06 10:28 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On 05/03/2013 08:23 PM, Felipe Contreras wrote:
> On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
>> Felipe Contreras <felipe.contreras@gmail.com> writes:
> 
>> How do we know that this doesn't break any users of fast-import?  Your
>> comment isn't very reassuring:
>>
>>> the vast majority of them will never be used again
>>
>> So what's with the minority?
> 
> Actually I don't think there's any minority. If the client program
> doesn't store blobs, the blob marks are not used anyway. So there's no
> change.

I haven't been following this conversation in detail, but your proposed
change sounds like something that would break cvs2git [1].  Let me
explain what cvs2git does and why:

CVS stores all of the revisions of a single file in a single filename,v
file in rcsfile(5) format.  The revisions are stored as deltas ordered
so that a single revision can be reconstructed from a single serial read
of the file.

cvs2git reads each of these files once, reconstructing *all* of the
revisions for a file in a single go.  It then pours them into a
git-fast-import stream as blobs and sets a mark on each blob.

Only much later in the conversion does it have enough information to
reconstruct tree-wide commits.  At that time it outputs git-fast-import
data (to a second file) defining the git commits and their ancestry.
The contents are defined by referring to the marks of blobs from the
first git-fast-import stream file.

This strategy speeds up the conversion *enormously*.

So if I understand correctly that you are proposing to stop allowing
marks on blob objects to be set and/or referred to later, then I object
vociferously.

If I've misunderstood then I'll go back into my hole :-)

Michael

[1] http://cvs2svn.tigris.org/cvs2git.html

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 10:28       ` Michael Haggerty
@ 2013-05-06 10:32         ` Thomas Rast
  2013-05-06 10:45           ` Michael Haggerty
  2013-05-06 12:20           ` Johannes Schindelin
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Rast @ 2013-05-06 10:32 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Felipe Contreras, git, Junio C Hamano, Antoine Pelisse,
	Johannes Schindelin

Michael Haggerty <mhagger@alum.mit.edu> writes:

> On 05/03/2013 08:23 PM, Felipe Contreras wrote:
>> On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>> Felipe Contreras <felipe.contreras@gmail.com> writes:
>> 
>>> How do we know that this doesn't break any users of fast-import?  Your
>>> comment isn't very reassuring:
>>>
>>>> the vast majority of them will never be used again
>>>
>>> So what's with the minority?
>> 
>> Actually I don't think there's any minority. If the client program
>> doesn't store blobs, the blob marks are not used anyway. So there's no
>> change.
>
> I haven't been following this conversation in detail, but your proposed
> change sounds like something that would break cvs2git [1].  Let me
> explain what cvs2git does and why:
>
> CVS stores all of the revisions of a single file in a single filename,v
> file in rcsfile(5) format.  The revisions are stored as deltas ordered
> so that a single revision can be reconstructed from a single serial read
> of the file.
>
> cvs2git reads each of these files once, reconstructing *all* of the
> revisions for a file in a single go.  It then pours them into a
> git-fast-import stream as blobs and sets a mark on each blob.
>
> Only much later in the conversion does it have enough information to
> reconstruct tree-wide commits.  At that time it outputs git-fast-import
> data (to a second file) defining the git commits and their ancestry.
> The contents are defined by referring to the marks of blobs from the
> first git-fast-import stream file.
>
> This strategy speeds up the conversion *enormously*.
>
> So if I understand correctly that you are proposing to stop allowing
> marks on blob objects to be set and/or referred to later, then I object
> vociferously.

The proposed patch wants to stop writing marks (in --export-marks) for
anything but commits.  Does cvs2git depend on that?  I.e., are you using
two separate fast-import processes for the blob and tree/commit phases
you describe above?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 10:32         ` Thomas Rast
@ 2013-05-06 10:45           ` Michael Haggerty
  2013-05-06 15:18             ` Junio C Hamano
  2013-05-06 21:04             ` Felipe Contreras
  2013-05-06 12:20           ` Johannes Schindelin
  1 sibling, 2 replies; 38+ messages in thread
From: Michael Haggerty @ 2013-05-06 10:45 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Felipe Contreras, git, Junio C Hamano, Antoine Pelisse,
	Johannes Schindelin

On 05/06/2013 12:32 PM, Thomas Rast wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> On 05/03/2013 08:23 PM, Felipe Contreras wrote:
>>> On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>>> Felipe Contreras <felipe.contreras@gmail.com> writes:
>>>
>>>> How do we know that this doesn't break any users of fast-import?  Your
>>>> comment isn't very reassuring:
>>>>
>>>>> the vast majority of them will never be used again
>>>>
>>>> So what's with the minority?
>>>
>>> Actually I don't think there's any minority. If the client program
>>> doesn't store blobs, the blob marks are not used anyway. So there's no
>>> change.
>>
>> I haven't been following this conversation in detail, but your proposed
>> change sounds like something that would break cvs2git [1].  Let me
>> explain what cvs2git does and why:
>>
>> CVS stores all of the revisions of a single file in a single filename,v
>> file in rcsfile(5) format.  The revisions are stored as deltas ordered
>> so that a single revision can be reconstructed from a single serial read
>> of the file.
>>
>> cvs2git reads each of these files once, reconstructing *all* of the
>> revisions for a file in a single go.  It then pours them into a
>> git-fast-import stream as blobs and sets a mark on each blob.
>>
>> Only much later in the conversion does it have enough information to
>> reconstruct tree-wide commits.  At that time it outputs git-fast-import
>> data (to a second file) defining the git commits and their ancestry.
>> The contents are defined by referring to the marks of blobs from the
>> first git-fast-import stream file.
>>
>> This strategy speeds up the conversion *enormously*.
>>
>> So if I understand correctly that you are proposing to stop allowing
>> marks on blob objects to be set and/or referred to later, then I object
>> vociferously.
> 
> The proposed patch wants to stop writing marks (in --export-marks) for
> anything but commits.  Does cvs2git depend on that?  I.e., are you using
> two separate fast-import processes for the blob and tree/commit phases
> you describe above?

Yes, it can be handy to start loading the first "blobfile" in parallel
with the later stages of the conversion, before the second "dumpfile" is
ready.  In that case the user needs to pass --export-marks to the first
fast-import process to export marks on blobs so that the marks can be
passed to the second fast-import via --import-marks.

So the proposed change would break a documented use of cvs2git.

Making the export of blob marks optional would of course be OK, as long
as the default is to export them.

Michael


-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 10:32         ` Thomas Rast
  2013-05-06 10:45           ` Michael Haggerty
@ 2013-05-06 12:20           ` Johannes Schindelin
  2013-05-06 21:06             ` Felipe Contreras
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Schindelin @ 2013-05-06 12:20 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Michael Haggerty, Felipe Contreras, git, Junio C Hamano, Antoine Pelisse

Hi Thomas,

On Mon, 6 May 2013, Thomas Rast wrote:

> The proposed patch wants to stop writing marks (in --export-marks) for
> anything but commits.

Then it should not go in.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 10:45           ` Michael Haggerty
@ 2013-05-06 15:18             ` Junio C Hamano
  2013-05-06 21:19               ` Felipe Contreras
  2013-05-06 21:04             ` Felipe Contreras
  1 sibling, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2013-05-06 15:18 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Rast, Felipe Contreras, git, Antoine Pelisse, Johannes Schindelin

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Yes, it can be handy to start loading the first "blobfile" in parallel
> with the later stages of the conversion, before the second "dumpfile" is
> ready.  In that case the user needs to pass --export-marks to the first
> fast-import process to export marks on blobs so that the marks can be
> passed to the second fast-import via --import-marks.
>
> So the proposed change would break a documented use of cvs2git.
>
> Making the export of blob marks optional would of course be OK, as long
> as the default is to export them.

Thanks for a concise summary.  Your use case fits exactly what
Felipe conjectured as the nonexistent minority.

An option that lets the caller say "I only care about marks on these
types of objects to be written to (and read from) the exported marks
file" would help Felipe's use case without harming your use case,
and would be a sane and safe way to go.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 10:45           ` Michael Haggerty
  2013-05-06 15:18             ` Junio C Hamano
@ 2013-05-06 21:04             ` Felipe Contreras
  2013-05-07  3:27               ` Michael Haggerty
  1 sibling, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-06 21:04 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 05/06/2013 12:32 PM, Thomas Rast wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>
>>> On 05/03/2013 08:23 PM, Felipe Contreras wrote:
>>>> On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>>>> Felipe Contreras <felipe.contreras@gmail.com> writes:
>>>>
>>>>> How do we know that this doesn't break any users of fast-import?  Your
>>>>> comment isn't very reassuring:
>>>>>
>>>>>> the vast majority of them will never be used again
>>>>>
>>>>> So what's with the minority?
>>>>
>>>> Actually I don't think there's any minority. If the client program
>>>> doesn't store blobs, the blob marks are not used anyway. So there's no
>>>> change.
>>>
>>> I haven't been following this conversation in detail, but your proposed
>>> change sounds like something that would break cvs2git [1].  Let me
>>> explain what cvs2git does and why:
>>>
>>> CVS stores all of the revisions of a single file in a single filename,v
>>> file in rcsfile(5) format.  The revisions are stored as deltas ordered
>>> so that a single revision can be reconstructed from a single serial read
>>> of the file.
>>>
>>> cvs2git reads each of these files once, reconstructing *all* of the
>>> revisions for a file in a single go.  It then pours them into a
>>> git-fast-import stream as blobs and sets a mark on each blob.
>>>
>>> Only much later in the conversion does it have enough information to
>>> reconstruct tree-wide commits.  At that time it outputs git-fast-import
>>> data (to a second file) defining the git commits and their ancestry.
>>> The contents are defined by referring to the marks of blobs from the
>>> first git-fast-import stream file.
>>>
>>> This strategy speeds up the conversion *enormously*.
>>>
>>> So if I understand correctly that you are proposing to stop allowing
>>> marks on blob objects to be set and/or referred to later, then I object
>>> vociferously.
>>
>> The proposed patch wants to stop writing marks (in --export-marks) for
>> anything but commits.  Does cvs2git depend on that?  I.e., are you using
>> two separate fast-import processes for the blob and tree/commit phases
>> you describe above?
>
> Yes, it can be handy to start loading the first "blobfile" in parallel
> with the later stages of the conversion, before the second "dumpfile" is
> ready.  In that case the user needs to pass --export-marks to the first
> fast-import process to export marks on blobs so that the marks can be
> passed to the second fast-import via --import-marks.

It can be used that way, but it doesn't have to be.

> So the proposed change would break a documented use of cvs2git.

It's documented as an alternative. How many people actually use this
form over the other? Is there really any advantage? It's possibly that
basically nobody is using this form, and there's no benefits.

> Making the export of blob marks optional would of course be OK, as long
> as the default is to export them.

Nobody benefits from leaving the default as it is.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 12:20           ` Johannes Schindelin
@ 2013-05-06 21:06             ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-06 21:06 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Thomas Rast, Michael Haggerty, git, Junio C Hamano, Antoine Pelisse

On Mon, May 6, 2013 at 7:20 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Thomas,
>
> On Mon, 6 May 2013, Thomas Rast wrote:
>
>> The proposed patch wants to stop writing marks (in --export-marks) for
>> anything but commits.
>
> Then it should not go in.

If that rationale was valid, no change in behavior should ever go in.
Of course, that is not the case, we want to move forward, so changes
in behavior do happen.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 15:18             ` Junio C Hamano
@ 2013-05-06 21:19               ` Felipe Contreras
  2013-05-06 21:36                 ` Felipe Contreras
  2013-05-07  2:58                 ` Michael Haggerty
  0 siblings, 2 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-06 21:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Michael Haggerty, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>> Yes, it can be handy to start loading the first "blobfile" in parallel
>> with the later stages of the conversion, before the second "dumpfile" is
>> ready.  In that case the user needs to pass --export-marks to the first
>> fast-import process to export marks on blobs so that the marks can be
>> passed to the second fast-import via --import-marks.
>>
>> So the proposed change would break a documented use of cvs2git.
>>
>> Making the export of blob marks optional would of course be OK, as long
>> as the default is to export them.
>
> Thanks for a concise summary.  Your use case fits exactly what
> Felipe conjectured as the nonexistent minority.

Not true. cvs2git does *not* rely on the blobs being stored in a marks
file, because cvs2git does not rely on mark files at all.

> An option that lets the caller say "I only care about marks on these
> types of objects to be written to (and read from) the exported marks
> file" would help Felipe's use case without harming your use case,
> and would be a sane and safe way to go.

His case is not harmed at all. It's only the unfortunate command that
is mentioned in the documentation that didn't need to be mentioned at
all in the first place.

It should be the other way around, if it's only this documentation
that is affected, we could add a switch for that particular command,
and the documentation should be updated, but it's overkill to add a
switch for one odd command in some documentation somewhere, it would
be much better to update the odd command to avoid using marks at all,
which is what the more appropriate command does, right below in the
same documentation.

  cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import

Should the rest of the real world be punished because somebody added a
command in some documentation somewhere, which wasn't actually needed
in the first place?

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 21:19               ` Felipe Contreras
@ 2013-05-06 21:36                 ` Felipe Contreras
  2013-05-07  3:14                   ` Michael Haggerty
  2013-05-07  2:58                 ` Michael Haggerty
  1 sibling, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-06 21:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Michael Haggerty, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On Mon, May 6, 2013 at 4:19 PM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>
>>> Yes, it can be handy to start loading the first "blobfile" in parallel
>>> with the later stages of the conversion, before the second "dumpfile" is
>>> ready.  In that case the user needs to pass --export-marks to the first
>>> fast-import process to export marks on blobs so that the marks can be
>>> passed to the second fast-import via --import-marks.
>>>
>>> So the proposed change would break a documented use of cvs2git.
>>>
>>> Making the export of blob marks optional would of course be OK, as long
>>> as the default is to export them.
>>
>> Thanks for a concise summary.  Your use case fits exactly what
>> Felipe conjectured as the nonexistent minority.
>
> Not true. cvs2git does *not* rely on the blobs being stored in a marks
> file, because cvs2git does not rely on mark files at all.
>
>> An option that lets the caller say "I only care about marks on these
>> types of objects to be written to (and read from) the exported marks
>> file" would help Felipe's use case without harming your use case,
>> and would be a sane and safe way to go.
>
> His case is not harmed at all. It's only the unfortunate command that
> is mentioned in the documentation that didn't need to be mentioned at
> all in the first place.
>
> It should be the other way around, if it's only this documentation
> that is affected, we could add a switch for that particular command,
> and the documentation should be updated, but it's overkill to add a
> switch for one odd command in some documentation somewhere, it would
> be much better to update the odd command to avoid using marks at all,
> which is what the more appropriate command does, right below in the
> same documentation.

This would simplify the documentation, and obliterate the need to use
mark files at all:

diff -ur cvs2svn-2.4.0/www/cvs2git.html cvs2svn-2.4.0-mod/www/cvs2git.html
--- cvs2svn-2.4.0/www/cvs2git.html	2012-09-22 01:49:55.000000000 -0500
+++ cvs2svn-2.4.0-mod/www/cvs2git.html	2013-05-06 16:33:12.070189985 -0500
@@ -355,14 +355,13 @@
       fast-import</tt>:</p>

 <pre>
-git fast-import --export-marks=../cvs2svn-tmp/git-marks.dat &lt;
../cvs2svn-tmp/git-blob.dat
-git fast-import --import-marks=../cvs2svn-tmp/git-marks.dat &lt;
../cvs2svn-tmp/git-dump.dat
+cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
 </pre>

-    <p>On Linux/Unix this can be shortened to:</p>
+    <p>On Windows you should use type instead:</p>

 <pre>
-cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
+type ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
 </pre>

   </li>
Only in cvs2svn-2.4.0-mod/www: .cvs2git.html.swp


-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 21:19               ` Felipe Contreras
  2013-05-06 21:36                 ` Felipe Contreras
@ 2013-05-07  2:58                 ` Michael Haggerty
  2013-05-07  4:37                   ` Felipe Contreras
  1 sibling, 1 reply; 38+ messages in thread
From: Michael Haggerty @ 2013-05-07  2:58 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Junio C Hamano, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On 05/06/2013 11:19 PM, Felipe Contreras wrote:
> On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>
>>> Yes, it can be handy to start loading the first "blobfile" in parallel
>>> with the later stages of the conversion, before the second "dumpfile" is
>>> ready.  In that case the user needs to pass --export-marks to the first
>>> fast-import process to export marks on blobs so that the marks can be
>>> passed to the second fast-import via --import-marks.
>>>
>>> So the proposed change would break a documented use of cvs2git.
>>>
>>> Making the export of blob marks optional would of course be OK, as long
>>> as the default is to export them.
>>
>> Thanks for a concise summary.  Your use case fits exactly what
>> Felipe conjectured as the nonexistent minority.
> 
> Not true. cvs2git does *not* rely on the blobs being stored in a marks
> file, because cvs2git does not rely on mark files at all.
> 
>> An option that lets the caller say "I only care about marks on these
>> types of objects to be written to (and read from) the exported marks
>> file" would help Felipe's use case without harming your use case,
>> and would be a sane and safe way to go.
> 
> His case is not harmed at all. It's only the unfortunate command that
> is mentioned in the documentation that didn't need to be mentioned at
> all in the first place.
> 
> It should be the other way around, if it's only this documentation
> that is affected, we could add a switch for that particular command,
> and the documentation should be updated, but it's overkill to add a
> switch for one odd command in some documentation somewhere, it would
> be much better to update the odd command to avoid using marks at all,
> which is what the more appropriate command does, right below in the
> same documentation.
> 
>   cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
> 
> Should the rest of the real world be punished because somebody added a
> command in some documentation somewhere, which wasn't actually needed
> in the first place?

Don't get too fixated on the documentation.  The documentation just
gives some examples of how cvs2git can be used.

The reason that cvs2git outputs two files is that the first file is
emitted at the very beginning of the conversion and the second at the
very end.  These conversions can take a long time (> 1 day for very big
repos), can be interrupted and restarted between "passes", and passes
can even be re-run with changed configurations.

CVS write access has to be turned off before the start of the final
conversion, so no VCS is possible until the conversion is over.  So
users are very interested in keeping the downtime minimal.  The blobfile
can also be unwieldy (its size is approximately the sum of the sizes of
all revisions of all files in the project).  Being able to load the
blobfile into one fast-import process and the dumpfile into a different
process (which relies on the feature that you propose removing) opens up
a lot of possibilities:

* The first fast-import of the blobfile can be started as soon as the
blobfile is complete and run in parallel with the rest of the conversion.

* If the blobfile needs to be transferred over the network (e.g.,
because Git will be served from a different server than the one doing
the conversion) the network transfer can also be done in parallel with
the rest of the conversion.

* The blobfile could be written to a named pipe that is being read by a
git-fast-import process, to avoid having to write the blobfile to disk
in the first place.

* The user could run "git repack" between loading the blobfile and
loading the dumpfile.

These are just the ways that cvs2git does and/or could benefit from the
flexibility that is now in git-fast-import.  Other tools might also be
using git-fast-import in ways that would be broken by your proposed change.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 21:36                 ` Felipe Contreras
@ 2013-05-07  3:14                   ` Michael Haggerty
  2013-05-07  4:32                     ` Johannes Schindelin
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Haggerty @ 2013-05-07  3:14 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Junio C Hamano, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On 05/06/2013 11:36 PM, Felipe Contreras wrote:
> This would simplify the documentation, and obliterate the need to use
> mark files at all:

As explained in my other email, this documentation change does not
remove all of the reasons that users might want to use mark files.  I
would still like to show users how they can load the files into Git as
two separate steps.

> diff -ur cvs2svn-2.4.0/www/cvs2git.html cvs2svn-2.4.0-mod/www/cvs2git.html
> --- cvs2svn-2.4.0/www/cvs2git.html	2012-09-22 01:49:55.000000000 -0500
> +++ cvs2svn-2.4.0-mod/www/cvs2git.html	2013-05-06 16:33:12.070189985 -0500
> @@ -355,14 +355,13 @@
>        fast-import</tt>:</p>
> 
>  <pre>
> -git fast-import --export-marks=../cvs2svn-tmp/git-marks.dat &lt;
> ../cvs2svn-tmp/git-blob.dat
> -git fast-import --import-marks=../cvs2svn-tmp/git-marks.dat &lt;
> ../cvs2svn-tmp/git-dump.dat
> +cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
>  </pre>
> 
> -    <p>On Linux/Unix this can be shortened to:</p>
> +    <p>On Windows you should use type instead:</p>
> 
>  <pre>
> -cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
> +type ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
>  </pre>
> 
>    </li>
> Only in cvs2svn-2.4.0-mod/www: .cvs2git.html.swp
> 
> 

Nevertheless, it *would* be nice to add the Windows equivalent of the
"cat" pipeline.  I knew about the "type" command but I was under the
impression that it is intended for text files and can corrupt binary
files.  Are you sure that using "type" as you suggest is binary-clean?

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-06 21:04             ` Felipe Contreras
@ 2013-05-07  3:27               ` Michael Haggerty
  2013-05-07  4:39                 ` Johannes Schindelin
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Michael Haggerty @ 2013-05-07  3:27 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On 05/06/2013 11:04 PM, Felipe Contreras wrote:
> On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On 05/06/2013 12:32 PM, Thomas Rast wrote:
>>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>>
>>>> On 05/03/2013 08:23 PM, Felipe Contreras wrote:
>>>>> On Fri, May 3, 2013 at 12:56 PM, Thomas Rast <trast@inf.ethz.ch> wrote:
>>>>>> Felipe Contreras <felipe.contreras@gmail.com> writes:
>>>>>
>>>>>> How do we know that this doesn't break any users of fast-import?  Your
>>>>>> comment isn't very reassuring:
>>>>>>
>>>>>>> the vast majority of them will never be used again
>>>>>>
>>>>>> So what's with the minority?
>>>>>
>>>>> Actually I don't think there's any minority. If the client program
>>>>> doesn't store blobs, the blob marks are not used anyway. So there's no
>>>>> change.
>>>>
>>>> I haven't been following this conversation in detail, but your proposed
>>>> change sounds like something that would break cvs2git [1].  Let me
>>>> explain what cvs2git does and why:
>>>>
>>>> CVS stores all of the revisions of a single file in a single filename,v
>>>> file in rcsfile(5) format.  The revisions are stored as deltas ordered
>>>> so that a single revision can be reconstructed from a single serial read
>>>> of the file.
>>>>
>>>> cvs2git reads each of these files once, reconstructing *all* of the
>>>> revisions for a file in a single go.  It then pours them into a
>>>> git-fast-import stream as blobs and sets a mark on each blob.
>>>>
>>>> Only much later in the conversion does it have enough information to
>>>> reconstruct tree-wide commits.  At that time it outputs git-fast-import
>>>> data (to a second file) defining the git commits and their ancestry.
>>>> The contents are defined by referring to the marks of blobs from the
>>>> first git-fast-import stream file.
>>>>
>>>> This strategy speeds up the conversion *enormously*.
>>>>
>>>> So if I understand correctly that you are proposing to stop allowing
>>>> marks on blob objects to be set and/or referred to later, then I object
>>>> vociferously.
>>>
>>> The proposed patch wants to stop writing marks (in --export-marks) for
>>> anything but commits.  Does cvs2git depend on that?  I.e., are you using
>>> two separate fast-import processes for the blob and tree/commit phases
>>> you describe above?
>>
>> Yes, it can be handy to start loading the first "blobfile" in parallel
>> with the later stages of the conversion, before the second "dumpfile" is
>> ready.  In that case the user needs to pass --export-marks to the first
>> fast-import process to export marks on blobs so that the marks can be
>> passed to the second fast-import via --import-marks.
> 
> It can be used that way, but it doesn't have to be.

Please see my other email for an explanation of how the flexibility of
loading the two files separately can bring concrete benefits.

>> So the proposed change would break a documented use of cvs2git.
> 
> It's documented as an alternative. How many people actually use this
> form over the other? Is there really any advantage? It's possibly that
> basically nobody is using this form, and there's no benefits.

You conjectured earlier that nobody uses blob marks, and I provided a
counterexample.  Then you proposed a workaround that would require
changes to the cvs2git documentation, and I even explained how your
proposed workaround is not as flexible as the status quo.  Do you want
to go through the same argument with every possible user of git-fast-import?

It would be insanity to change the default behavior when a workaround on
the Git side (namely adding an option that tells git-fast-import *not*
to emit blob marks) would be quite straightforward to implement and have
little maintenance cost.

>> Making the export of blob marks optional would of course be OK, as long
>> as the default is to export them.
> 
> Nobody benefits from leaving the default as it is.

Sure they do.  Any tool that *knows* that it doesn't need blob marks can
pass the new option and benefit.  Other tools benefit from not being
broken by your change.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  3:14                   ` Michael Haggerty
@ 2013-05-07  4:32                     ` Johannes Schindelin
  2013-05-07  4:36                       ` Felipe Contreras
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Schindelin @ 2013-05-07  4:32 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Felipe Contreras, Junio C Hamano, Thomas Rast, git, Antoine Pelisse

Hi Michael,

On Tue, 7 May 2013, Michael Haggerty wrote:

> I knew about the "type" command but I was under the impression that it
> is intended for text files and can corrupt binary files.  Are you sure
> that using "type" as you suggest is binary-clean?

"type" is not binary-clean. At least on some Windows versions, "type" also
has a limit on file size.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  4:32                     ` Johannes Schindelin
@ 2013-05-07  4:36                       ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-07  4:36 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Michael Haggerty, Junio C Hamano, Thomas Rast, git, Antoine Pelisse

On Mon, May 6, 2013 at 11:32 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Michael,
>
> On Tue, 7 May 2013, Michael Haggerty wrote:
>
>> I knew about the "type" command but I was under the impression that it
>> is intended for text files and can corrupt binary files.  Are you sure
>> that using "type" as you suggest is binary-clean?
>
> "type" is not binary-clean. At least on some Windows versions, "type" also
> has a limit on file size.

copy /b file1+file2 destfile

Then.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  2:58                 ` Michael Haggerty
@ 2013-05-07  4:37                   ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-07  4:37 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Junio C Hamano, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On Mon, May 6, 2013 at 9:58 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 05/06/2013 11:19 PM, Felipe Contreras wrote:
>> On Mon, May 6, 2013 at 10:18 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>>>
>>>> Yes, it can be handy to start loading the first "blobfile" in parallel
>>>> with the later stages of the conversion, before the second "dumpfile" is
>>>> ready.  In that case the user needs to pass --export-marks to the first
>>>> fast-import process to export marks on blobs so that the marks can be
>>>> passed to the second fast-import via --import-marks.
>>>>
>>>> So the proposed change would break a documented use of cvs2git.
>>>>
>>>> Making the export of blob marks optional would of course be OK, as long
>>>> as the default is to export them.
>>>
>>> Thanks for a concise summary.  Your use case fits exactly what
>>> Felipe conjectured as the nonexistent minority.
>>
>> Not true. cvs2git does *not* rely on the blobs being stored in a marks
>> file, because cvs2git does not rely on mark files at all.
>>
>>> An option that lets the caller say "I only care about marks on these
>>> types of objects to be written to (and read from) the exported marks
>>> file" would help Felipe's use case without harming your use case,
>>> and would be a sane and safe way to go.
>>
>> His case is not harmed at all. It's only the unfortunate command that
>> is mentioned in the documentation that didn't need to be mentioned at
>> all in the first place.
>>
>> It should be the other way around, if it's only this documentation
>> that is affected, we could add a switch for that particular command,
>> and the documentation should be updated, but it's overkill to add a
>> switch for one odd command in some documentation somewhere, it would
>> be much better to update the odd command to avoid using marks at all,
>> which is what the more appropriate command does, right below in the
>> same documentation.
>>
>>   cat ../cvs2svn-tmp/git-blob.dat ../cvs2svn-tmp/git-dump.dat | git fast-import
>>
>> Should the rest of the real world be punished because somebody added a
>> command in some documentation somewhere, which wasn't actually needed
>> in the first place?
>
> Don't get too fixated on the documentation.  The documentation just
> gives some examples of how cvs2git can be used.
>
> The reason that cvs2git outputs two files is that the first file is
> emitted at the very beginning of the conversion and the second at the
> very end.  These conversions can take a long time (> 1 day for very big
> repos), can be interrupted and restarted between "passes", and passes
> can even be re-run with changed configurations.
>
> CVS write access has to be turned off before the start of the final
> conversion, so no VCS is possible until the conversion is over.  So
> users are very interested in keeping the downtime minimal.

Of course VCS is possible before the conversion is over, you can do
all the cvs2git commands first, turn on CVS, and let 'git fast-import'
do it's thing afterwards, can you not?

> The blobfile
> can also be unwieldy (its size is approximately the sum of the sizes of
> all revisions of all files in the project).  Being able to load the
> blobfile into one fast-import process and the dumpfile into a different
> process (which relies on the feature that you propose removing) opens up
> a lot of possibilities:
>
> * The first fast-import of the blobfile can be started as soon as the
> blobfile is complete and run in parallel with the rest of the conversion.

Is that reason enough to punish everyone else?

> * If the blobfile needs to be transferred over the network (e.g.,
> because Git will be served from a different server than the one doing
> the conversion) the network transfer can also be done in parallel with
> the rest of the conversion.

Ditto.

> * The blobfile could be written to a named pipe that is being read by a
> git-fast-import process, to avoid having to write the blobfile to disk
> in the first place.

And it would still be possible.

{ cat file1; cat file2; } > pipe

What am I missing?

> * The user could run "git repack" between loading the blobfile and
> loading the dumpfile.

Is that reason enough to punish everyone else?

> These are just the ways that cvs2git does and/or could benefit from the
> flexibility that is now in git-fast-import.

Sure, all those things _migt_ be nice, but nothing would be broken,
would it? cvs2git can still achieve it's goal without it.

And if this "feature" is removed by *default*, it would be trivial to
enable an option to tell 'git fast-import' to store blobs from
cvs2git, would it not?

> Other tools might also be
> using git-fast-import in ways that would be broken by your proposed change.

Yeah, everything is possible, but the key word is *might*. Will
anything be *actually* broken by this change? My guess is, very few
things, if any.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  3:27               ` Michael Haggerty
@ 2013-05-07  4:39                 ` Johannes Schindelin
  2013-05-07  4:49                   ` Felipe Contreras
  2013-05-07  4:47                 ` Felipe Contreras
  2013-05-07  7:12                 ` Junio C Hamano
  2 siblings, 1 reply; 38+ messages in thread
From: Johannes Schindelin @ 2013-05-07  4:39 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Felipe Contreras, Thomas Rast, git, Junio C Hamano, Antoine Pelisse

Hi,

On Tue, 7 May 2013, Michael Haggerty wrote:

> On 05/06/2013 11:04 PM, Felipe Contreras wrote:
> > On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> >> On 05/06/2013 12:32 PM, Thomas Rast wrote:

> >> So the proposed change would break a documented use of cvs2git.
> > 
> > It's documented as an alternative. How many people actually use this
> > form over the other? Is there really any advantage? It's possibly that
> > basically nobody is using this form, and there's no benefits.
> 
> You conjectured earlier that nobody uses blob marks, and I provided a
> counterexample.  Then you proposed a workaround that would require
> changes to the cvs2git documentation, and I even explained how your
> proposed workaround is not as flexible as the status quo.  Do you want
> to go through the same argument with every possible user of
> git-fast-import?
> 
> It would be insanity to change the default behavior when a workaround on
> the Git side (namely adding an option that tells git-fast-import *not*
> to emit blob marks) would be quite straightforward to implement and have
> little maintenance cost.

I really wonder how many more counterexamples are required to settle this
discussion.

There is no good reason to artificially limit Git's capabilities here,
especially when it has been demonstrated that supporting that capability
is not only possible, but also outright easy.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  3:27               ` Michael Haggerty
  2013-05-07  4:39                 ` Johannes Schindelin
@ 2013-05-07  4:47                 ` Felipe Contreras
  2013-05-07  6:47                   ` Michael Haggerty
  2013-05-07  7:12                 ` Junio C Hamano
  2 siblings, 1 reply; 38+ messages in thread
From: Felipe Contreras @ 2013-05-07  4:47 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On Mon, May 6, 2013 at 10:27 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:

> You conjectured earlier that nobody uses blob marks, and I provided a
> counterexample.  Then you proposed a workaround that would require
> changes to the cvs2git documentation, and I even explained how your
> proposed workaround is not as flexible as the status quo.

cvs2git does *not* need blob marks, it does not need marks at all.

The use-case that you mentioned has nothing to do with cvs2git, in
fact. I can be described as this:

% ./generate-blobs > blobs
% git fast-import --export-marks=marks < blobs
% ./generate-commits > commits
% git fast-import --import-marks=marks < commits

In this example 'generate-commits' has no notion of marks at all, and
'git fast-import' doesn't need marks to process both blobs and
commits.

> Do you want
> to go through the same argument with every possible user of git-fast-import?

I don't care about possible users, I care about *real* users.

> It would be insanity to change the default behavior when a workaround on
> the Git side (namely adding an option that tells git-fast-import *not*
> to emit blob marks) would be quite straightforward to implement and have
> little maintenance cost.

And nobody would benefit from that.

>>> Making the export of blob marks optional would of course be OK, as long
>>> as the default is to export them.
>>
>> Nobody benefits from leaving the default as it is.
>
> Sure they do.  Any tool that *knows* that it doesn't need blob marks can
> pass the new option and benefit.  Other tools benefit from not being
> broken by your change.

Which the *vastly* more common case? That blobs are needed, or that
they are not?

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  4:39                 ` Johannes Schindelin
@ 2013-05-07  4:49                   ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-07  4:49 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Michael Haggerty, Thomas Rast, git, Junio C Hamano, Antoine Pelisse

On Mon, May 6, 2013 at 11:39 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Tue, 7 May 2013, Michael Haggerty wrote:
>
>> On 05/06/2013 11:04 PM, Felipe Contreras wrote:
>> > On Mon, May 6, 2013 at 5:45 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> >> On 05/06/2013 12:32 PM, Thomas Rast wrote:
>
>> >> So the proposed change would break a documented use of cvs2git.
>> >
>> > It's documented as an alternative. How many people actually use this
>> > form over the other? Is there really any advantage? It's possibly that
>> > basically nobody is using this form, and there's no benefits.
>>
>> You conjectured earlier that nobody uses blob marks, and I provided a
>> counterexample.  Then you proposed a workaround that would require
>> changes to the cvs2git documentation, and I even explained how your
>> proposed workaround is not as flexible as the status quo.  Do you want
>> to go through the same argument with every possible user of
>> git-fast-import?
>>
>> It would be insanity to change the default behavior when a workaround on
>> the Git side (namely adding an option that tells git-fast-import *not*
>> to emit blob marks) would be quite straightforward to implement and have
>> little maintenance cost.
>
> I really wonder how many more counterexamples are required to settle this
> discussion.

One. I haven't seen one use-case that *requires* blob marks, I haven't
seen one tool that utilizes blob marks. And no, cvs2git doesn't use
marks at all.

> There is no good reason to artificially limit Git's capabilities here,
> especially when it has been demonstrated that supporting that capability
> is not only possible, but also outright easy.

Strawman. Nobody is arguing that there shouldn't be an option to
enable blob exporting, the argument is that there's no point in making
that the *default*.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  4:47                 ` Felipe Contreras
@ 2013-05-07  6:47                   ` Michael Haggerty
  2013-05-07  7:07                     ` Felipe Contreras
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Haggerty @ 2013-05-07  6:47 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On 05/07/2013 06:47 AM, Felipe Contreras wrote:
> On Mon, May 6, 2013 at 10:27 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> 
>> You conjectured earlier that nobody uses blob marks, and I provided a
>> counterexample.  Then you proposed a workaround that would require
>> changes to the cvs2git documentation, and I even explained how your
>> proposed workaround is not as flexible as the status quo.
> 
> cvs2git does *not* need blob marks, it does not need marks at all.
> 
> The use-case that you mentioned has nothing to do with cvs2git, in
> fact. I can be described as this:
> 
> % ./generate-blobs > blobs
> % git fast-import --export-marks=marks < blobs
> % ./generate-commits > commits
> % git fast-import --import-marks=marks < commits
> 
> In this example 'generate-commits' has no notion of marks at all, and
> 'git fast-import' doesn't need marks to process both blobs and
> commits.

The "generate-blobs" program generates a mark for each blob and
remembers the marks in a database.  The "generate-commits" program reads
the marks from the database and incorporates them in the definitions of
the commits that it writes to its output.  So yes, the program pair
*does* rely on marks for blobs being exported correctly.

I've tired of this discussion.  I am quite sure that your change will
not be accepted, so I see no need to participate further.  Please do not
interpret my silence as agreement with your quarrelsome arguments.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  6:47                   ` Michael Haggerty
@ 2013-05-07  7:07                     ` Felipe Contreras
  0 siblings, 0 replies; 38+ messages in thread
From: Felipe Contreras @ 2013-05-07  7:07 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Rast, git, Junio C Hamano, Antoine Pelisse, Johannes Schindelin

On Tue, May 7, 2013 at 1:47 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 05/07/2013 06:47 AM, Felipe Contreras wrote:
>> On Mon, May 6, 2013 at 10:27 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>
>>> You conjectured earlier that nobody uses blob marks, and I provided a
>>> counterexample.  Then you proposed a workaround that would require
>>> changes to the cvs2git documentation, and I even explained how your
>>> proposed workaround is not as flexible as the status quo.
>>
>> cvs2git does *not* need blob marks, it does not need marks at all.
>>
>> The use-case that you mentioned has nothing to do with cvs2git, in
>> fact. I can be described as this:
>>
>> % ./generate-blobs > blobs
>> % git fast-import --export-marks=marks < blobs
>> % ./generate-commits > commits
>> % git fast-import --import-marks=marks < commits
>>
>> In this example 'generate-commits' has no notion of marks at all, and
>> 'git fast-import' doesn't need marks to process both blobs and
>> commits.
>
> The "generate-blobs" program generates a mark for each blob and
> remembers the marks in a database.  The "generate-commits" program reads
> the marks from the database and incorporates them in the definitions of
> the commits that it writes to its output.  So yes, the program pair
> *does* rely on marks for blobs being exported correctly.

Fine:

% ./generate-data > data
% ./split-blobs data > blobs
% ./split-commits data > commits

How exactly the files 'blobs' and 'commits' are generated is
irrelevant, 'git fast-import' will need them *once*, and never again,
in fact, doesn't even need to know they are two files. There's no need
for marks.

> I've tired of this discussion.  I am quite sure that your change will
> not be accepted, so I see no need to participate further.  Please do not
> interpret my silence as agreement with your quarrelsome arguments.

How convenient. I ask the though questions, and you suddenly get tired
of the discussion. So, let me answer for you:

* Which the *vastly* more common case? That blobs are needed, or that
they are not?

That they are not. To suggest otherwise would be ridiculous.

And of course this patch will not be accepted, because nobody is
interested in improving things in the most easy and sensible way.

Yours and everybody else's argument is one and the same: status quo,
which is of course, not an argument at all.

But it's pointless to try to convince you, because a) you are not
interested in changing your mind b) you are not interested in seeking
the most sensible solution c) you are only interested in doing what
you are used to do, which is to not change anything, ever d) you know
making this the default is only slightly risky, at best, and not only
the world is not going to end, but most likely nobody would even
notice, and the ones that would, are probably already participating in
this conversation, but you don't even want to do something slightly
risky, because it would show that others were right, and your concerns
don't actually affect any real users at all.

But as I said, drop the patch. Who needs progress?

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  3:27               ` Michael Haggerty
  2013-05-07  4:39                 ` Johannes Schindelin
  2013-05-07  4:47                 ` Felipe Contreras
@ 2013-05-07  7:12                 ` Junio C Hamano
  2013-05-07  7:34                   ` Michael Haggerty
  2 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2013-05-07  7:12 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Felipe Contreras, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

Michael Haggerty <mhagger@alum.mit.edu> writes:

>>>>> CVS stores all of the revisions of a single file in a single filename,v
>>>>> file in rcsfile(5) format.  The revisions are stored as deltas ordered
>>>>> so that a single revision can be reconstructed from a single serial read
>>>>> of the file.
>>>>>
>>>>> cvs2git reads each of these files once, reconstructing *all* of the
>>>>> revisions for a file in a single go.  It then pours them into a
>>>>> git-fast-import stream as blobs and sets a mark on each blob.

This is more or less off-topic but in the bigger picture it is more
interesting and important X-<.

The way you describe how cvs2git handles the blobs is the more
efficient way, given that fast-import does not even attempt to
bother to create good deltas. The only thing it does is to see if
the current data deltifies against the last object.

IIRC, CVS's backend storage is mostly recorded in backward delta, so
if you are feeding the blob data from new to old, then the resulting
pack would follow Linus's law (the file generally grows over time)
and would generally give you a good deltified chain of objects.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] fast-import: only store commit objects
  2013-05-07  7:12                 ` Junio C Hamano
@ 2013-05-07  7:34                   ` Michael Haggerty
  0 siblings, 0 replies; 38+ messages in thread
From: Michael Haggerty @ 2013-05-07  7:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Felipe Contreras, Thomas Rast, git, Antoine Pelisse, Johannes Schindelin

On 05/07/2013 09:12 AM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>>>>>> CVS stores all of the revisions of a single file in a single filename,v
>>>>>> file in rcsfile(5) format.  The revisions are stored as deltas ordered
>>>>>> so that a single revision can be reconstructed from a single serial read
>>>>>> of the file.
>>>>>>
>>>>>> cvs2git reads each of these files once, reconstructing *all* of the
>>>>>> revisions for a file in a single go.  It then pours them into a
>>>>>> git-fast-import stream as blobs and sets a mark on each blob.
> 
> This is more or less off-topic but in the bigger picture it is more
> interesting and important X-<.
> 
> The way you describe how cvs2git handles the blobs is the more
> efficient way, given that fast-import does not even attempt to
> bother to create good deltas. The only thing it does is to see if
> the current data deltifies against the last object.
> 
> IIRC, CVS's backend storage is mostly recorded in backward delta, so
> if you are feeding the blob data from new to old, then the resulting
> pack would follow Linus's law (the file generally grows over time)
> and would generally give you a good deltified chain of objects.

Yes, you are correct about how CVS orders commits on the mainline.
Branches are stored in the opposite order--oldest to newest--but CVS
users don't tend to get carried away with branches anyway, and if the
changes are small deltafication should help a lot anyway.

Cool.  I knew that fast-import didn't do much in the way of compression,
but I didn't realize that it can compute deltas only between adjacent
blobs.  So cvs2git happily hits the sweet-spot of fast-import.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-05-07  7:34 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-03  4:31 [PATCH 0/4] fast-export: speed improvements Felipe Contreras
2013-05-03  4:31 ` [PATCH 1/4] fast-{import,export}: use get_sha1_hex() directly Felipe Contreras
2013-05-03 21:50   ` Junio C Hamano
2013-05-03  4:31 ` [PATCH 2/4] fast-export: improve speed by skipping blobs Felipe Contreras
2013-05-03 21:51   ` Junio C Hamano
2013-05-03  4:31 ` [PATCH 3/4] fast-export: don't parse all the commits Felipe Contreras
2013-05-03 21:54   ` Junio C Hamano
2013-05-04  0:06     ` Felipe Contreras
2013-05-04 19:22       ` Junio C Hamano
2013-05-03  4:31 ` [PATCH 4/4] fast-import: only store commit objects Felipe Contreras
2013-05-03 17:56   ` Thomas Rast
2013-05-03 18:23     ` Felipe Contreras
2013-05-06 10:28       ` Michael Haggerty
2013-05-06 10:32         ` Thomas Rast
2013-05-06 10:45           ` Michael Haggerty
2013-05-06 15:18             ` Junio C Hamano
2013-05-06 21:19               ` Felipe Contreras
2013-05-06 21:36                 ` Felipe Contreras
2013-05-07  3:14                   ` Michael Haggerty
2013-05-07  4:32                     ` Johannes Schindelin
2013-05-07  4:36                       ` Felipe Contreras
2013-05-07  2:58                 ` Michael Haggerty
2013-05-07  4:37                   ` Felipe Contreras
2013-05-06 21:04             ` Felipe Contreras
2013-05-07  3:27               ` Michael Haggerty
2013-05-07  4:39                 ` Johannes Schindelin
2013-05-07  4:49                   ` Felipe Contreras
2013-05-07  4:47                 ` Felipe Contreras
2013-05-07  6:47                   ` Michael Haggerty
2013-05-07  7:07                     ` Felipe Contreras
2013-05-07  7:12                 ` Junio C Hamano
2013-05-07  7:34                   ` Michael Haggerty
2013-05-06 12:20           ` Johannes Schindelin
2013-05-06 21:06             ` Felipe Contreras
2013-05-03 22:08     ` Junio C Hamano
2013-05-03 22:19       ` Felipe Contreras
2013-05-03 23:45         ` Junio C Hamano
2013-05-04  0:01           ` Felipe Contreras

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.