git.vger.kernel.org archive mirror
* [PATCH 0/6] fast-import: tighten parsing of paths
@ 2024-03-22  0:03 Thalia Archibald
  2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
                   ` (6 more replies)
  0 siblings, 7 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren

fast-import has subtle differences in how it parses file paths between each
occurrence of <path> in the grammar. Many errors are suppressed or not checked,
which could lead to silent data corruption. A particularly bad case is when a
front-end sends escapes that Git doesn't recognize (e.g., hex escapes, which
are not supported): the path is then treated as literal bytes instead of as a
quoted string.

Bring path parsing into line with the documented behavior and improve
documentation to fill in missing details.

This patch series is patterned after 06454cb9a3 (fast-import: tighten parsing of
datarefs, 2012-04-07), which did similar fixes across the grammar, but for
marks.

This is my first contribution to Git, so please let me know if there's something
I've missed. I'm working on a tool for advanced repo transformations (like a
union of filter-repo and Reposurgeon workflows), so I've been living in
fast-import code and I have more parsing fixes planned.

Thalia




* [PATCH 1/6] fast-import: tighten parsing of paths
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-03-22  0:11   ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  2024-03-22  0:03 ` [PATCH 2/6] fast-import: directly use strbufs for paths Thalia Archibald
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren, Thalia Archibald

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed.

`<path>` appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   If `<path>` is a valid quoted string, unquote it;
   otherwise, treat it as literal bytes (including any number of SP).
2. For filecopy (source) and filerename (source):
   If `<path>` is a valid quoted string, unquote it;
   otherwise, treat it as literal bytes until the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is an error.
4. For ls:
   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes (including any number of SP).
5. For ls-commit:
   Unquote `<path>` and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine and could silently introduce escapes that would
be incorrectly parsed due to this.
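
The difference between the suppressed-error fallback and a strict parse
can be sketched in isolation. This is a minimal model, not the code in
fast-import.c: `toy_unquote`, `parse_old_style`, and `parse_new_style`
are hypothetical names, and `toy_unquote` is an assumed, simplified
stand-in for `unquote_c_style` that knows only a few escapes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy stand-in for Git's unquote_c_style() (an assumption, not the real
 * code): decode a double-quoted string into out[cap]. Only \\, \", \n, \t
 * and 3-digit octal escapes are recognized. Returns the decoded length or
 * -1 on error; on success *end points just past the closing quote.
 */
static int toy_unquote(char *out, size_t cap, const char *in, const char **end)
{
	size_t n = 0;

	if (*in++ != '"')
		return -1;
	while (*in != '"') {
		int c = (unsigned char)*in++;

		if (!c)
			return -1;	/* unterminated string */
		if (c == '\\') {
			c = (unsigned char)*in++;
			switch (c) {
			case '\\': case '"': break;
			case 'n': c = '\n'; break;
			case 't': c = '\t'; break;
			case '0': case '1': case '2': case '3':
				/* two more octal digits must follow */
				if (in[0] < '0' || in[0] > '7' ||
				    in[1] < '0' || in[1] > '7')
					return -1;
				c = ((c - '0') << 6) | ((in[0] - '0') << 3) |
				    (in[1] - '0');
				in += 2;
				break;
			default:
				return -1;	/* e.g. \x is not recognized */
			}
		}
		if (n + 1 >= cap)
			return -1;	/* keep room for the trailing NUL */
		out[n++] = (char)c;
	}
	out[n] = '\0';
	*end = in + 1;
	return (int)n;
}

/* Model of cases 1-3: an unquoting error silently falls back to literal bytes. */
static int parse_old_style(char *out, size_t cap, const char *p)
{
	const char *end;
	int n = toy_unquote(out, cap, p, &end);
	size_t len;

	if (n >= 0)
		return *end ? -1 : n;	/* garbage after the quoted string */
	len = strlen(p);
	if (len >= cap)
		return -1;
	memcpy(out, p, len + 1);	/* the whole field, quotes included */
	return (int)len;
}

/* Model of the tightened rule: a leading '"' commits to the quoted form. */
static int parse_new_style(char *out, size_t cap, const char *p)
{
	const char *end;
	size_t len;

	if (*p == '"') {
		int n = toy_unquote(out, cap, p, &end);
		return (n < 0 || *end) ? -1 : n;
	}
	len = strlen(p);
	if (!len || len >= cap)
		return -1;	/* an unquoted empty path is an error */
	memcpy(out, p, len + 1);
	return (int)len;
}
```

Under these assumptions, feeding the six bytes `"\xff"` to
`parse_old_style` yields a six-byte literal path, while
`parse_new_style` rejects the same input.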

The documentation states that “To use a source path that contains SP the
path must be quoted.”, which implies that a path in the final position
may contain SP without quoting, and some front-ends presumably depend on
that. That leaves two documented ways to parse a path, so simplify the
implementation to those two.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes (including any number of SP).
   Garbage after a quoted string and an unquoted empty string are both errors.
   (In ls-commit, it must be quoted to disambiguate from ls.)

2. `parse_path_space` for filecopy (source) and filerename (source):

   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes until the next SP.
   It must be followed by a SP. An unquoted empty string is an error.
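
As an illustration of how the two rules combine for an unquoted
filecopy or filerename line, here is a small standalone sketch. The
helper name `split_unquoted_cr` is hypothetical and it deliberately
ignores quoted paths; it only shows that the source ends at the first
SP while the dest keeps every remaining SP.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of fast-import.c): split the arguments of
 * an unquoted filecopy/filerename command, i.e. the text after "C " or
 * "R ". Quoted paths are not handled here.
 * Returns 0 on success, or -1 if a field is empty, the separating SP is
 * missing, or a field does not fit in cap bytes.
 */
static int split_unquoted_cr(const char *args, char *src, char *dst, size_t cap)
{
	const char *sp = strchr(args, ' ');	/* source ends at the first SP */
	size_t src_len, dst_len;

	if (!sp)
		return -1;	/* corresponds to "Missing space after source" */
	src_len = (size_t)(sp - args);
	if (!src_len)
		return -1;	/* corresponds to "Missing source" */
	dst_len = strlen(sp + 1);	/* dest runs to end of line, SP included */
	if (!dst_len)
		return -1;	/* corresponds to "Missing dest" */
	if (src_len >= cap || dst_len >= cap)
		return -1;
	memcpy(src, args, src_len);
	src[src_len] = '\0';
	memcpy(dst, sp + 1, dst_len + 1);
	return 0;
}
```

With this model, the arguments `a b c.txt` split into source `a` and
dest `b c.txt`, whereas `hello.c` alone is rejected because no SP
follows the source.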

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt |   3 +-
 builtin/fast-import.c             | 115 ++++++++------
 t/t9300-fast-import.sh            | 252 +++++++++++++++++++++++++++++-
 3 files changed, 316 insertions(+), 54 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..271bd63a10 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -649,7 +649,8 @@ The value of `<path>` must be in canonical form. That is it must not:
 * contain the special component `.` or `..` (e.g. `foo/./bar` and
   `foo/../bar` are invalid).
 
-The root of the tree can be represented by an empty string as `<path>`.
+The root of the tree can be represented by a quoted empty string (`""`)
+as `<path>`.
 
 It is recommended that `<path>` always be encoded using UTF-8.
 
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 71a195ca22..b2adec8d9a 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2224,7 +2224,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2257,8 +2257,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
@@ -2272,10 +2272,57 @@ static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. It may be quoted with escape sequences
+ * or unquoted without escape sequences. When unquoted, it may only contain a
+ * space if `allow_spaces` is nonzero.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
+{
+	strbuf_reset(sb);
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		if (allow_spaces)
+			*endp = p + strlen(p);
+		else
+			*endp = strchrnul(p, ' ');
+		if (*endp == p)
+			die("Missing %s: %s", field, command_buf.buf);
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. It may contain spaces even when unquoted.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * It may not contain spaces when unquoted. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2312,12 +2359,8 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2381,48 +2424,23 @@ static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
-	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
+	parse_path_eol(&d_uq, p, "dest");
 	s = s_uq.buf;
-
-	endp++;
-	if (!*endp)
-		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
-	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3168,6 +3186,7 @@ static void parse_ls(const char *p, struct branch *b)
 {
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
+	static struct strbuf uq = STRBUF_INIT;
 
 	/* ls SP (<tree-ish> SP)? <path> */
 	if (*p == '"') {
@@ -3182,16 +3201,8 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index dbb5042b0b..ef04b43f46 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2146,6 +2146,7 @@ test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2794,7 +2795,7 @@ test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3064,6 +3065,255 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <tree-ish> has a
+# special case. Test every occurrence of <path> in the grammar against every
+# error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 '"$path"'
+
+		commit refs/heads/path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D '"$path"'
+
+		commit refs/heads/path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c '"$path"'
+
+		commit refs/heads/path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c '"$path"'
+
+		ls :305 '"$path"'
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp &&
+
+		git branch -D path-eol
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 '"$path"'
+
+		commit refs/heads/path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C '"$path"' hello2.c
+
+		commit refs/heads/path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R '"$path"' hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		git branch -D path-space
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	what="$1" path="$2" err_grep="$3"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		'"$prefix$path$suffix"'
+		EOF
+
+		test_grep '"'$err_grep'"' err
+	'
+}
+
+test_path_base_fail () {
+	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
+	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	test_path_base_fail
+	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
+	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
+}
+test_path_eol_fail () {
+	test_path_eol_quoted_fail
+	test_path_fail 'empty unquoted path' '' "Missing $field"
+}
+test_path_space_fail () {
+	test_path_base_fail
+	test_path_fail 'empty unquoted path' '' "Missing $field"
+	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
+}
+
+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
+
+# When 'ls' has no <tree-ish>, the <path> must be quoted.
+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
+test_path_eol_quoted_fail &&
+test_path_fail 'empty unquoted path' '' "Invalid dataref"
+
 ###
 ### series T (ls)
 ###
-- 
2.44.0




* [PATCH 2/6] fast-import: directly use strbufs for paths
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
  2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  2024-03-22  0:03 ` [PATCH 3/6] fast-import: release unfreed strbufs Thalia Archibald
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren, Thalia Archibald

Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.

Directly use the strbufs now instead of an alias.

Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.

Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 54 ++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 32 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index b2adec8d9a..1b3d6784c1 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2322,7 +2322,7 @@ static void parse_path_space(struct strbuf *sb, const char *p, const char **endp
 
 static void file_change_m(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2359,12 +2359,11 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
+	parse_path_eol(&path, p, "path");
 
 	/* Git does not track empty, non-toplevel directories. */
-	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
-		tree_content_remove(&b->branch_tree, p, NULL, 0);
+	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
+		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
 		return;
 	}
 
@@ -2385,10 +2384,6 @@ static void file_change_m(const char *p, struct branch *b)
 		if (S_ISDIR(mode))
 			die("Directories cannot be specified 'inline': %s",
 				command_buf.buf);
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		while (read_next_command() != EOF) {
 			const char *v;
 			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
@@ -2414,49 +2409,45 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*p) {
+	if (!*path.buf) {
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
 		return;
 	}
-	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
+	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
 }
 
 static void file_change_d(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_remove(&b->branch_tree, p, NULL, 1);
+	parse_path_eol(&path, p, "path");
+	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *s, *d;
-	static struct strbuf s_uq = STRBUF_INIT;
-	static struct strbuf d_uq = STRBUF_INIT;
+	static struct strbuf source = STRBUF_INIT;
+	static struct strbuf dest = STRBUF_INIT;
 	struct tree_entry leaf;
 
-	parse_path_space(&s_uq, p, &p, "source");
-	parse_path_eol(&d_uq, p, "dest");
-	s = s_uq.buf;
-	d = d_uq.buf;
+	parse_path_space(&source, p, &p, "source");
+	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
-		tree_content_remove(&b->branch_tree, s, &leaf, 1);
+		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
 	else
-		tree_content_get(&b->branch_tree, s, &leaf, 1);
+		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
-		die("Path %s not in branch", s);
-	if (!*d) {	/* C "path/to/subdir" "" */
+		die("Path %s not in branch", source.buf);
+	if (!*dest.buf) {	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
 		return;
 	}
-	tree_content_set(&b->branch_tree, d,
+	tree_content_set(&b->branch_tree, dest.buf,
 		&leaf.versions[1].oid,
 		leaf.versions[1].mode,
 		leaf.tree);
@@ -3186,7 +3177,7 @@ static void parse_ls(const char *p, struct branch *b)
 {
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
 	/* ls SP (<tree-ish> SP)? <path> */
 	if (*p == '"') {
@@ -3201,9 +3192,8 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_get(root, p, &leaf, 1);
+	parse_path_eol(&path, p, "path");
+	tree_content_get(root, path.buf, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
 	 * until it is saved.  Save, for simplicity.
@@ -3211,7 +3201,7 @@ static void parse_ls(const char *p, struct branch *b)
 	if (S_ISDIR(leaf.versions[1].mode))
 		store_tree(&leaf);
 
-	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
+	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
 	if (leaf.tree)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
-- 
2.44.0




* [PATCH 3/6] fast-import: release unfreed strbufs
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
  2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
  2024-03-22  0:03 ` [PATCH 2/6] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  2024-03-22  0:03 ` [PATCH 4/6] fast-import: remove dead strbuf Thalia Archibald
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren, Thalia Archibald

These strbufs are owned. Release them at the end of their scopes.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 1b3d6784c1..d6f998f363 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2364,6 +2364,7 @@ static void file_change_m(const char *p, struct branch *b)
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
 		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
+		strbuf_release(&path);
 		return;
 	}
 
@@ -2409,11 +2410,11 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*path.buf) {
+	if (*path.buf)
+		tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
+	else
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
-		return;
-	}
-	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
+	strbuf_release(&path);
 }
 
 static void file_change_d(const char *p, struct branch *b)
@@ -2422,6 +2423,7 @@ static void file_change_d(const char *p, struct branch *b)
 
 	parse_path_eol(&path, p, "path");
 	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
+	strbuf_release(&path);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
@@ -2440,17 +2442,18 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
 		die("Path %s not in branch", source.buf);
-	if (!*dest.buf) {	/* C "path/to/subdir" "" */
+	if (*dest.buf)
+		tree_content_set(&b->branch_tree, dest.buf,
+			&leaf.versions[1].oid,
+			leaf.versions[1].mode,
+			leaf.tree);
+	else	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
-		return;
-	}
-	tree_content_set(&b->branch_tree, dest.buf,
-		&leaf.versions[1].oid,
-		leaf.versions[1].mode,
-		leaf.tree);
+	strbuf_release(&source);
+	strbuf_release(&dest);
 }
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
@@ -2804,6 +2807,7 @@ static void parse_new_commit(const char *arg)
 	free(author);
 	free(committer);
 	free(encoding);
+	strbuf_release(&msg);
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;
@@ -2886,6 +2890,7 @@ static void parse_new_tag(const char *arg)
 	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(tagger);
+	strbuf_release(&msg);
 
 	if (store_object(OBJ_TAG, &new_data, NULL, &t->oid, next_mark))
 		t->pack_id = MAX_PACK_ID;
@@ -3171,6 +3176,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 		strbuf_addch(&line, '\n');
 	}
 	cat_blob_write(line.buf, line.len);
+	strbuf_release(&line);
 }
 
 static void parse_ls(const char *p, struct branch *b)
@@ -3206,6 +3212,7 @@ static void parse_ls(const char *p, struct branch *b)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
 		release_tree_entry(root);
+	strbuf_release(&path);
 }
 
 static void checkpoint(void)
-- 
2.44.0




* [PATCH 4/6] fast-import: remove dead strbuf
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
                   ` (2 preceding siblings ...)
  2024-03-22  0:03 ` [PATCH 3/6] fast-import: release unfreed strbufs Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  2024-03-22  0:03 ` [PATCH 5/6] fast-import: document C-style escapes for paths Thalia Archibald
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren, Thalia Archibald

The strbuf in `note_change_n` has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index d6f998f363..ae8494d0ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2458,7 +2458,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
 {
-	static struct strbuf uq = STRBUF_INIT;
 	struct object_entry *oe;
 	struct branch *s;
 	struct object_id oid, commit_oid;
@@ -2523,10 +2522,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
 		die("Invalid ref name or SHA1 expression: %s", p);
 
 	if (inline_data) {
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		read_next_command();
 		parse_and_store_blob(&last_blob, &oid, 0);
 	} else if (oe) {
-- 
2.44.0




* [PATCH 5/6] fast-import: document C-style escapes for paths
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
                   ` (3 preceding siblings ...)
  2024-03-22  0:03 ` [PATCH 4/6] fast-import: remove dead strbuf Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  2024-03-22  0:03 ` [PATCH 6/6] fast-import: forbid escaped NUL in paths Thalia Archibald
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
  6 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC
  To: git; +Cc: Elijah Newren, Thalia Archibald

Simply saying “C-style” string quoting is imprecise, as only a subset of
C escapes are supported. Document the exact escapes.
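
For front-end authors, one conservative way to stay inside the
documented subset is to emit only `\\`, `\"`, `\n`, and 3-digit octal
codes. The following is a sketch under that assumption; `quote_path`
is a hypothetical helper, not an existing Git function. It refuses NUL
bytes, since the last patch in this series forbids escaped NUL in
paths.

```c
#include <stddef.h>

/*
 * Hypothetical front-end helper: write path[0..len) to out[cap] as a
 * quoted <path>. Backslash, double quote and LF get their backslash
 * escapes; other control bytes and bytes >= 0x7f are written as 3-digit
 * octal. Returns the number of bytes written (excluding the trailing
 * NUL), or -1 if the buffer is too small or the path contains NUL.
 */
static int quote_path(char *out, size_t cap, const char *path, size_t len)
{
	size_t n = 0, i;

	if (cap < 3)
		return -1;	/* need at least the two quotes and a NUL */
	out[n++] = '"';
	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)path[i];

		if (!c)
			return -1;	/* NUL cannot appear in a path */
		if (n + 6 > cap)	/* 4-byte escape + closing quote + NUL */
			return -1;
		if (c == '"' || c == '\\') {
			out[n++] = '\\';
			out[n++] = (char)c;
		} else if (c == '\n') {
			out[n++] = '\\';
			out[n++] = 'n';
		} else if (c < 0x20 || c >= 0x7f) {
			out[n++] = '\\';
			out[n++] = (char)('0' + (c >> 6));
			out[n++] = (char)('0' + ((c >> 3) & 7));
			out[n++] = (char)('0' + (c & 7));
		} else {
			out[n++] = (char)c;
		}
	}
	out[n++] = '"';
	out[n] = '\0';
	return (int)n;
}
```

Escaping every non-printable byte as octal is a design choice made
here for simplicity; it avoids relying on the optional single-letter
escapes and never produces a hex escape.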

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 271bd63a10..4aa8ccbefd 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
 In both formats `<path>` is the complete path of the file to be added
 (if not already existing) or modified (if already existing).
 
-A `<path>` string must use UNIX-style directory separators (forward
-slash `/`), may contain any byte other than `LF`, and must not
-start with double quote (`"`).
+A `<path>` string may contain any byte other than `LF`, and must not
+start with double quote (`"`). It is interpreted as literal bytes
+without escaping.
 
 A path can use C-style string quoting; this is accepted in all cases
 and mandatory if the filename starts with double quote or contains
-`LF`. In C-style quoting, the complete name should be surrounded with
-double quotes, and any `LF`, backslash, or double quote characters
-must be escaped by preceding them with a backslash (e.g.,
-`"path/with\n, \\ and \" in it"`).
+`LF`. In C-style quoting, the complete name is surrounded with
+double quotes (`"`) and certain characters must be escaped by preceding
+them with a backslash: `LF` is written as `\n`, backslash as `\\`, and
+double quote as `\"`. Additionally, some characters may optionally
+be written with escape sequences: `\a` for bell, `\b` for backspace,
+`\f` for form feed, `\n` for line feed, `\r` for carriage return, `\t`
+for horizontal tab, and `\v` for vertical tab. Any byte can be written
+with 3-digit octal codes (e.g., `\033`).
 
-The value of `<path>` must be in canonical form. That is it must not:
+A `<path>` must use UNIX-style directory separators (forward slash `/`)
+and must be in canonical form. That is it must not:
 
 * contain an empty directory component (e.g. `foo//bar` is invalid),
 * end with a directory separator (e.g. `foo/` is invalid),
-- 
2.44.0




* [PATCH 6/6] fast-import: forbid escaped NUL in paths
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
                   ` (4 preceding siblings ...)
  2024-03-22  0:03 ` [PATCH 5/6] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-03-22  0:03 ` Thalia Archibald
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
  6 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:03 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Thalia Archibald

NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.

When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.

fast-import still has other issues with NUL, but those will be addressed
later.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
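The check added below relies on the buffer tracking its own length: if
strlen() stops before that length, a NUL byte is embedded in the data.
A minimal standalone sketch of the idea follows; struct sized_buf and
has_embedded_nul are hypothetical stand-ins for illustration, not Git's
struct strbuf API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a length-tracked buffer such as struct strbuf. */
struct sized_buf {
	const char *buf; /* NUL-terminated, but may contain embedded NULs */
	size_t len;      /* number of bytes before the terminator */
};

/* Returns 1 if the buffer contains a NUL byte before its recorded end. */
static int has_embedded_nul(const struct sized_buf *sb)
{
	return strlen(sb->buf) != sb->len;
}
```

For a path unquoted from `"hel\000lo"`, the recorded length would be 6
while strlen() reports 3, so the mismatch flags the embedded NUL.
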
 Documentation/git-fast-import.txt | 1 +
 builtin/fast-import.c             | 2 ++
 t/t9300-fast-import.sh            | 1 +
 3 files changed, 4 insertions(+)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 4aa8ccbefd..411413e8c3 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -657,6 +657,7 @@ and must be in canonical form. That is it must not:
 The root of the tree can be represented by a quoted empty string (`""`)
 as `<path>`.
 
+`<path>` cannot contain NUL, either literally or escaped as `\000`.
 It is recommended that `<path>` always be encoded using UTF-8.
 
 `filedelete`
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index ae8494d0ac..e36f59084e 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2283,6 +2283,8 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp, int
 	if (*p == '"') {
 		if (unquote_c_style(sb, p, endp))
 			die("Invalid %s: %s", field, command_buf.buf);
+		if (strlen(sb->buf) != sb->len)
+			die("NUL in %s: %s", field, command_buf.buf);
 	} else {
 		if (allow_spaces)
 			*endp = p + strlen(p);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index ef04b43f46..994a80e442 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3285,6 +3285,7 @@ test_path_fail () {
 test_path_base_fail () {
 	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
 	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
+	test_path_fail "escaped NUL in quoted $field"    '"hello\000"' "NUL in $field"
 }
 test_path_eol_quoted_fail () {
 	test_path_base_fail
-- 
2.44.0




* Re: [PATCH 1/6] fast-import: tighten parsing of paths
  2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
@ 2024-03-22  0:11   ` Thalia Archibald
  2024-03-28  8:21   ` Patrick Steinhardt
  1 sibling, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-03-22  0:11 UTC (permalink / raw)
  To: git
  Cc: Johannes Schindelin, Junio C Hamano, Jeff King, Elijah Newren,
	Thalia Rose Archibald

Looks like my cover letter was dropped and placing each Cc: on a separate line
only sends to the last one. Let’s try again. Here's my cover letter and the full
relevant list from contrib/contacts is now CC'd:

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors were suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case was when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
> 
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.
> 
> This patch series is patterned after 06454cb9a3 (fast-import: tighten parsing of
> datarefs, 2012-04-07), which did similar fixes across the grammar, but for
> marks.
> 
> This is my first contribution to Git, so please let me know if there's something
> I've missed. I'm working on a tool for advanced repo transformations (like a
> union of filter-repo and Reposurgeon workflows), so I've been living in
> fast-import code and I have more parsing fixes planned.



* Re: [PATCH 1/6] fast-import: tighten parsing of paths
  2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
  2024-03-22  0:11   ` Thalia Archibald
@ 2024-03-28  8:21   ` Patrick Steinhardt
       [not found]     ` <E01C617F-3720-42C0-83EE-04BB01643C86@archibald.dev>
  1 sibling, 1 reply; 84+ messages in thread
From: Patrick Steinhardt @ 2024-03-28  8:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren


On Fri, Mar 22, 2024 at 12:03:18AM +0000, Thalia Archibald wrote:
> Path parsing in fast-import is inconsistent and many unquoting errors
> are suppressed.
> 
> `<path>` appears in the grammar in these places:
> 
>     filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
>     filedelete ::= 'D' SP <path> LF
>     filecopy   ::= 'C' SP <path> SP <path> LF
>     filerename ::= 'R' SP <path> SP <path> LF
>     ls         ::= 'ls' SP <dataref> SP <path> LF
>     ls-commit  ::= 'ls' SP <path> LF
> 
> and fast-import.c parses them in five different ways:
> 
> 1. For filemodify and filedelete:
>    If `<path>` is a valid quoted string, unquote it;
>    otherwise, treat it as literal bytes (including any number of SP).
> 2. For filecopy (source) and filerename (source):
>    If `<path>` is a valid quoted string, unquote it;
>    otherwise, treat it as literal bytes until the next SP.
> 3. For filecopy (dest) and filerename (dest):
>    Like 1., but an unquoted empty string is an error.
> 4. For ls:
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes (including any number of SP).
> 5. For ls-commit:
>    Unquote `<path>` and report parse errors.
>    (It must start with `"` to disambiguate from ls.)
> 
> In the first three, any errors from trying to unquote a string are
> suppressed, so a quoted string that contains invalid escapes would be
> interpreted as literal bytes. For example, `"\xff"` would fail to
> unquote (because hex escapes are not supported), and it would instead be
> interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
> certainly not intended. Some front-ends erroneously use their language's
> standard quoting routine and could silently introduce escapes that would
> be incorrectly parsed due to this.
> 
> The documentation states that “To use a source path that contains SP the
> path must be quoted.”, so it is expected that some implementations
> depend on spaces being allowed in paths in the final position. Thus we
> have two documented ways to parse paths, so simplify the implementation
> to that.
> 
> Now we have:
> 
> 1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
>    filerename (dest), ls, and ls-commit:
> 
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes (including any number of SP).
>    Garbage after a quoted string or an unquoted empty string are errors.
>    (In ls-commit, it must be quoted to disambiguate from ls.)
> 
> 2. `parse_path_space` for filecopy (source) and filerename (source):
> 
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes until the next SP.
>    It must be followed by a SP. An unquoted empty string is an error.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  Documentation/git-fast-import.txt |   3 +-
>  builtin/fast-import.c             | 115 ++++++++------
>  t/t9300-fast-import.sh            | 252 +++++++++++++++++++++++++++++-
>  3 files changed, 316 insertions(+), 54 deletions(-)
> 
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index b2607366b9..271bd63a10 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -649,7 +649,8 @@ The value of `<path>` must be in canonical form. That is it must not:
>  * contain the special component `.` or `..` (e.g. `foo/./bar` and
>    `foo/../bar` are invalid).
>  
> -The root of the tree can be represented by an empty string as `<path>`.
> +The root of the tree can be represented by a quoted empty string (`""`)
> +as `<path>`.

So this is part of the "filemodify" section with the following syntax:

    'M' SP <mode> SP <dataref> SP <path> LF

The way I interpret this change is that <path> could previously be empty
(`SP LF`), but now it needs to be quoted (`SP '"' '"' LF`). This seems to
be related to cases (1) and (3) of your commit message, where
"filemodify" could contain an unquoted empty string whereas "filecopy"
and "filerename" would complain about such an unquoted string.

In any case I don't see a strong argument why exactly it should be
forbidden to have an unquoted empty path here, and I do wonder whether
it would break existing writers of the format when we retroactively
tighten this case. Isn't it possible to instead loosen it such that all
three of the above actions know to handle unquoted empty paths?

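To spell out the three cases I am talking about, here is a tiny sketch.
The enum and the function name classify_path are made up for
illustration and do not exist in fast-import.c. A loosened parser would
map PATH_UNQUOTED_EMPTY to the root of the tree for all three commands
instead of dying on it.

```c
/* Hypothetical classification of how a <path> field starts. */
enum path_kind { PATH_QUOTED, PATH_UNQUOTED, PATH_UNQUOTED_EMPTY };

static enum path_kind classify_path(const char *p)
{
	if (*p == '"')
		return PATH_QUOTED; /* includes the quoted empty string "" */
	return *p ? PATH_UNQUOTED : PATH_UNQUOTED_EMPTY;
}
```
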
>  It is recommended that `<path>` always be encoded using UTF-8.
>  
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 71a195ca22..b2adec8d9a 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2224,7 +2224,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
>   *
>   *   idnum ::= ':' bigint;
>   *
> - * Return the first character after the value in *endptr.
> + * Update *endptr to point to the first character after the value.

I think it would make sense to put these improvements for comments into
a separate commit. Otherwise it makes you wonder whether this behaviour
is new now.

>   * Complain if the following character is not what is expected,
>   * either a space or end of the string.
> @@ -2257,8 +2257,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
>  }
>  
>  /*
> - * Parse the mark reference, demanding a trailing space.  Return a
> - * pointer to the space.
> + * Parse the mark reference, demanding a trailing space. Update *p to
> + * point to the first character after the space.
>   */

Same.

>  static uintmax_t parse_mark_ref_space(const char **p)
>  {
> @@ -2272,10 +2272,57 @@ static uintmax_t parse_mark_ref_space(const char **p)
>  	return mark;
>  }
>  
> +/*
> + * Parse the path string into the strbuf. It may be quoted with escape sequences
> + * or unquoted without escape sequences. When unquoted, it may only contain a
> + * space if `allow_spaces` is nonzero.
> + */
> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
> +{
> +	strbuf_reset(sb);

It's not all that customary in our codebase to have the function reset
the `strbuf` for the caller because it does make the function less
flexible. I would either keep the `strbuf_reset()` on the caller side or
at least document this behaviour in the comment.

> +	if (*p == '"') {
> +		if (unquote_c_style(sb, p, endp))
> +			die("Invalid %s: %s", field, command_buf.buf);
> +	} else {
> +		if (allow_spaces)
> +			*endp = p + strlen(p);

I wonder whether `stop_at_unquoted_space` might be more telling. It's
not like we disallow spaces here, it's that we treat them as the
separator to the next field.

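As a rough sketch of what I mean by the naming, assuming a standalone
helper (unquoted_path_end is a hypothetical name, not something in
fast-import.c): the flag only decides where the unquoted field ends.
The fallback to the end of the string when no SP is present is a choice
of this sketch, so that the result is never NULL.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the end of an unquoted path
 * starting at p. With stop_at_unquoted_space set, the first SP ends the
 * field; otherwise the field runs to the end of the line.
 */
static const char *unquoted_path_end(const char *p, int stop_at_unquoted_space)
{
	const char *sp;

	if (!stop_at_unquoted_space)
		return p + strlen(p);
	sp = strchr(p, ' ');
	/* No SP at all: fall back to the end of the string, never NULL. */
	return sp ? sp : p + strlen(p);
}
```
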
> +		else
> +			*endp = strchr(p, ' ');
> +		if (*endp == p)
> +			die("Missing %s: %s", field, command_buf.buf);

Error messages should start with a lower-case letter and be
translatable. But these are simply moved over from the previous code,
so I don't mind much if you want to keep them as-is.

> +		strbuf_add(sb, p, *endp - p);
> +	}
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and complain if this is not the end of
> + * the string. It may contain spaces even when unquoted.
> + */
> +static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
> +{
> +	const char *end;
> +
> +	parse_path(sb, p, &end, 1, field);
> +	if (*end)
> +		die("Garbage after %s: %s", field, command_buf.buf);
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and ensure it is followed by a space.
> + * It may not contain spaces when unquoted. Update *endp to point to the first
> + * character after the space.
> + */
> +static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
> +{
> +	parse_path(sb, p, endp, 0, field);
> +	if (**endp != ' ')
> +		die("Missing space after %s: %s", field, command_buf.buf);
> +	(*endp)++;
> +}
> +
>  static void file_change_m(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  	struct object_entry *oe;
>  	struct object_id oid;
>  	uint16_t mode, inline_data = 0;
> @@ -2312,12 +2359,8 @@ static void file_change_m(const char *p, struct branch *b)
>  			die("Missing space after SHA1: %s", command_buf.buf);
>  	}
>  
> -	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;

This is loosening the condition so that we also accept unquoted paths
now. Okay.

>  
>  	/* Git does not track empty, non-toplevel directories. */
>  	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
> @@ -2381,48 +2424,23 @@ static void file_change_m(const char *p, struct branch *b)
>  static void file_change_d(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  
> -	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;
>  	tree_content_remove(&b->branch_tree, p, NULL, 1);
>  }

Same.

> -static void file_change_cr(const char *s, struct branch *b, int rename)
> +static void file_change_cr(const char *p, struct branch *b, int rename)
>  {
> -	const char *d;
> +	const char *s, *d;
>  	static struct strbuf s_uq = STRBUF_INIT;
>  	static struct strbuf d_uq = STRBUF_INIT;
> -	const char *endp;
>  	struct tree_entry leaf;
>  
> -	strbuf_reset(&s_uq);
> -	if (!unquote_c_style(&s_uq, s, &endp)) {
> -		if (*endp != ' ')
> -			die("Missing space after source: %s", command_buf.buf);
> -	} else {
> -		endp = strchr(s, ' ');
> -		if (!endp)
> -			die("Missing space after source: %s", command_buf.buf);
> -		strbuf_add(&s_uq, s, endp - s);
> -	}
> +	parse_path_space(&s_uq, p, &p, "source");
> +	parse_path_eol(&d_uq, p, "dest");
>  	s = s_uq.buf;
> -
> -	endp++;
> -	if (!*endp)
> -		die("Missing dest: %s", command_buf.buf);
> -
> -	d = endp;
> -	strbuf_reset(&d_uq);
> -	if (!unquote_c_style(&d_uq, d, &endp)) {
> -		if (*endp)
> -			die("Garbage after dest in: %s", command_buf.buf);
> -		d = d_uq.buf;
> -	}
> +	d = d_uq.buf;

Nice simplification. The source path should behave the same, and parsing
of the destination path has been loosened to also allow unquoted paths.

>  	memset(&leaf, 0, sizeof(leaf));
>  	if (rename)
> @@ -3168,6 +3186,7 @@ static void parse_ls(const char *p, struct branch *b)
>  {
>  	struct tree_entry *root = NULL;
>  	struct tree_entry leaf = {NULL};
> +	static struct strbuf uq = STRBUF_INIT;

I know the code had this as a static variable before, as well. But is
this really necessary? Can't we leave it as non-static and then release
the buffer at the end of this function?

>  	/* ls SP (<tree-ish> SP)? <path> */
>  	if (*p == '"') {
> @@ -3182,16 +3201,8 @@ static void parse_ls(const char *p, struct branch *b)
>  			root->versions[1].mode = S_IFDIR;
>  		load_tree(root);
>  	}
> -	if (*p == '"') {
> -		static struct strbuf uq = STRBUF_INIT;
> -		const char *endp;
> -		strbuf_reset(&uq);
> -		if (unquote_c_style(&uq, p, &endp))
> -			die("Invalid path: %s", command_buf.buf);
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;

And this case should behave the same.

>  	tree_content_get(root, p, &leaf, 1);
>  	/*
>  	 * A directory in preparation would have a sha1 of zero
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index dbb5042b0b..ef04b43f46 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -2146,6 +2146,7 @@ test_expect_success 'Q: deny note on empty branch' '
>  	EOF
>  	test_must_fail git fast-import <input
>  '
> +
>  ###
>  ### series R (feature and option)
>  ###
> @@ -2794,7 +2795,7 @@ test_expect_success 'R: blob appears only once' '
>  '
>  
>  ###
> -### series S
> +### series S (mark and path parsing)
>  ###
>  #
>  # Make sure missing spaces and EOLs after mark references
> @@ -3064,6 +3065,255 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
>  	test_grep "space after tree-ish" err
>  '
>  
> +#
> +# Path parsing
> +#
> +# There are two sorts of ways a path can be parsed, depending on whether it is
> +# the last field on the line. Additionally, ls without a <tree-ish> has a
> +# special case. Test every occurrence of <path> in the grammar against every
> +# error case.
> +#
> +
> +#
> +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
> +# filerename (dest), and ls.
> +#
> +# commit :301 from root -- modify hello.c
> +# commit :302 from :301 -- modify $path
> +# commit :303 from :302 -- delete $path
> +# commit :304 from :301 -- copy hello.c $path
> +# commit :305 from :301 -- rename hello.c $path
> +# ls :305 $path
> +#
> +test_path_eol_success () {
> +	test="$1" path="$2" unquoted_path="$3"

Should these variables be local?

> +	test_expect_success "S: paths at EOL with $test must work" '
> +		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		blob
> +		mark :402
> +		data <<BLOB
> +		hallo welt
> +		BLOB
> +
> +		commit refs/heads/path-eol
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 hello.c
> +
> +		commit refs/heads/path-eol
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filemodify
> +		COMMIT
> +		from :301
> +		M 100644 :402 '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filedelete
> +		COMMIT
> +		from :302
> +		D '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :304
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy dest
> +		COMMIT
> +		from :301
> +		C hello.c '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :305
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename dest
> +		COMMIT
> +		from :301
> +		R hello.c '"$path"'
> +
> +		ls :305 '"$path"'
> +		EOF
> +
> +		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
> +		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
> +		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
> +		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
> +		git ls-tree $commit_m | sort >tree_m.out &&
> +		test_cmp tree_m.exp tree_m.out &&
> +
> +		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
> +		git ls-tree $commit_d >tree_d.out &&
> +		test_cmp tree_d.exp tree_d.out &&
> +
> +		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out &&
> +
> +		test_cmp out tree_r.exp &&
> +
> +		git branch -D path-eol
> +	'
> +}
> +
> +test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
> +test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
> +
> +#
> +# Valid paths before a space: filecopy (source) and filerename (source).
> +#
> +# commit :301 from root -- modify $path
> +# commit :302 from :301 -- copy $path hello2.c
> +# commit :303 from :301 -- rename $path hello2.c
> +#
> +test_path_space_success () {
> +	test="$1" path="$2" unquoted_path="$3"

Same question here, should these be local?

> +	test_expect_success "S: paths before space with $test must work" '
> +		git fast-import --export-marks=marks.out <<-EOF 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/path-space
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 '"$path"'
> +
> +		commit refs/heads/path-space
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy source
> +		COMMIT
> +		from :301
> +		C '"$path"' hello2.c
> +
> +		commit refs/heads/path-space
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename source
> +		COMMIT
> +		from :301
> +		R '"$path"' hello2.c
> +
> +		EOF
> +
> +		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
> +		blob=$(grep :401 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out &&
> +
> +		git branch -D path-space
> +	'
> +}
> +
> +test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
> +test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
> +
> +#
> +# Test a single commit change with an invalid path. Run it with all occurrences
> +# of <path> in the grammar against all error kinds.
> +#
> +test_path_fail () {
> +	what="$1" path="$2" err_grep="$3"

Same.

> +	test_expect_success "S: $change with $what must fail" '
> +		test_must_fail git fast-import <<-EOF 2>err &&
> +		blob
> +		mark :1
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/S-path-fail
> +		mark :2
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit setup
> +		COMMIT
> +		M 100644 :1 hello.c
> +
> +		commit refs/heads/S-path-fail
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit with bad path
> +		COMMIT
> +		from :2
> +		'"$prefix$path$suffix"'
> +		EOF
> +
> +		test_grep '"'$err_grep'"' err
> +	'
> +}
> +
> +test_path_base_fail () {
> +	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
> +	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
> +}
> +test_path_eol_quoted_fail () {
> +	test_path_base_fail
> +	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
> +	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
> +}
> +test_path_eol_fail () {
> +	test_path_eol_quoted_fail
> +	test_path_fail 'empty unquoted path' '' "Missing $field"
> +}
> +test_path_space_fail () {
> +	test_path_base_fail
> +	test_path_fail 'empty unquoted path' '' "Missing $field"
> +	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
> +}
> +
> +change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
> +change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
> +change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
> +change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
> +change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
> +change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
> +change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail

This is quite confusing because you now mix two different styles, where
some of the functions take arguments while others pass arguments via
variables. I think it would be preferable to pass all arguments as
proper function arguments.

Patrick

> +# When 'ls' has no <tree-ish>, the <path> must be quoted.
> +change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
> +test_path_eol_quoted_fail &&
> +test_path_fail 'empty unquoted path' '' "Invalid dataref"
> +
>  ###
>  ### series T (ls)
>  ###
> -- 
> 2.44.0
> 
> 
> 



* Re: [PATCH 2/6] fast-import: directly use strbufs for paths
  2024-03-22  0:03 ` [PATCH 2/6] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-03-28  8:21   ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-03-28  8:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren


On Fri, Mar 22, 2024 at 12:03:25AM +0000, Thalia Archibald wrote:
> Previously, one case would not write the path to the strbuf: when the
> path is unquoted and at the end of the string. It was essentially
> copy-on-write. However, with the logic simplification of the previous
> commit, this case was eliminated and the strbuf is always populated.
> 
> Directly use the strbufs now instead of an alias.
> 
> Since this already changes all the lines that use the strbufs, rename
> them from `uq` to be more descriptive. That they are unquoted is not
> their most important property, so name them after what they carry.
> 
> Additionally, `file_change_m` no longer needs to copy the path before
> reading inline data.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 54 ++++++++++++++++++-------------------------
>  1 file changed, 22 insertions(+), 32 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index b2adec8d9a..1b3d6784c1 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2322,7 +2322,7 @@ static void parse_path_space(struct strbuf *sb, const char *p, const char **endp
>  
>  static void file_change_m(const char *p, struct branch *b)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;

I was about to propose that we should likely also change all of these
static variables to be local instead. I don't think that we use the
variables after the function calls. But now that I see that we do it
like this in all of these helpers I think what's going on is that this
is a memory optimization to avoid reallocating buffers all the time.

Ugly, but so be it. We could refactor the code to pass in scratch
buffers from the outside to remove those static variables. But that
certainly would be a bigger change and thus likely outside of the scope
of this patch series.

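For the record, a rough sketch of the scratch-buffer idea, with a
made-up minimal buffer type in place of struct strbuf (struct scratch,
scratch_set and parse_into are all hypothetical names). The caller owns
the buffer and its allocation is reused across calls, which keeps the
memory optimization without static state.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal growable buffer; Git's real type is struct strbuf. */
struct scratch {
	char *buf;
	size_t len, alloc;
};

/* Replace the contents of s with the n bytes at src, reusing its allocation. */
static int scratch_set(struct scratch *s, const char *src, size_t n)
{
	if (n + 1 > s->alloc) {
		char *grown = realloc(s->buf, n + 1);
		if (!grown)
			return -1;
		s->buf = grown;
		s->alloc = n + 1;
	}
	memcpy(s->buf, src, n);
	s->buf[n] = '\0';
	s->len = n;
	return 0;
}

/* The caller owns the scratch buffer, so the parser needs no static state. */
static int parse_into(struct scratch *path, const char *field)
{
	return scratch_set(path, field, strlen(field));
}
```

The command loop would declare one such buffer, pass it to each helper,
and free it once at the end.
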
Patrick

>  	struct object_entry *oe;
>  	struct object_id oid;
>  	uint16_t mode, inline_data = 0;
> @@ -2359,12 +2359,11 @@ static void file_change_m(const char *p, struct branch *b)
>  			die("Missing space after SHA1: %s", command_buf.buf);
>  	}
>  
> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> +	parse_path_eol(&path, p, "path");
>  
>  	/* Git does not track empty, non-toplevel directories. */
> -	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
> -		tree_content_remove(&b->branch_tree, p, NULL, 0);
> +	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
> +		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
>  		return;
>  	}
>  
> @@ -2385,10 +2384,6 @@ static void file_change_m(const char *p, struct branch *b)
>  		if (S_ISDIR(mode))
>  			die("Directories cannot be specified 'inline': %s",
>  				command_buf.buf);
> -		if (p != uq.buf) {
> -			strbuf_addstr(&uq, p);
> -			p = uq.buf;
> -		}
>  		while (read_next_command() != EOF) {
>  			const char *v;
>  			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
> @@ -2414,49 +2409,45 @@ static void file_change_m(const char *p, struct branch *b)
>  				command_buf.buf);
>  	}
>  
> -	if (!*p) {
> +	if (!*path.buf) {
>  		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
>  		return;
>  	}
> -	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
> +	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
>  }
>  
>  static void file_change_d(const char *p, struct branch *b)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;
>  
> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> -	tree_content_remove(&b->branch_tree, p, NULL, 1);
> +	parse_path_eol(&path, p, "path");
> +	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
>  }
>  
>  static void file_change_cr(const char *p, struct branch *b, int rename)
>  {
> -	const char *s, *d;
> -	static struct strbuf s_uq = STRBUF_INIT;
> -	static struct strbuf d_uq = STRBUF_INIT;
> +	static struct strbuf source = STRBUF_INIT;
> +	static struct strbuf dest = STRBUF_INIT;
>  	struct tree_entry leaf;
>  
> -	parse_path_space(&s_uq, p, &p, "source");
> -	parse_path_eol(&d_uq, p, "dest");
> -	s = s_uq.buf;
> -	d = d_uq.buf;
> +	parse_path_space(&source, p, &p, "source");
> +	parse_path_eol(&dest, p, "dest");
>  
>  	memset(&leaf, 0, sizeof(leaf));
>  	if (rename)
> -		tree_content_remove(&b->branch_tree, s, &leaf, 1);
> +		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
>  	else
> -		tree_content_get(&b->branch_tree, s, &leaf, 1);
> +		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
>  	if (!leaf.versions[1].mode)
> -		die("Path %s not in branch", s);
> -	if (!*d) {	/* C "path/to/subdir" "" */
> +		die("Path %s not in branch", source.buf);
> +	if (!*dest.buf) {	/* C "path/to/subdir" "" */
>  		tree_content_replace(&b->branch_tree,
>  			&leaf.versions[1].oid,
>  			leaf.versions[1].mode,
>  			leaf.tree);
>  		return;
>  	}
> -	tree_content_set(&b->branch_tree, d,
> +	tree_content_set(&b->branch_tree, dest.buf,
>  		&leaf.versions[1].oid,
>  		leaf.versions[1].mode,
>  		leaf.tree);
> @@ -3186,7 +3177,7 @@ static void parse_ls(const char *p, struct branch *b)
>  {
>  	struct tree_entry *root = NULL;
>  	struct tree_entry leaf = {NULL};
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;
>  
>  	/* ls SP (<tree-ish> SP)? <path> */
>  	if (*p == '"') {
> @@ -3201,9 +3192,8 @@ static void parse_ls(const char *p, struct branch *b)
>  			root->versions[1].mode = S_IFDIR;
>  		load_tree(root);
>  	}
> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> -	tree_content_get(root, p, &leaf, 1);
> +	parse_path_eol(&path, p, "path");
> +	tree_content_get(root, path.buf, &leaf, 1);
>  	/*
>  	 * A directory in preparation would have a sha1 of zero
>  	 * until it is saved.  Save, for simplicity.
> @@ -3211,7 +3201,7 @@ static void parse_ls(const char *p, struct branch *b)
>  	if (S_ISDIR(leaf.versions[1].mode))
>  		store_tree(&leaf);
>  
> -	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
> +	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
>  	if (leaf.tree)
>  		release_tree_content_recursive(leaf.tree);
>  	if (!b || root != &b->branch_tree)
> -- 
> 2.44.0
> 
> 
> 


* Re: [PATCH 3/6] fast-import: release unfreed strbufs
  2024-03-22  0:03 ` [PATCH 3/6] fast-import: release unfreed strbufs Thalia Archibald
@ 2024-03-28  8:21   ` Patrick Steinhardt
  2024-04-01  9:06     ` Thalia Archibald
  0 siblings, 1 reply; 84+ messages in thread
From: Patrick Steinhardt @ 2024-03-28  8:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

On Fri, Mar 22, 2024 at 12:03:33AM +0000, Thalia Archibald wrote:
> These strbufs are owned. Release them at the end of their scopes.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 29 ++++++++++++++++++-----------
>  1 file changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 1b3d6784c1..d6f998f363 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2364,6 +2364,7 @@ static void file_change_m(const char *p, struct branch *b)
>  	/* Git does not track empty, non-toplevel directories. */
>  	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
>  		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
> +		strbuf_release(&path);
>  		return;
>  	}

Oh, now you get to my comment in the preceding patch. With this patch
we're now in a somewhat weird in-between state where the buffers are
still static, but we release their memory after each call. So we kind of
get the worst of both worlds: static variables without being able to
reuse the buffer's memory.

If we were to change this then we should definitely mark the buffers as
non-static. If so, it would be great to demonstrate that this does not
significantly impact performance.

The same is true for all the other instances.

Patrick
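
To make the trade-off above concrete, here is a minimal sketch of the two
buffer lifetimes being compared. It does not use Git's strbuf; `struct
buf_sketch`, `buf_add`, `buf_reset`, `buf_release`, and the two `handle_*`
functions are hypothetical stand-ins, and `alloc_calls` exists only to count
how often the buffer has to grow. The assumption mirrored here is that
resetting a buffer keeps its allocation while releasing it frees the memory.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable string buffer (not Git's strbuf). */
struct buf_sketch {
	char *data;
	size_t len, alloc;
};

/* Counts how many times a buffer had to (re)allocate. */
static unsigned long alloc_calls;

static void buf_add(struct buf_sketch *b, const char *s)
{
	size_t n = strlen(s);

	if (b->len + n + 1 > b->alloc) {
		size_t want = b->len + n + 1;
		char *p = realloc(b->data, want);

		if (!p)
			abort();
		b->data = p;
		b->alloc = want;
		alloc_calls++;
	}
	memcpy(b->data + b->len, s, n + 1);
	b->len += n;
}

/* Forget the contents but keep the allocation for the next call. */
static void buf_reset(struct buf_sketch *b)
{
	b->len = 0;
	if (b->data)
		b->data[0] = '\0';
}

/* Give the memory back; the next use must allocate again. */
static void buf_release(struct buf_sketch *b)
{
	free(b->data);
	b->data = NULL;
	b->len = b->alloc = 0;
}

/* Static buffer, reset per call: memory is reused across calls. */
static size_t handle_static_reuse(const char *path)
{
	static struct buf_sketch sb;

	buf_reset(&sb);
	buf_add(&sb, path);
	return sb.len;
}

/* Local buffer, released per call: one allocation for every call. */
static size_t handle_local_release(const char *path)
{
	struct buf_sketch sb = { NULL, 0, 0 };
	size_t len;

	buf_add(&sb, path);
	len = sb.len;
	buf_release(&sb);
	return len;
}
```

Calling `handle_static_reuse` repeatedly with paths of the same length grows
the buffer once, whereas `handle_local_release` allocates on every call. A
static buffer that is also released after each call pays the per-call
allocation while still carrying the static state, which is the in-between
state described above. Whether the difference is measurable in fast-import
would need a benchmark.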

> @@ -2409,11 +2410,11 @@ static void file_change_m(const char *p, struct branch *b)
>  				command_buf.buf);
>  	}
>  
> -	if (!*path.buf) {
> +	if (*path.buf)
> +		tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
> +	else
>  		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
> -		return;
> -	}
> -	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
> +	strbuf_release(&path);
>  }
>  
>  static void file_change_d(const char *p, struct branch *b)
> @@ -2422,6 +2423,7 @@ static void file_change_d(const char *p, struct branch *b)
>  
>  	parse_path_eol(&path, p, "path");
>  	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
> +	strbuf_release(&path);
>  }
>  
>  static void file_change_cr(const char *p, struct branch *b, int rename)
> @@ -2440,17 +2442,18 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
>  		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
>  	if (!leaf.versions[1].mode)
>  		die("Path %s not in branch", source.buf);
> -	if (!*dest.buf) {	/* C "path/to/subdir" "" */
> +	if (*dest.buf)
> +		tree_content_set(&b->branch_tree, dest.buf,
> +			&leaf.versions[1].oid,
> +			leaf.versions[1].mode,
> +			leaf.tree);
> +	else	/* C "path/to/subdir" "" */
>  		tree_content_replace(&b->branch_tree,
>  			&leaf.versions[1].oid,
>  			leaf.versions[1].mode,
>  			leaf.tree);
> -		return;
> -	}
> -	tree_content_set(&b->branch_tree, dest.buf,
> -		&leaf.versions[1].oid,
> -		leaf.versions[1].mode,
> -		leaf.tree);
> +	strbuf_release(&source);
> +	strbuf_release(&dest);
>  }
>  
>  static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
> @@ -2804,6 +2807,7 @@ static void parse_new_commit(const char *arg)
>  	free(author);
>  	free(committer);
>  	free(encoding);
> +	strbuf_release(&msg);
>  
>  	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
>  		b->pack_id = pack_id;
> @@ -2886,6 +2890,7 @@ static void parse_new_tag(const char *arg)
>  	strbuf_addch(&new_data, '\n');
>  	strbuf_addbuf(&new_data, &msg);
>  	free(tagger);
> +	strbuf_release(&msg);
>  
>  	if (store_object(OBJ_TAG, &new_data, NULL, &t->oid, next_mark))
>  		t->pack_id = MAX_PACK_ID;
> @@ -3171,6 +3176,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
>  		strbuf_addch(&line, '\n');
>  	}
>  	cat_blob_write(line.buf, line.len);
> +	strbuf_release(&line);
>  }
>  
>  static void parse_ls(const char *p, struct branch *b)
> @@ -3206,6 +3212,7 @@ static void parse_ls(const char *p, struct branch *b)
>  		release_tree_content_recursive(leaf.tree);
>  	if (!b || root != &b->branch_tree)
>  		release_tree_entry(root);
> +	strbuf_release(&path);
>  }
>  
>  static void checkpoint(void)
> -- 
> 2.44.0
> 
> 
> 


* Re: [PATCH 4/6] fast-import: remove dead strbuf
  2024-03-22  0:03 ` [PATCH 4/6] fast-import: remove dead strbuf Thalia Archibald
@ 2024-03-28  8:21   ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-03-28  8:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

On Fri, Mar 22, 2024 at 12:03:40AM +0000, Thalia Archibald wrote:
> The strbuf in `note_change_n` has been unused since the function was
> created in a8dd2e7d2b (fast-import: Add support for importing commit
> notes, 2009-10-09) and looks to be a fossil from adapting
> `note_change_m`. Remove it.

Just from inspecting the diff it's not clear that it is actually unused
given that we assign `p = uq.buf`. The message here should probably
mention the important detail that `p` is not actually used after the
assignment.

Patrick

> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index d6f998f363..ae8494d0ac 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2458,7 +2458,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
>  
>  static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
>  	struct object_entry *oe;
>  	struct branch *s;
>  	struct object_id oid, commit_oid;
> @@ -2523,10 +2522,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
>  		die("Invalid ref name or SHA1 expression: %s", p);
>  
>  	if (inline_data) {
> -		if (p != uq.buf) {
> -			strbuf_addstr(&uq, p);
> -			p = uq.buf;
> -		}
>  		read_next_command();
>  		parse_and_store_blob(&last_blob, &oid, 0);
>  	} else if (oe) {
> -- 
> 2.44.0
> 
> 
> 


* Re: [PATCH 5/6] fast-import: document C-style escapes for paths
  2024-03-22  0:03 ` [PATCH 5/6] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-03-28  8:21   ` Patrick Steinhardt
  2024-04-01  9:06     ` Thalia Archibald
  0 siblings, 1 reply; 84+ messages in thread
From: Patrick Steinhardt @ 2024-03-28  8:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

On Fri, Mar 22, 2024 at 12:03:47AM +0000, Thalia Archibald wrote:
> Simply saying “C-style” string quoting is imprecise, as only a subset of
> C escapes are supported. Document the exact escapes.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  Documentation/git-fast-import.txt | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index 271bd63a10..4aa8ccbefd 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
>  In both formats `<path>` is the complete path of the file to be added
>  (if not already existing) or modified (if already existing).
>  
> -A `<path>` string must use UNIX-style directory separators (forward
> -slash `/`), may contain any byte other than `LF`, and must not
> -start with double quote (`"`).
> +A `<path>` string may contain any byte other than `LF`, and must not
> +start with double quote (`"`). It is interpreted as literal bytes
> +without escaping.

Paths also mustn't start with a space in many cases, right?

Patrick

>  A path can use C-style string quoting; this is accepted in all cases
>  and mandatory if the filename starts with double quote or contains
> -`LF`. In C-style quoting, the complete name should be surrounded with
> -double quotes, and any `LF`, backslash, or double quote characters
> -must be escaped by preceding them with a backslash (e.g.,
> -`"path/with\n, \\ and \" in it"`).
> +`LF`. In C-style quoting, the complete name is surrounded with
> +double quotes (`"`) and certain characters must be escaped by preceding
> +them with a backslash: `LF` is written as `\n`, backslash as `\\`, and
> +double quote as `\"`. Additionally, some characters may optionally
> +be written with escape sequences: `\a` for bell, `\b` for backspace,
> +`\f` for form feed, `\n` for line feed, `\r` for carriage return, `\t`
> +for horizontal tab, and `\v` for vertical tab. Any byte can be written
> +with 3-digit octal codes (e.g., `\033`).
>  
> -The value of `<path>` must be in canonical form. That is it must not:
> +A `<path>` must use UNIX-style directory separators (forward slash `/`)
> +and must be in canonical form. That is it must not:
>  
>  * contain an empty directory component (e.g. `foo//bar` is invalid),
>  * end with a directory separator (e.g. `foo/` is invalid),
> -- 
> 2.44.0
> 
> 
> 
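
The escape rules listed in the documentation patch above can be sketched as a
small decoder. This is an illustration only, not Git's `unquote_c_style`: the
function name `unquote_path_sketch`, its signature, and its error convention
are assumptions made for this example. It accepts exactly the escapes named
in the patch (`\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v`, `\\`, `\"`, and
3-digit octal codes) and rejects everything else, including hex escapes such
as `\xff`. It assumes the first octal digit is limited to 0-3 so the value
fits in one byte.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not Git's implementation): decode a C-style quoted
 * path into out. Returns the decoded length, or -1 on error (no opening
 * quote, unclosed quote, unknown escape, short octal code, or output
 * buffer too small). On success *endp points just past the closing quote.
 */
static int unquote_path_sketch(const char *in, char *out, size_t outsz,
			       const char **endp)
{
	size_t n = 0;

	if (outsz == 0 || *in++ != '"')
		return -1;
	for (;;) {
		unsigned char c = (unsigned char)*in++;

		if (c == '\0')
			return -1;	/* unclosed quote */
		if (c == '"')
			break;
		if (c == '\\') {
			c = (unsigned char)*in++;
			switch (c) {
			case 'a': c = '\a'; break;
			case 'b': c = '\b'; break;
			case 'f': c = '\f'; break;
			case 'n': c = '\n'; break;
			case 'r': c = '\r'; break;
			case 't': c = '\t'; break;
			case 'v': c = '\v'; break;
			case '\\': case '"':
				break;	/* the character stands for itself */
			case '0': case '1': case '2': case '3':
				/* exactly two more octal digits must follow */
				if (in[0] < '0' || in[0] > '7' ||
				    in[1] < '0' || in[1] > '7')
					return -1;
				c = (unsigned char)(((c - '0') << 6) |
						    ((in[0] - '0') << 3) |
						    (in[1] - '0'));
				in += 2;
				break;
			default:
				return -1;	/* e.g. \x is not supported */
			}
		}
		if (n + 1 >= outsz)
			return -1;
		out[n++] = (char)c;
	}
	out[n] = '\0';
	*endp = in;
	return (int)n;
}
```

A caller that wants the end-of-line behaviour would additionally check that
`*endp` points at the end of the line, and one that wants the
space-terminated behaviour would check for a following SP.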


* [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
                   ` (5 preceding siblings ...)
  2024-03-22  0:03 ` [PATCH 6/6] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-01  9:02 ` Thalia Archibald
  2024-04-01  9:02   ` [PATCH v2 1/8] fast-import: tighten path unquoting Thalia Archibald
                     ` (9 more replies)
  6 siblings, 10 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:02 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case is when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
>
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.

Thanks for the review, Patrick. I've made several changes, which I think address
your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
whether I, you, or Junio add it, I've refrained from attaching your name to
these patches.

Changes since v1:
* In fast-import:
  * Move `strbuf_release` outside of `parse_path_space` and `parse_path_eol`.
  * Retain allocations for static `strbuf`s.
  * Rename `allow_spaces` parameter of `parse_path` to `include_spaces`.
  * Extract change to neighboring comments as patch 8.
* In tests:
  * Additionally test an unquoted empty path for the root in all tests using
    `""`.
  * Pass all arguments by positional variables.
  * Use `local`.
  * Use `test_when_finished` instead of manual cleanup.
* In documentation:
  * Better document conditions under which a path is considered quoted or
    unquoted.
* Reword commit messages.

Thalia


Thalia Archibald (8):
  fast-import: tighten path unquoting
  fast-import: directly use strbufs for paths
  fast-import: allow unquoted empty path for root
  fast-import: remove dead strbuf
  fast-import: improve documentation for unquoted paths
  fast-import: document C-style escapes for paths
  fast-import: forbid escaped NUL in paths
  fast-import: make comments more precise

 Documentation/git-fast-import.txt |  30 +-
 builtin/fast-import.c             | 156 ++++----
 t/t9300-fast-import.sh            | 617 +++++++++++++++++++++---------
 3 files changed, 541 insertions(+), 262 deletions(-)

Range-diff against v1:
1:  8d9e0b25cb ! 1:  e790bdf714 fast-import: tighten parsing of paths
    @@ Metadata
     Author: Thalia Archibald <thalia@archibald.dev>

      ## Commit message ##
    -    fast-import: tighten parsing of paths
    +    fast-import: tighten path unquoting

         Path parsing in fast-import is inconsistent and many unquoting errors
    -    are suppressed.
    +    are suppressed or not checked.

    -    `<path>` appears in the grammar in these places:
    +    <path> appears in the grammar in these places:

             filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
             filedelete ::= 'D' SP <path> LF
    @@ Commit message
         and fast-import.c parses them in five different ways:

         1. For filemodify and filedelete:
    -       If `<path>` is a valid quoted string, unquote it;
    -       otherwise, treat it as literal bytes (including any number of SP).
    +       Try to unquote <path>. If it unquotes without errors, use the
    +       unquoted version; otherwise, treat it as literal bytes to the end of
    +       the line (including any number of SP).
         2. For filecopy (source) and filerename (source):
    -       If `<path>` is a valid quoted string, unquote it;
    -       otherwise, treat it as literal bytes until the next SP.
    +       Try to unquote <path>. If it unquotes without errors, use the
    +       unquoted version; otherwise, treat it as literal bytes up to, but not
    +       including, the next SP.
         3. For filecopy (dest) and filerename (dest):
    -       Like 1., but an unquoted empty string is an error.
    +       Like 1., but an unquoted empty string is forbidden.
         4. For ls:
    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes (including any number of SP).
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes to the end of the line
    +       (including any number of SP).
         5. For ls-commit:
    -       Unquote `<path>` and report parse errors.
    +       Unquote <path> and report parse errors.
            (It must start with `"` to disambiguate from ls.)

         In the first three, any errors from trying to unquote a string are
         suppressed, so a quoted string that contains invalid escapes would be
         interpreted as literal bytes. For example, `"\xff"` would fail to
         unquote (because hex escapes are not supported), and it would instead be
    -    interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
    +    interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
         certainly not intended. Some front-ends erroneously use their language's
    -    standard quoting routine and could silently introduce escapes that would
    -    be incorrectly parsed due to this.
    +    standard quoting routine instead of matching Git's, which could silently
    +    introduce escapes that would be incorrectly parsed due to this and lead
    +    to data corruption.

    -    The documentation states that “To use a source path that contains SP the
    -    path must be quoted.”, so it is expected that some implementations
    -    depend on spaces being allowed in paths in the final position. Thus we
    -    have two documented ways to parse paths, so simplify the implementation
    -    to that.
    +    The documentation states “To use a source path that contains SP the path
    +    must be quoted.”, so it is expected that some implementations depend on
    +    spaces being allowed in paths in the final position. Thus we have two
    +    documented ways to parse paths, so simplify the implementation to that.

         Now we have:

         1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
            filerename (dest), ls, and ls-commit:

    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes (including any number of SP).
    -       Garbage after a quoted string or an unquoted empty string are errors.
    -       (In ls-commit, it must be quoted to disambiguate from ls.)
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes to the end of the line
    +       (including any number of SP).

         2. `parse_path_space` for filecopy (source) and filerename (source):

    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes until the next SP.
    -       It must be followed by a SP. An unquoted empty string is an error.
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes up to, but not including, the
    +       next SP. It must be followed by SP.
    +
    +    There remain two special cases: The dest <path> in filecopy and rename
    +    cannot be an unquoted empty string (this will be addressed subsequently)
    +    and <path> in ls-commit must be quoted to disambiguate it from ls.

         Signed-off-by: Thalia Archibald <thalia@archibald.dev>

    - ## Documentation/git-fast-import.txt ##
    -@@ Documentation/git-fast-import.txt: The value of `<path>` must be in canonical form. That is it must not:
    - * contain the special component `.` or `..` (e.g. `foo/./bar` and
    -   `foo/../bar` are invalid).
    -
    --The root of the tree can be represented by an empty string as `<path>`.
    -+The root of the tree can be represented by a quoted empty string (`""`)
    -+as `<path>`.
    -
    - It is recommended that `<path>` always be encoded using UTF-8.
    -
    -
      ## builtin/fast-import.c ##
    -@@ builtin/fast-import.c: static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
    -  *
    -  *   idnum ::= ':' bigint;
    -  *
    -- * Return the first character after the value in *endptr.
    -+ * Update *endptr to point to the first character after the value.
    -  *
    -  * Complain if the following character is not what is expected,
    -  * either a space or end of the string.
    -@@ builtin/fast-import.c: static uintmax_t parse_mark_ref_eol(const char *p)
    - }
    -
    - /*
    -- * Parse the mark reference, demanding a trailing space.  Return a
    -- * pointer to the space.
    -+ * Parse the mark reference, demanding a trailing space. Update *p to
    -+ * point to the first character after the space.
    -  */
    - static uintmax_t parse_mark_ref_space(const char **p)
    - {
     @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
      	return mark;
      }
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     +/*
     + * Parse the path string into the strbuf. It may be quoted with escape sequences
     + * or unquoted without escape sequences. When unquoted, it may only contain a
    -+ * space if `allow_spaces` is nonzero.
    ++ * space if `include_spaces` is nonzero.
     + */
    -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
    ++static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
     +{
    -+	strbuf_reset(sb);
     +	if (*p == '"') {
     +		if (unquote_c_style(sb, p, endp))
     +			die("Invalid %s: %s", field, command_buf.buf);
     +	} else {
    -+		if (allow_spaces)
    ++		if (include_spaces)
     +			*endp = p + strlen(p);
     +		else
     +			*endp = strchr(p, ' ');
    -+		if (*endp == p)
    -+			die("Missing %s: %s", field, command_buf.buf);
     +		strbuf_add(sb, p, *endp - p);
     +	}
     +}
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
      	struct object_id oid;
      	uint16_t mode, inline_data = 0;
     @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b)
    - 			die("Missing space after SHA1: %s", command_buf.buf);
      	}

    --	strbuf_reset(&uq);
    + 	strbuf_reset(&uq);
     -	if (!unquote_c_style(&uq, p, &endp)) {
     -		if (*endp)
     -			die("Garbage after path in: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      	static struct strbuf uq = STRBUF_INIT;
     -	const char *endp;

    --	strbuf_reset(&uq);
    + 	strbuf_reset(&uq);
     -	if (!unquote_c_style(&uq, p, &endp)) {
     -		if (*endp)
     -			die("Garbage after path in: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -	const char *endp;
      	struct tree_entry leaf;

    --	strbuf_reset(&s_uq);
    + 	strbuf_reset(&s_uq);
     -	if (!unquote_c_style(&s_uq, s, &endp)) {
     -		if (*endp != ' ')
     -			die("Missing space after source: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -		strbuf_add(&s_uq, s, endp - s);
     -	}
     +	parse_path_space(&s_uq, p, &p, "source");
    -+	parse_path_eol(&d_uq, p, "dest");
      	s = s_uq.buf;
    --
    +
     -	endp++;
     -	if (!*endp)
    --		die("Missing dest: %s", command_buf.buf);
    ++	if (!p)
    + 		die("Missing dest: %s", command_buf.buf);
     -
     -	d = endp;
    --	strbuf_reset(&d_uq);
    + 	strbuf_reset(&d_uq);
     -	if (!unquote_c_style(&d_uq, d, &endp)) {
     -		if (*endp)
     -			die("Garbage after dest in: %s", command_buf.buf);
     -		d = d_uq.buf;
     -	}
    ++	parse_path_eol(&d_uq, p, "dest");
     +	d = d_uq.buf;

      	memset(&leaf, 0, sizeof(leaf));
      	if (rename)
    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
    +
    + static void parse_ls(const char *p, struct branch *b)
      {
    ++	static struct strbuf uq = STRBUF_INIT;
      	struct tree_entry *root = NULL;
      	struct tree_entry leaf = {NULL};
    -+	static struct strbuf uq = STRBUF_INIT;

    - 	/* ls SP (<tree-ish> SP)? <path> */
    - 	if (*p == '"') {
     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
      			root->versions[1].mode = S_IFDIR;
      		load_tree(root);
    @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
     -			die("Garbage after path in: %s", command_buf.buf);
     -		p = uq.buf;
     -	}
    ++	strbuf_reset(&uq);
     +	parse_path_eol(&uq, p, "path");
     +	p = uq.buf;
      	tree_content_get(root, p, &leaf, 1);
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# Path parsing
     +#
     +# There are two sorts of ways a path can be parsed, depending on whether it is
    -+# the last field on the line. Additionally, ls without a <tree-ish> has a
    -+# special case. Test every occurrence of <path> in the grammar against every
    -+# error case.
    ++# the last field on the line. Additionally, ls without a <dataref> has a special
    ++# case. Test every occurrence of <path> in the grammar against every error case.
     +#
     +
     +#
     +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
     +# filerename (dest), and ls.
     +#
    -+# commit :301 from root -- modify hello.c
    ++# commit :301 from root -- modify hello.c (for setup)
     +# commit :302 from :301 -- modify $path
     +# commit :303 from :302 -- delete $path
     +# commit :304 from :301 -- copy hello.c $path
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# ls :305 $path
     +#
     +test_path_eol_success () {
    -+	test="$1" path="$2" unquoted_path="$3"
    ++	local test="$1" path="$2" unquoted_path="$3"
     +	test_expect_success "S: paths at EOL with $test must work" '
    ++		test_when_finished "git branch -D S-path-eol" &&
    ++
     +		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
     +		blob
     +		mark :401
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		hallo welt
     +		BLOB
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :301
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		COMMIT
     +		M 100644 :401 hello.c
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :302
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		M 100644 :402 '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :303
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :302
     +		D '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :304
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		C hello.c '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :305
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		git ls-tree $commit_r >tree_r.out &&
     +		test_cmp tree_r.exp tree_r.out &&
     +
    -+		test_cmp out tree_r.exp &&
    -+
    -+		git branch -D path-eol
    ++		test_cmp out tree_r.exp
     +	'
     +}
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +#
     +# Valid paths before a space: filecopy (source) and filerename (source).
     +#
    -+# commit :301 from root -- modify $path
    ++# commit :301 from root -- modify $path (for setup)
     +# commit :302 from :301 -- copy $path hello2.c
     +# commit :303 from :301 -- rename $path hello2.c
     +#
     +test_path_space_success () {
    -+	test="$1" path="$2" unquoted_path="$3"
    ++	local test="$1" path="$2" unquoted_path="$3"
     +	test_expect_success "S: paths before space with $test must work" '
    ++		test_when_finished "git branch -D S-path-space" &&
    ++
     +		git fast-import --export-marks=marks.out <<-EOF 2>err &&
     +		blob
     +		mark :401
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		hello world
     +		BLOB
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :301
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		COMMIT
     +		M 100644 :401 '"$path"'
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :302
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		C '"$path"' hello2.c
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :303
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +
     +		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
     +		git ls-tree $commit_r >tree_r.out &&
    -+		test_cmp tree_r.exp tree_r.out &&
    -+
    -+		git branch -D path-space
    ++		test_cmp tree_r.exp tree_r.out
     +	'
     +}
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# of <path> in the grammar against all error kinds.
     +#
     +test_path_fail () {
    -+	what="$1" path="$2" err_grep="$3"
    ++	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
     +	test_expect_success "S: $change with $what must fail" '
     +		test_must_fail git fast-import <<-EOF 2>err &&
     +		blob
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +}
     +
     +test_path_base_fail () {
    -+	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
    -+	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
    ++	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
     +}
     +test_path_eol_quoted_fail () {
    -+	test_path_base_fail
    -+	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
    -+	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    ++	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
    ++	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
     +}
     +test_path_eol_fail () {
    -+	test_path_eol_quoted_fail
    -+	test_path_fail 'empty unquoted path' '' "Missing $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
     +}
     +test_path_space_fail () {
    -+	test_path_base_fail
    -+	test_path_fail 'empty unquoted path' '' "Missing $field"
    -+	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    ++	test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
     +}
     +
    -+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
    -+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
    -+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
    -+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
    -+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
    -+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
    -+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
    ++test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
    ++test_path_eol_fail   filedelete       'D '           path   ''
    ++test_path_space_fail filecopy         'C '           source ' world.c'
    ++test_path_eol_fail   filecopy         'C hello.c '   dest   ''
    ++test_path_space_fail filerename       'R '           source ' world.c'
    ++test_path_eol_fail   filerename       'R hello.c '   dest   ''
    ++test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
     +
    -+# When 'ls' has no <tree-ish>, the <path> must be quoted.
    -+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
    -+test_path_eol_quoted_fail &&
    -+test_path_fail 'empty unquoted path' '' "Invalid dataref"
    ++# When 'ls' has no <dataref>, the <path> must be quoted.
    ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
     +
      ###
      ### series T (ls)
2:  a2aca9f9e6 ! 2:  82a6f53c13 fast-import: directly use strbufs for paths
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      			die("Missing space after SHA1: %s", command_buf.buf);
      	}

    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");

      	/* Git does not track empty, non-toplevel directories. */
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -	static struct strbuf uq = STRBUF_INIT;
     +	static struct strbuf path = STRBUF_INIT;

    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
     -	tree_content_remove(&b->branch_tree, p, NULL, 1);
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");
     +	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
      }
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     +	static struct strbuf dest = STRBUF_INIT;
      	struct tree_entry leaf;

    +-	strbuf_reset(&s_uq);
     -	parse_path_space(&s_uq, p, &p, "source");
    --	parse_path_eol(&d_uq, p, "dest");
     -	s = s_uq.buf;
    --	d = d_uq.buf;
    ++	strbuf_reset(&source);
     +	parse_path_space(&source, p, &p, "source");
    +
    + 	if (!p)
    + 		die("Missing dest: %s", command_buf.buf);
    +-	strbuf_reset(&d_uq);
    +-	parse_path_eol(&d_uq, p, "dest");
    +-	d = d_uq.buf;
    ++	strbuf_reset(&dest);
     +	parse_path_eol(&dest, p, "dest");

      	memset(&leaf, 0, sizeof(leaf));
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      		&leaf.versions[1].oid,
      		leaf.versions[1].mode,
      		leaf.tree);
    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
    +
    + static void parse_ls(const char *p, struct branch *b)
      {
    - 	struct tree_entry *root = NULL;
    - 	struct tree_entry leaf = {NULL};
     -	static struct strbuf uq = STRBUF_INIT;
     +	static struct strbuf path = STRBUF_INIT;
    + 	struct tree_entry *root = NULL;
    + 	struct tree_entry leaf = {NULL};

    - 	/* ls SP (<tree-ish> SP)? <path> */
    - 	if (*p == '"') {
     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
      			root->versions[1].mode = S_IFDIR;
      		load_tree(root);
      	}
    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
     -	tree_content_get(root, p, &leaf, 1);
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");
     +	tree_content_get(root, path.buf, &leaf, 1);
      	/*
3:  ecaf4883d1 < -:  ---------- fast-import: release unfreed strbufs
-:  ---------- > 3:  893bbf5e73 fast-import: allow unquoted empty path for root
4:  058a38416a ! 4:  cb05a184e6 fast-import: remove dead strbuf
    @@ Metadata
      ## Commit message ##
         fast-import: remove dead strbuf

    -    The strbuf in `note_change_n` has been unused since the function was
    +    The strbuf in `note_change_n` is to copy the remainder of `p` before
    +    potentially invalidating it when reading the next line. However, `p` is
    +    not used after that point. It has been unused since the function was
         created in a8dd2e7d2b (fast-import: Add support for importing commit
         notes, 2009-10-09) and looks to be a fossil from adapting
    -    `note_change_m`. Remove it.
    +    `file_change_m`. Remove it.

         Signed-off-by: Thalia Archibald <thalia@archibald.dev>

5:  a5e8df0759 < -:  ---------- fast-import: document C-style escapes for paths
6:  9792940ba9 < -:  ---------- fast-import: forbid escaped NUL in paths
-:  ---------- > 5:  1f34b632d7 fast-import: improve documentation for unquoted paths
-:  ---------- > 6:  82a4da68af fast-import: document C-style escapes for paths
-:  ---------- > 7:  c087c6a860 fast-import: forbid escaped NUL in paths
-:  ---------- > 8:  a503c55b83 fast-import: make comments more precise
--
2.44.0



* [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-04-01  9:02   ` Thalia Archibald
  2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-01  9:03   ` [PATCH v2 2/8] fast-import: directly use strbufs for paths Thalia Archibald
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:02 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed or not checked.

<path> appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes to the end of
   the line (including any number of SP).
2. For filecopy (source) and filerename (source):
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes up to, but not
   including, the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is forbidden.
4. For ls:
   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).
5. For ls-commit:
   Unquote <path> and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine instead of matching Git's, which could silently
introduce escapes that would be incorrectly parsed due to this and lead
to data corruption.

The documentation states “To use a source path that contains SP the path
must be quoted.”, so it is expected that some implementations depend on
unquoted spaces being allowed in paths in the final position. Thus there
are two documented ways to parse paths; simplify the implementation to
match them.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).

2. `parse_path_space` for filecopy (source) and filerename (source):

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes up to, but not including, the
   next SP. It must be followed by SP.

There remain two special cases: The dest <path> in filecopy and
filerename cannot be an unquoted empty string (this will be addressed
subsequently) and <path> in ls-commit must be quoted to disambiguate it
from ls.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
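As a reading aid for the two modes described in the commit message, here
is a minimal standalone sketch in plain C. It is not the code in this
patch: the `toy_*` names, the fixed-size output buffer, and the result
enum are hypothetical, and `toy_unquote` is a deliberately reduced
stand-in for Git's `unquote_c_style` that accepts only `\\` and `\"`.
The sketch also checks explicitly for a missing SP in the unquoted case.

```c
#include <string.h>

enum path_result { PATH_OK, PATH_INVALID, PATH_GARBAGE, PATH_MISSING_SPACE };

/*
 * Reduced stand-in for unquote_c_style(): only \\ and \" are accepted,
 * every other escape (for example \xff) is an error. On success, returns 0
 * and sets *endp to the byte after the closing quote.
 */
int toy_unquote(char *out, size_t outsz, const char *p, const char **endp)
{
	size_t n = 0;

	if (*p++ != '"')
		return -1;
	while (*p && *p != '"') {
		char c = *p++;
		if (c == '\\') {
			c = *p++;
			if (c != '\\' && c != '"')
				return -1;	/* unsupported escape */
		}
		if (n + 1 >= outsz)
			return -1;		/* toy buffer too small */
		out[n++] = c;
	}
	if (*p != '"')
		return -1;			/* unclosed quote */
	out[n] = '\0';
	*endp = p + 1;
	return 0;
}

/* Mode 1: <path> is the last field; unquoted paths run to end of line. */
enum path_result toy_parse_path_eol(char *out, size_t outsz, const char *p)
{
	const char *end;

	if (*p == '"') {
		if (toy_unquote(out, outsz, p, &end))
			return PATH_INVALID;
		return *end ? PATH_GARBAGE : PATH_OK;
	}
	if (strlen(p) >= outsz)
		return PATH_INVALID;
	strcpy(out, p);				/* literal bytes, SP included */
	return PATH_OK;
}

/* Mode 2: <path> is followed by SP; unquoted paths stop at the first SP. */
enum path_result toy_parse_path_space(char *out, size_t outsz,
				      const char *p, const char **rest)
{
	const char *end;

	if (*p == '"') {
		if (toy_unquote(out, outsz, p, &end))
			return PATH_INVALID;
	} else {
		end = strchr(p, ' ');
		if (!end)
			return PATH_MISSING_SPACE;
		if ((size_t)(end - p) >= outsz)
			return PATH_INVALID;
		memcpy(out, p, end - p);
		out[end - p] = '\0';
	}
	if (*end != ' ')
		return PATH_MISSING_SPACE;
	*rest = end + 1;			/* first byte of the next field */
	return PATH_OK;
}
```

Under these assumptions, `toy_parse_path_eol` takes ` hello world.c `
verbatim but rejects `"hello\xff"` and `"hello.c"x`, and
`toy_parse_path_space` rejects `"hello.c"x world.c`, which roughly
mirrors the failure cases exercised by the new tests.
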
 builtin/fast-import.c  | 102 ++++++++++-------
 t/t9300-fast-import.sh | 251 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 309 insertions(+), 44 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 782bda007c..6f9048a037 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2258,10 +2258,54 @@ static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. It may be quoted with escape sequences
+ * or unquoted without escape sequences. When unquoted, it may only contain a
+ * space if `include_spaces` is nonzero.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
+{
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		if (include_spaces)
+			*endp = p + strlen(p);
+		else
+			*endp = strchr(p, ' ');
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. It may contain spaces even when unquoted.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * It may not contain spaces when unquoted. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2299,11 +2343,8 @@ static void file_change_m(const char *p, struct branch *b)
 	}
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2367,48 +2408,29 @@ static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
 	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
 	s = s_uq.buf;
 
-	endp++;
-	if (!*endp)
+	if (!p)
 		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
 	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	parse_path_eol(&d_uq, p, "dest");
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3152,6 +3174,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
+	static struct strbuf uq = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3168,16 +3191,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	strbuf_reset(&uq);
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 60e30fed3c..0fb5612b07 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2142,6 +2142,7 @@ test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2790,7 +2791,7 @@ test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3060,6 +3061,254 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <dataref> has a special
+# case. Test every occurrence of <path> in the grammar against every error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c (for setup)
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		test_when_finished "git branch -D S-path-eol" &&
+
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/S-path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/S-path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 '"$path"'
+
+		commit refs/heads/S-path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D '"$path"'
+
+		commit refs/heads/S-path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c '"$path"'
+
+		commit refs/heads/S-path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c '"$path"'
+
+		ls :305 '"$path"'
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path (for setup)
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		test_when_finished "git branch -D S-path-space" &&
+
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 '"$path"'
+
+		commit refs/heads/S-path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C '"$path"' hello2.c
+
+		commit refs/heads/S-path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R '"$path"' hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		'"$prefix$path$suffix"'
+		EOF
+
+		test_grep '"'$err_grep'"' err
+	'
+}
+
+test_path_base_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
+	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
+	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
+	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
+}
+test_path_eol_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
+}
+test_path_space_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
+	test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
+}
+
+test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
+test_path_eol_fail   filedelete       'D '           path   ''
+test_path_space_fail filecopy         'C '           source ' world.c'
+test_path_eol_fail   filecopy         'C hello.c '   dest   ''
+test_path_space_fail filerename       'R '           source ' world.c'
+test_path_eol_fail   filerename       'R hello.c '   dest   ''
+test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
+
+# When 'ls' has no <dataref>, the <path> must be quoted.
+test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
+
 ###
 ### series T (ls)
 ###
-- 
2.44.0




* [PATCH v2 2/8] fast-import: directly use strbufs for paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-01  9:02   ` [PATCH v2 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-01  9:03   ` [PATCH v2 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.

Directly use the strbufs now instead of an alias.

Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.

Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
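To illustrate why an alias into the command line had to be copied before
reading further input, while an owned buffer does not, here is a small
standalone sketch. It assumes, as a simplification, that the path
pointer refers into a single line buffer that is reused for each line
read. `toy_line_buf`, `toy_read_line`, `toy_path_alias`, and
`toy_path_copy` are hypothetical names, not fast-import functions.

```c
#include <string.h>

/* Hypothetical stand-in for a reusable command line buffer. */
static char toy_line_buf[128];

/* Overwrite the shared line buffer with the next input line. */
void toy_read_line(const char *text)
{
	strncpy(toy_line_buf, text, sizeof(toy_line_buf) - 1);
	toy_line_buf[sizeof(toy_line_buf) - 1] = '\0';
}

/* Alias: points into the line buffer, valid only until the next read. */
const char *toy_path_alias(void)
{
	return toy_line_buf + strlen("M 100644 inline ");
}

/* Owned copy: stays valid no matter what is read afterwards. */
void toy_path_copy(char *out, size_t outsz)
{
	strncpy(out, toy_path_alias(), outsz - 1);
	out[outsz - 1] = '\0';
}
```

After `toy_read_line("M 100644 inline hello.c")`, both the alias and the
copy read `hello.c`. Once a second line such as `data 12` is read, only
the copy still does. Always parsing into the strbuf corresponds to
always taking the owned copy, so no separate copy step is needed later.
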
 builtin/fast-import.c | 64 ++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 6f9048a037..fad9324e59 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2305,7 +2305,7 @@ static void parse_path_space(struct strbuf *sb, const char *p, const char **endp
 
 static void file_change_m(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2342,13 +2342,12 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
 
 	/* Git does not track empty, non-toplevel directories. */
-	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
-		tree_content_remove(&b->branch_tree, p, NULL, 0);
+	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
+		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
 		return;
 	}
 
@@ -2369,10 +2368,6 @@ static void file_change_m(const char *p, struct branch *b)
 		if (S_ISDIR(mode))
 			die("Directories cannot be specified 'inline': %s",
 				command_buf.buf);
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		while (read_next_command() != EOF) {
 			const char *v;
 			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
@@ -2398,55 +2393,51 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*p) {
+	if (!*path.buf) {
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
 		return;
 	}
-	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
+	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
 }
 
 static void file_change_d(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_remove(&b->branch_tree, p, NULL, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *s, *d;
-	static struct strbuf s_uq = STRBUF_INIT;
-	static struct strbuf d_uq = STRBUF_INIT;
+	static struct strbuf source = STRBUF_INIT;
+	static struct strbuf dest = STRBUF_INIT;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	parse_path_space(&s_uq, p, &p, "source");
-	s = s_uq.buf;
+	strbuf_reset(&source);
+	parse_path_space(&source, p, &p, "source");
 
 	if (!p)
 		die("Missing dest: %s", command_buf.buf);
-	strbuf_reset(&d_uq);
-	parse_path_eol(&d_uq, p, "dest");
-	d = d_uq.buf;
+	strbuf_reset(&dest);
+	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
-		tree_content_remove(&b->branch_tree, s, &leaf, 1);
+		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
 	else
-		tree_content_get(&b->branch_tree, s, &leaf, 1);
+		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
-		die("Path %s not in branch", s);
-	if (!*d) {	/* C "path/to/subdir" "" */
+		die("Path %s not in branch", source.buf);
+	if (!*dest.buf) {	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
 		return;
 	}
-	tree_content_set(&b->branch_tree, d,
+	tree_content_set(&b->branch_tree, dest.buf,
 		&leaf.versions[1].oid,
 		leaf.versions[1].mode,
 		leaf.tree);
@@ -3174,7 +3165,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3191,10 +3182,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_get(root, p, &leaf, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_get(root, path.buf, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
 	 * until it is saved.  Save, for simplicity.
@@ -3202,7 +3192,7 @@ static void parse_ls(const char *p, struct branch *b)
 	if (S_ISDIR(leaf.versions[1].mode))
 		store_tree(&leaf);
 
-	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
+	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
 	if (leaf.tree)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
-- 
2.44.0




* [PATCH v2 3/8] fast-import: allow unquoted empty path for root
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-01  9:02   ` [PATCH v2 1/8] fast-import: tighten path unquoting Thalia Archibald
  2024-04-01  9:03   ` [PATCH v2 2/8] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-01  9:03   ` [PATCH v2 4/8] fast-import: remove dead strbuf Thalia Archibald
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

Ever since filerename was added in f39a946a1f (Support wholesale
directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
(Teach fast-import to recursively copy files/directories, 2007-07-15),
both have produced an error when the destination path is empty. Later,
when support for targeting the root directory with an empty string was
added in 2794ad5244 (fast-import: Allow filemodify to set the root,
2010-10-10), this had the effect of allowing the quoted empty string
(`""`), but forbidding its unquoted variant (``). This seems to have
been intended as simple data validation for parsing two paths, rather
than a syntax restriction, because it was not extended to the other
operations.

All other occurrences of paths (in filemodify, filedelete, the source of
filecopy and filerename, and ls) allow both.

For most of this feature's lifetime, the documentation has not
prescribed the use of quoted empty strings. In e5959106d6
(Documentation/fast-import: put explanation of M 040000 <dataref> "" in
context, 2011-01-15), its documentation was changed from “`<path>` may
also be an empty string (`""`) to specify the root of the tree” to “The
root of the tree can be represented by an empty string as `<path>`”.

Thus, we can assume that some front-ends have depended on this behavior.

Remove this restriction for the destination paths of filecopy and
filerename and change tests targeting the root to test `""` and ``.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  |   5 +-
 t/t9300-fast-import.sh | 363 +++++++++++++++++++++--------------------
 2 files changed, 191 insertions(+), 177 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index fad9324e59..58cc8d4ede 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2416,11 +2416,8 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 	struct tree_entry leaf;
 
 	strbuf_reset(&source);
-	parse_path_space(&source, p, &p, "source");
-
-	if (!p)
-		die("Missing dest: %s", command_buf.buf);
 	strbuf_reset(&dest);
+	parse_path_space(&source, p, &p, "source");
 	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 0fb5612b07..635b1b9af7 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1059,30 +1059,33 @@ test_expect_success 'M: rename subdirectory to new subdirectory' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'M: rename root to subdirectory' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/M4
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	rename root
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "M: rename root ($root) to subdirectory" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/M4
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		rename root
+		COMMIT
 
-	from refs/heads/M2^0
-	R "" sub
+		from refs/heads/M2^0
+		R '"$root"' sub
 
-	INPUT_END
+		INPUT_END
 
-	cat >expect <<-EOF &&
-	:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
-	:100755 100755 $f4id $f4id R100	file4	sub/file4
-	:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
-	:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
-	:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
-	EOF
-	git fast-import <input &&
-	git diff-tree -M -r M4^ M4 >actual &&
-	compare_diff_raw expect actual
-'
+		cat >expect <<-EOF &&
+		:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
+		:100755 100755 $f4id $f4id R100	file4	sub/file4
+		:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
+		:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
+		:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
+		EOF
+		git fast-import <input &&
+		git diff-tree -M -r M4^ M4 >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series N
@@ -1259,49 +1262,52 @@ test_expect_success PIPE 'N: empty directory reads as missing' '
 	test_cmp expect actual
 '
 
-test_expect_success 'N: copy root directory by tree hash' '
-	cat >expect <<-EOF &&
-	:100755 000000 $newf $zero D	file3/newf
-	:100644 000000 $oldf $zero D	file3/oldf
-	EOF
-	root=$(git rev-parse refs/heads/branch^0^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N6
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by tree hash
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy root ($root) by tree hash" '
+		cat >expect <<-EOF &&
+		:100755 000000 $newf $zero D	file3/newf
+		:100644 000000 $oldf $zero D	file3/oldf
+		EOF
+		root_tree=$(git rev-parse refs/heads/branch^0^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N6
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by tree hash
+		COMMIT
 
-	from refs/heads/branch^0
-	M 040000 $root ""
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		M 040000 $root_tree '"$root"'
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
+		compare_diff_raw expect actual
+	'
 
-test_expect_success 'N: copy root by path' '
-	cat >expect <<-EOF &&
-	:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
-	:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
-	:100755 100755 $f4id $f4id C100	file4	oldroot/file4
-	:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
-	:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
-	EOF
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N-copy-root-path
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by (empty) path
-	COMMIT
+	test_expect_success "N: copy root ($root) by path" '
+		cat >expect <<-EOF &&
+		:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
+		:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
+		:100755 100755 $f4id $f4id C100	file4	oldroot/file4
+		:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
+		:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
+		EOF
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N-copy-root-path
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by (empty) path
+		COMMIT
 
-	from refs/heads/branch^0
-	C "" oldroot
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		C '"$root"' oldroot
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 test_expect_success 'N: delete directory by copying' '
 	cat >expect <<-\EOF &&
@@ -1431,98 +1437,102 @@ test_expect_success 'N: reject foo/ syntax in ls argument' '
 	INPUT_END
 '
 
-test_expect_success 'N: copy to root by id and modify' '
-	echo "hello, world" >expect.foo &&
-	echo hello >expect.bar &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N7
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy to root ($root) by id and modify" '
+		echo "hello, world" >expect.foo &&
+		echo hello >expect.bar &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N7
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N7:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N8
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N7:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N8
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 644 inline foo/foo
-	data <<EOF
-	hello, world
-	EOF
-	INPUT_END
-	git show N8:foo/foo >actual.foo &&
-	git show N8:foo/bar >actual.bar &&
-	test_cmp expect.foo actual.foo &&
-	test_cmp expect.bar actual.bar
-'
+		M 040000 $tree '"$root"'
+		M 644 inline foo/foo
+		data <<EOF
+		hello, world
+		EOF
+		INPUT_END
+		git show N8:foo/foo >actual.foo &&
+		git show N8:foo/bar >actual.bar &&
+		test_cmp expect.foo actual.foo &&
+		test_cmp expect.bar actual.bar
+	'
 
-test_expect_success 'N: extract subtree' '
-	branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N9
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	extract subtree branch:newdir
-	COMMIT
+	test_expect_success "N: extract subtree to the root ($root)" '
+		branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N9
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		extract subtree branch:newdir
+		COMMIT
 
-	M 040000 $branch ""
-	C "newdir" ""
-	INPUT_END
-	git fast-import <input &&
-	git diff --exit-code branch:newdir N9
-'
+		M 040000 $branch '"$root"'
+		C "newdir" '"$root"'
+		INPUT_END
+		git fast-import <input &&
+		git diff --exit-code branch:newdir N9
+	'
 
-test_expect_success 'N: modify subtree, extract it, and modify again' '
-	echo hello >expect.baz &&
-	echo hello, world >expect.qux &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N10
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+	test_expect_success "N: modify subtree, extract it to the root ($root), and modify again" '
+		echo hello >expect.baz &&
+		echo hello, world >expect.qux &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N10
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar/baz
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar/baz
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N10:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N11
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N10:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N11
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 100644 inline foo/bar/qux
-	data <<EOF
-	hello, world
-	EOF
-	R "foo" ""
-	C "bar/qux" "bar/quux"
-	INPUT_END
-	git show N11:bar/baz >actual.baz &&
-	git show N11:bar/qux >actual.qux &&
-	git show N11:bar/quux >actual.quux &&
-	test_cmp expect.baz actual.baz &&
-	test_cmp expect.qux actual.qux &&
-	test_cmp expect.qux actual.quux'
+		M 040000 $tree '"$root"'
+		M 100644 inline foo/bar/qux
+		data <<EOF
+		hello, world
+		EOF
+		R "foo" '"$root"'
+		C "bar/qux" "bar/quux"
+		INPUT_END
+		git show N11:bar/baz >actual.baz &&
+		git show N11:bar/qux >actual.qux &&
+		git show N11:bar/quux >actual.quux &&
+		test_cmp expect.baz actual.baz &&
+		test_cmp expect.qux actual.qux &&
+		test_cmp expect.qux actual.quux
+	'
+done
 
 ###
 ### series O
@@ -3067,6 +3077,7 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 # There are two sorts of ways a path can be parsed, depending on whether it is
 # the last field on the line. Additionally, ls without a <dataref> has a special
 # case. Test every occurrence of <path> in the grammar against every error case.
+# Paths for the root (empty strings) are tested elsewhere.
 #
 
 #
@@ -3314,16 +3325,19 @@ test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
 ###
 # Setup is carried over from series S.
 
-test_expect_success 'T: ls root tree' '
-	sed -e "s/Z\$//" >expect <<-EOF &&
-	040000 tree $(git rev-parse S^{tree})	Z
-	EOF
-	sha1=$(git rev-parse --verify S) &&
-	git fast-import --import-marks=marks <<-EOF >actual &&
-	ls $sha1 ""
-	EOF
-	test_cmp expect actual
-'
+for root in '""' ''
+do
+	test_expect_success "T: ls root ($root) tree" '
+		sed -e "s/Z\$//" >expect <<-EOF &&
+		040000 tree $(git rev-parse S^{tree})	Z
+		EOF
+		sha1=$(git rev-parse --verify S) &&
+		git fast-import --import-marks=marks <<-EOF >actual &&
+		ls $sha1 $root
+		EOF
+		test_cmp expect actual
+	'
+done
 
 test_expect_success 'T: delete branch' '
 	git branch to-delete &&
@@ -3425,30 +3439,33 @@ test_expect_success 'U: validate directory delete result' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'U: filedelete root succeeds' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/U
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	must succeed
-	COMMIT
-	from refs/heads/U^0
-	D ""
+for root in '""' ''
+do
+	test_expect_success "U: filedelete root ($root) succeeds" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/U-delete-root
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		must succeed
+		COMMIT
+		from refs/heads/U^0
+		D '"$root"'
 
-	INPUT_END
+		INPUT_END
 
-	git fast-import <input
-'
+		git fast-import <input
+	'
 
-test_expect_success 'U: validate root delete result' '
-	cat >expect <<-EOF &&
-	:100644 000000 $f7id $ZERO_OID D	hello.c
-	EOF
+	test_expect_success "U: validate root ($root) delete result" '
+		cat >expect <<-EOF &&
+		:100644 000000 $f7id $ZERO_OID D	hello.c
+		EOF
 
-	git diff-tree -M -r U^1 U >actual &&
+		git diff-tree -M -r U U-delete-root >actual &&
 
-	compare_diff_raw expect actual
-'
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series V (checkpoint)
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 4/8] fast-import: remove dead strbuf
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (2 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-01  9:03   ` [PATCH v2 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

The strbuf in `note_change_n` is to copy the remainder of `p` before
potentially invalidating it when reading the next line. However, `p` is
not used after that point. It has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 58cc8d4ede..fc6eeaf89c 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2442,7 +2442,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
 {
-	static struct strbuf uq = STRBUF_INIT;
 	struct object_entry *oe;
 	struct branch *s;
 	struct object_id oid, commit_oid;
@@ -2507,10 +2506,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
 		die("Invalid ref name or SHA1 expression: %s", p);
 
 	if (inline_data) {
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		read_next_command();
 		parse_and_store_blob(&last_blob, &oid, 0);
 	} else if (oe) {
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 5/8] fast-import: improve documentation for unquoted paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (3 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 4/8] fast-import: remove dead strbuf Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-01  9:03   ` [PATCH v2 6/8] fast-import: document C-style escapes for paths Thalia Archibald
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

The documentation describes what cannot be in an unquoted path, but not
what an unquoted path is. Reframe it as a definition of unquoted paths.
The requirement that it not start with `"` is the core element that
implies the rest.

The restriction that the source paths of filecopy and filerename cannot
contain SP is only stated in their respective sections. Restate it in
the <path> section.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..f26d7a8571 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
 In both formats `<path>` is the complete path of the file to be added
 (if not already existing) or modified (if already existing).
 
-A `<path>` string must use UNIX-style directory separators (forward
-slash `/`), may contain any byte other than `LF`, and must not
-start with double quote (`"`).
+A `<path>` can be written as unquoted bytes or a C-style quoted string:
 
-A path can use C-style string quoting; this is accepted in all cases
-and mandatory if the filename starts with double quote or contains
-`LF`. In C-style quoting, the complete name should be surrounded with
+When a `<path>` does not start with double quote (`"`), it is an
+unquoted string and is parsed as literal bytes without any escape
+sequences. However, if the filename contains `LF` or starts with double
+quote, it must be written as a quoted string. Additionally, the source
+`<path>` in `filecopy` or `filerename` must be quoted if it contains SP.
+
+A `<path>` can use C-style string quoting; this is accepted in all cases
+and mandatory in the cases where the filename cannot be represented as
+an unquoted string. In C-style quoting, the complete name should be surrounded with
 double quotes, and any `LF`, backslash, or double quote characters
 must be escaped by preceding them with a backslash (e.g.,
 `"path/with\n, \\ and \" in it"`).
 
-The value of `<path>` must be in canonical form. That is it must not:
+A `<path>` must use UNIX-style directory separators (forward slash `/`)
+and must be in canonical form. That is it must not:
 
 * contain an empty directory component (e.g. `foo//bar` is invalid),
 * end with a directory separator (e.g. `foo/` is invalid),
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 6/8] fast-import: document C-style escapes for paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (4 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-01  9:03   ` [PATCH v2 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

Simply saying “C-style” string quoting is imprecise, as only a subset of
C escapes are supported. Document the exact escapes.
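
As a rough illustration of the subset described here, the following is
a minimal sketch of a decoder that accepts only the documented escapes.
It is not Git's unquote_c_style(); the name sketch_unquote, the
fixed-size output buffer, and the rule that an octal escape starts with
a digit from 0 to 3 are assumptions made for this example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not Git's unquote_c_style(): decode a quoted
 * path such as "a\tb" into out (capacity cap, including the NUL
 * terminator). Returns the decoded length, or -1 on any error.
 */
long sketch_unquote(const char *in, char *out, size_t cap)
{
	size_t n = 0;

	if (cap == 0 || *in++ != '"')
		return -1;		/* must start with a double quote */
	for (;;) {
		unsigned char c = (unsigned char)*in++;
		int i;

		if (c == '\0')
			return -1;	/* unclosed quote */
		if (c == '"')
			break;		/* closing quote */
		if (c == '\\') {
			c = (unsigned char)*in++;
			switch (c) {
			case 'a': c = '\a'; break;
			case 'b': c = '\b'; break;
			case 'f': c = '\f'; break;
			case 'n': c = '\n'; break;
			case 'r': c = '\r'; break;
			case 't': c = '\t'; break;
			case 'v': c = '\v'; break;
			case '\\': case '"':
				break;	/* the byte stands for itself */
			case '0': case '1': case '2': case '3':
				/* exactly three octal digits, e.g. \033 */
				c -= '0';
				for (i = 0; i < 2; i++) {
					if (*in < '0' || *in > '7')
						return -1;
					c = (unsigned char)(c * 8 + (*in++ - '0'));
				}
				break;
			default:
				return -1;	/* e.g. \x hex escapes are rejected */
			}
		}
		if (n + 1 >= cap)
			return -1;	/* no room left for the terminator */
		out[n++] = (char)c;
	}
	out[n] = '\0';
	return (long)n;
}
```

With this sketch, the input `"\150\151\056\143"` decodes to `hi.c`,
while `"hello\xff"` and an unclosed `"hello.c` are both rejected rather
than being passed through as literal bytes.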

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 12 ++++++++----
 t/t9300-fast-import.sh            | 10 ++++++----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index f26d7a8571..db53b50268 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -640,10 +640,14 @@ quote, it must be written as a quoted string. Additionally, the source
 
 A `<path>` can use C-style string quoting; this is accepted in all cases
 and mandatory in the cases where the filename cannot be represented as
-an unquoted string. In C-style quoting, the complete name should be surrounded with
-double quotes, and any `LF`, backslash, or double quote characters
-must be escaped by preceding them with a backslash (e.g.,
-`"path/with\n, \\ and \" in it"`).
+an unquoted string. In C-style quoting, the complete filename is
+surrounded with double quote (`"`) and certain characters must be
+escaped by preceding them with a backslash: `LF` is written as `\n`,
+backslash as `\\`, and double quote as `\"`. Some characters may
+optionally be written with escape sequences: `\a` for bell, `\b` for
+backspace, `\f` for form feed, `\n` for line feed, `\r` for carriage
+return, `\t` for horizontal tab, and `\v` for vertical tab. Any byte can
+be written with 3-digit octal codes (e.g., `\033`).
 
 A `<path>` must use UNIX-style directory separators (forward slash `/`)
 and must be in canonical form. That is it must not:
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 635b1b9af7..e10962dffe 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3185,8 +3185,9 @@ test_path_eol_success () {
 	'
 }
 
-test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
-test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '
+test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
 
 #
 # Valid paths before a space: filecopy (source) and filerename (source).
@@ -3250,8 +3251,9 @@ test_path_space_success () {
 	'
 }
 
-test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
-test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
+test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'
 
 #
 # Test a single commit change with an invalid path. Run it with all occurrences
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 7/8] fast-import: forbid escaped NUL in paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (5 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 6/8] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-01  9:03   ` [PATCH v2 8/8] fast-import: make comments more precise Thalia Archibald
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.

When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.

fast-import still has other issues with NUL, but those will be addressed
later.
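
The check relies on the buffer carrying an explicit length: if the
length seen by strlen() is shorter than the recorded one, a NUL byte
must sit somewhere inside. A minimal sketch of that idea follows; the
struct sketch_buf and the function name are hypothetical stand-ins for
a length-counted buffer, not Git's strbuf API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a length-counted buffer such as a strbuf. */
struct sketch_buf {
	const char *buf;	/* bytes, always followed by a terminating NUL */
	size_t len;		/* number of bytes, excluding the terminator */
};

/* Return 1 if a NUL byte is embedded before the recorded end, else 0. */
int sketch_has_embedded_nul(const struct sketch_buf *sb)
{
	/* strlen() stops at the first NUL, so a mismatch means truncation. */
	return strlen(sb->buf) != sb->len;
}
```

For a buffer holding the 11 bytes of `hello`, NUL, `world`, strlen()
reports 5 while the recorded length is 11, so the path would be
reported as containing NUL instead of being silently cut to `hello`.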

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 1 +
 builtin/fast-import.c             | 2 ++
 t/t9300-fast-import.sh            | 1 +
 3 files changed, 4 insertions(+)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index db53b50268..edda30f90c 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -660,6 +660,7 @@ and must be in canonical form. That is it must not:
 
 The root of the tree can be represented by an empty string as `<path>`.
 
+`<path>` cannot contain NUL, either literally or escaped as `\000`.
 It is recommended that `<path>` always be encoded using UTF-8.
 
 `filedelete`
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index fc6eeaf89c..9d0f53fe04 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2268,6 +2268,8 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp, int
 	if (*p == '"') {
 		if (unquote_c_style(sb, p, endp))
 			die("Invalid %s: %s", field, command_buf.buf);
+		if (strlen(sb->buf) != sb->len)
+			die("NUL in %s: %s", field, command_buf.buf);
 	} else {
 		if (include_spaces)
 			*endp = p + strlen(p);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index e10962dffe..794a96df38 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3294,6 +3294,7 @@ test_path_base_fail () {
 	local change="$1" prefix="$2" field="$3" suffix="$4"
 	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
 	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
 }
 test_path_eol_quoted_fail () {
 	local change="$1" prefix="$2" field="$3" suffix="$4"
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 8/8] fast-import: make comments more precise
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (6 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-01  9:03   ` Thalia Archibald
  2024-04-07 21:19   ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
  9 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Elijah Newren, Thalia Archibald

The former is somewhat imprecise. The latter became out of sync with the
behavior in e814c39c2f (fast-import: refactor parsing of spaces,
2014-06-18).

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 9d0f53fe04..9b66ffd2d0 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2210,7 +2210,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2243,8 +2243,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH 1/6] fast-import: tighten parsing of paths
       [not found]     ` <E01C617F-3720-42C0-83EE-04BB01643C86@archibald.dev>
@ 2024-04-01  9:05       ` Thalia Archibald
  0 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:05 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Elijah Newren

(Sorry for first sending as HTML)

On Mar 28, 2024, at 01:21, Patrick Steinhardt <ps@pks.im> wrote:
> 
> So this is part of the "filemodify" section with the following syntax:
> 
>    'M' SP <mode> SP <dataref> SP <path> LF
> 
> The way I interpret this change is that <path> could previously be empty
> (`SP LF`), but now it needs to be quoted (`SP '"' '"' LF). This seems to
> be related to cases (1) and (3) of your commit messages, where
> "filemodify" could contain an unquoted empty string whereas "filecopy"
> and "filerename" would complain about such an unquoted string.
> 
> In any case I don't see a strong argument why exactly it should be
> forbidden to have an unquoted empty path here, and I do wonder whether
> it would break existing writers of the format when we retroactively
> tighten this case. Isn't it possible to instead loosen it such that all
> three of the above actions know to handle unquoted empty paths?

At first, I strongly thought that we should forbid this case of unquoted empty
paths. It's a somewhat peculiar case in that it refers to the root of the repo
and few front-ends use it. I surveyed git fast-export, git-filter-repo,
Reposurgeon, hg-fast-export, cvs-fast-export (by Eric S. Raymond),
cvs-fast-export (by Roger Vaughn), svn2git, bzr-fastexport, and bk fast-export,
but none of them ever target the root of the repo. I assumed that if an unquoted
empty path was ever emitted, it was likely a bug that should not be accepted
(e.g., a null byte array somehow).

However, most occurrences of <path> in the grammar have allowed unquoted empty
strings to mean the root for 14 years and documentation has implied that it's
allowed for 13 years. It's just the two cases of the destination paths of
filecopy and filerename that don't allow it, and those are less-used operations,
so front-ends may never encounter that error.

Some assumed errors in emitting empty paths are caught by validation with file
modes, so even if it's loosened it's still fairly safe. filemodify must be
040000 when it targets the root, and filecopy and filerename to the root must
have a source path that's a directory. The worst case is filedelete
unintentionally targeting the root, but that's been allowed to be an unquoted
empty path for almost its entire lifetime, so I don't think it should be changed.

I've changed it to allow unquoted empty paths for all operations in patch v2
3/8 (fast-import: allow unquoted empty path for root).
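
To make the loosened rule concrete, here is a minimal sketch of how an
unquoted path field could be delimited once an empty path is accepted.
It is not the code from the patch: the helper name
sketch_unquoted_path_len is hypothetical, and treating a missing space
as "path runs to end of line" is an assumption of this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the length of the unquoted path that
 * starts at p. With include_spaces nonzero the path runs to the end
 * of the line; otherwise it stops before the first space. A result
 * of 0 is an empty path, which names the root of the tree.
 */
size_t sketch_unquoted_path_len(const char *p, int include_spaces)
{
	const char *end;

	if (include_spaces)
		return strlen(p);	/* last field: spaces are part of the path */
	end = strchr(p, ' ');
	return end ? (size_t)(end - p) : strlen(p);
}
```

Under these assumptions, the remainder ` sub` of a command such as
`R  sub` yields a source path of length 0 (the root) followed by the
destination `sub`, and an empty final field likewise yields length 0
instead of an error.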

> This is loosening the condition so that we also accept unquoted paths
> now. Okay.

On the contrary, v1 tightens all paths to forbid unquoted empty strings. v2 now
allows it and should make that change more clear.

> On Fri, Mar 22, 2024 at 12:03:18AM +0000, Thalia Archibald wrote:
>> +/*
>> + * Parse the path string into the strbuf. It may be quoted with escape sequences
>> + * or unquoted without escape sequences. When unquoted, it may only contain a
>> + * space if `allow_spaces` is nonzero.
>> + */
>> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
>> +{
>> + strbuf_reset(sb);
>> + if (*p == '"') {
>> + if (unquote_c_style(sb, p, endp))
>> + die("Invalid %s: %s", field, command_buf.buf);
>> + } else {
>> + if (allow_spaces)
>> + *endp = p + strlen(p);
> 
> I wonder whether `stop_at_unquoted_space` might be more telling. It's
> not like we disallow spaces here, it's that we treat them as the
> separator to the next field.

I agree, but I’d rather something shorter, so I’ve changed it to `include_spaces`.

>> + else
>> + *endp = strchr(p, ' ');
>> + if (*endp == p)
>> + die("Missing %s: %s", field, command_buf.buf);
> 
> Error messages should start with a lower-case letter and be
> translateable. But these are simply moved over from the previous code,
> so I don't mind much if you want to keep them as-is.
> 
>> + strbuf_add(sb, p, *endp - p);
>> + }
>> +}


fast-import isn’t a porcelain command, AFAIK, so I thought the convention is
that its output isn't translated?

From po/README.md:
> 
> The output from Git's plumbing utilities will primarily be read by
> programs and would break scripts under non-C locales if it was
> translated. Plumbing strings should not be translated, since
> they're part of Git's API.


I would, however, like to improve its error messages. For example, diagnosing
errors more precisely or changing the outdated “GIT” to “Git”.

To what extent should the exact error messages be considered part of Git's API?

Thalia


* Re: [PATCH 3/6] fast-import: release unfreed strbufs
  2024-03-28  8:21   ` Patrick Steinhardt
@ 2024-04-01  9:06     ` Thalia Archibald
  0 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:06 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Elijah Newren

(Resending as plain text)

On Mar 28, 2024, at 01:21, Patrick Steinhardt <ps@pks.im> wrote:
> I was about to propose that we should likely also change all of these
> static variables to be local instead. I don't think that we use the
> variables after the function calls. But now that I see that we do it
> like this in all of these helpers I think what's going on is that this
> is a memory optimization to avoid reallocating buffers all the time.
> 
> Ugly, but so be it. We could refactor the code to pass in scratch
> buffers from the outside to remove those static variables. But that
> certainly would be a bigger change and thus likely outside of the scope
> of this patch series.


> Oh, now you get to my comment in the preceding patch. With this patch
> we're now in a somewhat weird in-between state where the buffers are
> still static, but we release their memory after each call. So we kind of
> get the worst of both worlds: static variables without being able to
> reuse the buffer's memory.
> 
> If we were to change this then we should definitely mark the buffers as
> non-static. If so, it would be great to demonstrate that this does not
> significantly impact performance.
> 
> The same is true for all the other instances.

I had glossed over the fact that they're `static`, since I've grown accustomed
to Rust, where this sort of non-reentrant code is discouraged. However, this
pattern is great for fast-import, because all of its data is simply freed when
it exits at the end of the stream. I dropped this patch in v2.

I don't think it's worth hoisting these `strbuf`s out. It would only reduce the
total number of static `strbuf`s for paths from 5 to 2, but would make ownership
less clear.
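
To make the trade-off concrete, here is a rough sketch of the static scratch
buffer pattern under discussion. It does not use Git's `strbuf`; `struct
scratch`, `scratch_reset`, `scratch_add`, and `copy_path` are hypothetical
names made up for illustration. The point is that resetting keeps the
allocation, so later calls reuse the memory instead of reallocating.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Git's strbuf; not the real API. */
struct scratch {
	char *buf;
	size_t len, alloc;
};

static void scratch_reset(struct scratch *s)
{
	s->len = 0;		/* keep the allocation for the next call */
	if (s->buf)
		s->buf[0] = '\0';
}

static void scratch_add(struct scratch *s, const char *p, size_t n)
{
	if (s->len + n + 1 > s->alloc) {
		s->alloc = (s->len + n + 1) * 2;
		s->buf = realloc(s->buf, s->alloc);
		if (!s->buf)
			abort();
	}
	memcpy(s->buf + s->len, p, n);
	s->len += n;
	s->buf[s->len] = '\0';
}

/* Returns a pointer into a function-local static buffer; not reentrant. */
static const char *copy_path(const char *p)
{
	static struct scratch path;	/* zero-initialized, lives until exit */

	scratch_reset(&path);
	scratch_add(&path, p, strlen(p));
	return path.buf;
}
```

Calling `copy_path` with a long path and then a shorter one returns the same
pointer both times, which is the reuse that releasing the buffer after each
call would give up.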

Thalia




* Re: [PATCH 5/6] fast-import: document C-style escapes for paths
  2024-03-28  8:21   ` Patrick Steinhardt
@ 2024-04-01  9:06     ` Thalia Archibald
  0 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-01  9:06 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Elijah Newren

(Sending again as plain text)

On Mar 28, 2024, at 01:21, Patrick Steinhardt <ps@pks.im> wrote:
> On Fri, Mar 22, 2024 at 12:03:47AM +0000, Thalia Archibald wrote:
>> 
>> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
>> index 271bd63a10..4aa8ccbefd 100644
>> --- a/Documentation/git-fast-import.txt
>> +++ b/Documentation/git-fast-import.txt
>> @@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
>> In both formats `<path>` is the complete path of the file to be added
>> (if not already existing) or modified (if already existing).
>> 
>> -A `<path>` string must use UNIX-style directory separators (forward
>> -slash `/`), may contain any byte other than `LF`, and must not
>> -start with double quote (`"`).
>> +A `<path>` string may contain any byte other than `LF`, and must not
>> +start with double quote (`"`). It is interpreted as literal bytes
>> +without escaping.
> 
> Paths also mustn't start with a space in many cases, right?

It talks about starting with double quote, because that's what determines
whether it's parsed as a quoted or unquoted string.

Containing spaces is different. When unquoted, a path can only contain a space
if it's the last field on the line; that's all paths except the source paths of
filecopy and filerename. That restriction was already noted in the filecopy and
filerename sections, but it would help to state it in the general <note> section,
so I've done that and clarified quoting in patch v2 5/8 (fast-import: improve
documentation for unquoted paths).
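
As a small illustration of the rule for unquoted paths (a sketch, not code from
the series), the extent of the path depends only on whether it is the last
field on the line. `unquoted_path_len` is a hypothetical helper, and returning
`(size_t)-1` when no SP follows is my own choice for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of an unquoted path starting at p.
 * include_spaces != 0: the path is the last field, so it runs to the end
 * of the line and may contain SP.
 * include_spaces == 0: the path ends before the next SP; returns
 * (size_t)-1 if no SP follows.
 */
static size_t unquoted_path_len(const char *p, int include_spaces)
{
	const char *end;

	if (include_spaces)
		return strlen(p);
	end = strchr(p, ' ');
	return end ? (size_t)(end - p) : (size_t)-1;
}
```

For `my file.c` as the last field, the whole string is the path. As a
filecopy source in `my file.c dest.c`, only `my` would be taken, which is why
such a source path has to be quoted.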

Thalia


* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (7 preceding siblings ...)
  2024-04-01  9:03   ` [PATCH v2 8/8] fast-import: make comments more precise Thalia Archibald
@ 2024-04-07 21:19   ` Thalia Archibald
  2024-04-07 23:46     ` Eric Sunshine
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
  9 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-07 21:19 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Junio C Hamano, Elijah Newren, brian m. carlson, Jeff King

On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
>> fast-import has subtle differences in how it parses file paths between each
>> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
>> which could lead to silent data corruption. A particularly bad case is when a
>> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
>> supported), it would be treated as literal bytes instead of a quoted string.
>> 
>> Bring path parsing into line with the documented behavior and improve
>> documentation to fill in missing details.
> 
> Thanks for the review, Patrick. I've made several changes, which I think address
> your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> whether I, you, or Junio add it, I've refrained from attaching your name to
> these patches.

Hello! Friendly ping here. It’s been a week since the last emails for this patch
set, so I’d like to check in on the status.

As a new contributor to the project, I don’t yet have a full view of the
contribution flow, aside from what I’ve read. I suspect the latency is because I
may not have cc’d all the area experts. (When I sent v1, I used separate Cc
lines in send-email --compose, but it dropped all but the last. After Patrick
reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
the maintainer. For anyone not a part of the earlier discussion, the latest
version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
If that’s not the problem, I’d be keen to know what I could do better.

I have several more patch sets in the works that I’ve held back on sending
until this one is finished, in case I’ve been doing something wrong. I want to
move forward. Thank you for your time.

Thalia

> Changes since v1:
> * In fast-import:
>  * Move `strbuf_release` outside of `parse_path_space` and `parse_path_eol`.
>  * Retain allocations for static `strbuf`s.
>  * Rename `allow_spaces` parameter of `parse_path` to `include_spaces`.
>  * Extract change to neighboring comments as patch 8.
> * In tests:
>  * Test `` for the root path additionally in all tests using `""`.
>  * Pass all arguments by positional variables.
>  * Use `local`.
>  * Use `test_when_finished` instead of manual cleanup.
> * In documentation:
>  * Better document conditions under which a path is considered quoted or
>    unquoted.
> * Reword commit messages.
> 
> Thalia
> 
> 
> Thalia Archibald (8):
>  fast-import: tighten path unquoting
>  fast-import: directly use strbufs for paths
>  fast-import: allow unquoted empty path for root
>  fast-import: remove dead strbuf
>  fast-import: improve documentation for unquoted paths
>  fast-import: document C-style escapes for paths
>  fast-import: forbid escaped NUL in paths
>  fast-import: make comments more precise
> 
> Documentation/git-fast-import.txt |  30 +-
> builtin/fast-import.c             | 156 ++++----
> t/t9300-fast-import.sh            | 617 +++++++++++++++++++++---------
> 3 files changed, 541 insertions(+), 262 deletions(-)
> 
> Range-diff against v1:
> 1:  8d9e0b25cb ! 1:  e790bdf714 fast-import: tighten parsing of paths
>    @@ Metadata
>     Author: Thalia Archibald <thalia@archibald.dev>
> 
>      ## Commit message ##
>    -    fast-import: tighten parsing of paths
>    +    fast-import: tighten path unquoting
> 
>         Path parsing in fast-import is inconsistent and many unquoting errors
>    -    are suppressed.
>    +    are suppressed or not checked.
> 
>    -    `<path>` appears in the grammar in these places:
>    +    <path> appears in the grammar in these places:
> 
>             filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
>             filedelete ::= 'D' SP <path> LF
>    @@ Commit message
>         and fast-import.c parses them in five different ways:
> 
>         1. For filemodify and filedelete:
>    -       If `<path>` is a valid quoted string, unquote it;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    +       Try to unquote <path>. If it unquotes without errors, use the
>    +       unquoted version; otherwise, treat it as literal bytes to the end of
>    +       the line (including any number of SP).
>         2. For filecopy (source) and filerename (source):
>    -       If `<path>` is a valid quoted string, unquote it;
>    -       otherwise, treat it as literal bytes until the next SP.
>    +       Try to unquote <path>. If it unquotes without errors, use the
>    +       unquoted version; otherwise, treat it as literal bytes up to, but not
>    +       including, the next SP.
>         3. For filecopy (dest) and filerename (dest):
>    -       Like 1., but an unquoted empty string is an error.
>    +       Like 1., but an unquoted empty string is forbidden.
>         4. For ls:
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes to the end of the line
>    +       (including any number of SP).
>         5. For ls-commit:
>    -       Unquote `<path>` and report parse errors.
>    +       Unquote <path> and report parse errors.
>            (It must start with `"` to disambiguate from ls.)
> 
>         In the first three, any errors from trying to unquote a string are
>         suppressed, so a quoted string that contains invalid escapes would be
>         interpreted as literal bytes. For example, `"\xff"` would fail to
>         unquote (because hex escapes are not supported), and it would instead be
>    -    interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
>    +    interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
>         certainly not intended. Some front-ends erroneously use their language's
>    -    standard quoting routine and could silently introduce escapes that would
>    -    be incorrectly parsed due to this.
>    +    standard quoting routine instead of matching Git's, which could silently
>    +    introduce escapes that would be incorrectly parsed due to this and lead
>    +    to data corruption.
> 
>    -    The documentation states that “To use a source path that contains SP the
>    -    path must be quoted.”, so it is expected that some implementations
>    -    depend on spaces being allowed in paths in the final position. Thus we
>    -    have two documented ways to parse paths, so simplify the implementation
>    -    to that.
>    +    The documentation states “To use a source path that contains SP the path
>    +    must be quoted.”, so it is expected that some implementations depend on
>    +    spaces being allowed in paths in the final position. Thus we have two
>    +    documented ways to parse paths, so simplify the implementation to that.
> 
>         Now we have:
> 
>         1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
>            filerename (dest), ls, and ls-commit:
> 
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    -       Garbage after a quoted string or an unquoted empty string are errors.
>    -       (In ls-commit, it must be quoted to disambiguate from ls.)
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes to the end of the line
>    +       (including any number of SP).
> 
>         2. `parse_path_space` for filecopy (source) and filerename (source):
> 
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes until the next SP.
>    -       It must be followed by a SP. An unquoted empty string is an error.
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes up to, but not including, the
>    +       next SP. It must be followed by SP.
>    +
>    +    There remain two special cases: The dest <path> in filecopy and rename
>    +    cannot be an unquoted empty string (this will be addressed subsequently)
>    +    and <path> in ls-commit must be quoted to disambiguate it from ls.
> 
>         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> 
>    - ## Documentation/git-fast-import.txt ##
>    -@@ Documentation/git-fast-import.txt: The value of `<path>` must be in canonical form. That is it must not:
>    - * contain the special component `.` or `..` (e.g. `foo/./bar` and
>    -   `foo/../bar` are invalid).
>    -
>    --The root of the tree can be represented by an empty string as `<path>`.
>    -+The root of the tree can be represented by a quoted empty string (`""`)
>    -+as `<path>`.
>    -
>    - It is recommended that `<path>` always be encoded using UTF-8.
>    -
>    -
>      ## builtin/fast-import.c ##
>    -@@ builtin/fast-import.c: static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
>    -  *
>    -  *   idnum ::= ':' bigint;
>    -  *
>    -- * Return the first character after the value in *endptr.
>    -+ * Update *endptr to point to the first character after the value.
>    -  *
>    -  * Complain if the following character is not what is expected,
>    -  * either a space or end of the string.
>    -@@ builtin/fast-import.c: static uintmax_t parse_mark_ref_eol(const char *p)
>    - }
>    -
>    - /*
>    -- * Parse the mark reference, demanding a trailing space.  Return a
>    -- * pointer to the space.
>    -+ * Parse the mark reference, demanding a trailing space. Update *p to
>    -+ * point to the first character after the space.
>    -  */
>    - static uintmax_t parse_mark_ref_space(const char **p)
>    - {
>     @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>       return mark;
>      }
>    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>     +/*
>     + * Parse the path string into the strbuf. It may be quoted with escape sequences
>     + * or unquoted without escape sequences. When unquoted, it may only contain a
>    -+ * space if `allow_spaces` is nonzero.
>    ++ * space if `include_spaces` is nonzero.
>     + */
>    -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
>    ++static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
>     +{
>    -+ strbuf_reset(sb);
>     + if (*p == '"') {
>     + if (unquote_c_style(sb, p, endp))
>     + die("Invalid %s: %s", field, command_buf.buf);
>     + } else {
>    -+ if (allow_spaces)
>    ++ if (include_spaces)
>     + *endp = p + strlen(p);
>     + else
>     + *endp = strchr(p, ' ');
>    -+ if (*endp == p)
>    -+ die("Missing %s: %s", field, command_buf.buf);
>     + strbuf_add(sb, p, *endp - p);
>     + }
>     +}
>    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>       struct object_id oid;
>       uint16_t mode, inline_data = 0;
>     @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b)
>    - die("Missing space after SHA1: %s", command_buf.buf);
>       }
> 
>    -- strbuf_reset(&uq);
>    + strbuf_reset(&uq);
>     - if (!unquote_c_style(&uq, p, &endp)) {
>     - if (*endp)
>     - die("Garbage after path in: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       static struct strbuf uq = STRBUF_INIT;
>     - const char *endp;
> 
>    -- strbuf_reset(&uq);
>    + strbuf_reset(&uq);
>     - if (!unquote_c_style(&uq, p, &endp)) {
>     - if (*endp)
>     - die("Garbage after path in: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - const char *endp;
>       struct tree_entry leaf;
> 
>    -- strbuf_reset(&s_uq);
>    + strbuf_reset(&s_uq);
>     - if (!unquote_c_style(&s_uq, s, &endp)) {
>     - if (*endp != ' ')
>     - die("Missing space after source: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - strbuf_add(&s_uq, s, endp - s);
>     - }
>     + parse_path_space(&s_uq, p, &p, "source");
>    -+ parse_path_eol(&d_uq, p, "dest");
>       s = s_uq.buf;
>    --
>    +
>     - endp++;
>     - if (!*endp)
>    -- die("Missing dest: %s", command_buf.buf);
>    ++ if (!p)
>    + die("Missing dest: %s", command_buf.buf);
>     -
>     - d = endp;
>    -- strbuf_reset(&d_uq);
>    + strbuf_reset(&d_uq);
>     - if (!unquote_c_style(&d_uq, d, &endp)) {
>     - if (*endp)
>     - die("Garbage after dest in: %s", command_buf.buf);
>     - d = d_uq.buf;
>     - }
>    ++ parse_path_eol(&d_uq, p, "dest");
>     + d = d_uq.buf;
> 
>       memset(&leaf, 0, sizeof(leaf));
>       if (rename)
>    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
>    +
>    + static void parse_ls(const char *p, struct branch *b)
>      {
>    ++ static struct strbuf uq = STRBUF_INIT;
>       struct tree_entry *root = NULL;
>       struct tree_entry leaf = {NULL};
>    -+ static struct strbuf uq = STRBUF_INIT;
> 
>    - /* ls SP (<tree-ish> SP)? <path> */
>    - if (*p == '"') {
>     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>       root->versions[1].mode = S_IFDIR;
>       load_tree(root);
>    @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>     - die("Garbage after path in: %s", command_buf.buf);
>     - p = uq.buf;
>     - }
>    ++ strbuf_reset(&uq);
>     + parse_path_eol(&uq, p, "path");
>     + p = uq.buf;
>       tree_content_get(root, p, &leaf, 1);
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# Path parsing
>     +#
>     +# There are two sorts of ways a path can be parsed, depending on whether it is
>    -+# the last field on the line. Additionally, ls without a <tree-ish> has a
>    -+# special case. Test every occurrence of <path> in the grammar against every
>    -+# error case.
>    ++# the last field on the line. Additionally, ls without a <dataref> has a special
>    ++# case. Test every occurrence of <path> in the grammar against every error case.
>     +#
>     +
>     +#
>     +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
>     +# filerename (dest), and ls.
>     +#
>    -+# commit :301 from root -- modify hello.c
>    ++# commit :301 from root -- modify hello.c (for setup)
>     +# commit :302 from :301 -- modify $path
>     +# commit :303 from :302 -- delete $path
>     +# commit :304 from :301 -- copy hello.c $path
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# ls :305 $path
>     +#
>     +test_path_eol_success () {
>    -+ test="$1" path="$2" unquoted_path="$3"
>    ++ local test="$1" path="$2" unquoted_path="$3"
>     + test_expect_success "S: paths at EOL with $test must work" '
>    ++ test_when_finished "git branch -D S-path-eol" &&
>    ++
>     + git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
>     + blob
>     + mark :401
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + hallo welt
>     + BLOB
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :301
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + COMMIT
>     + M 100644 :401 hello.c
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :302
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + M 100644 :402 '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :303
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :302
>     + D '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :304
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + C hello.c '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :305
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + git ls-tree $commit_r >tree_r.out &&
>     + test_cmp tree_r.exp tree_r.out &&
>     +
>    -+ test_cmp out tree_r.exp &&
>    -+
>    -+ git branch -D path-eol
>    ++ test_cmp out tree_r.exp
>     + '
>     +}
>     +
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +#
>     +# Valid paths before a space: filecopy (source) and filerename (source).
>     +#
>    -+# commit :301 from root -- modify $path
>    ++# commit :301 from root -- modify $path (for setup)
>     +# commit :302 from :301 -- copy $path hello2.c
>     +# commit :303 from :301 -- rename $path hello2.c
>     +#
>     +test_path_space_success () {
>    -+ test="$1" path="$2" unquoted_path="$3"
>    ++ local test="$1" path="$2" unquoted_path="$3"
>     + test_expect_success "S: paths before space with $test must work" '
>    ++ test_when_finished "git branch -D S-path-space" &&
>    ++
>     + git fast-import --export-marks=marks.out <<-EOF 2>err &&
>     + blob
>     + mark :401
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + hello world
>     + BLOB
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :301
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + COMMIT
>     + M 100644 :401 '"$path"'
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :302
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + C '"$path"' hello2.c
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :303
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +
>     + printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
>     + git ls-tree $commit_r >tree_r.out &&
>    -+ test_cmp tree_r.exp tree_r.out &&
>    -+
>    -+ git branch -D path-space
>    ++ test_cmp tree_r.exp tree_r.out
>     + '
>     +}
>     +
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# of <path> in the grammar against all error kinds.
>     +#
>     +test_path_fail () {
>    -+ what="$1" path="$2" err_grep="$3"
>    ++ local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
>     + test_expect_success "S: $change with $what must fail" '
>     + test_must_fail git fast-import <<-EOF 2>err &&
>     + blob
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +}
>     +
>     +test_path_base_fail () {
>    -+ test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
>    -+ test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
>    ++ test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
>     +}
>     +test_path_eol_quoted_fail () {
>    -+ test_path_base_fail
>    -+ test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
>    -+ test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix"
>    ++ test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
>    ++ test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
>     +}
>     +test_path_eol_fail () {
>    -+ test_path_eol_quoted_fail
>    -+ test_path_fail 'empty unquoted path' '' "Missing $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
>     +}
>     +test_path_space_fail () {
>    -+ test_path_base_fail
>    -+ test_path_fail 'empty unquoted path' '' "Missing $field"
>    -+ test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix"
>    ++ test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
>     +}
>     +
>    -+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
>    -+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
>    -+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
>    -+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
>    -+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
>    -+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
>    -+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
>    ++test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
>    ++test_path_eol_fail   filedelete       'D '           path   ''
>    ++test_path_space_fail filecopy         'C '           source ' world.c'
>    ++test_path_eol_fail   filecopy         'C hello.c '   dest   ''
>    ++test_path_space_fail filerename       'R '           source ' world.c'
>    ++test_path_eol_fail   filerename       'R hello.c '   dest   ''
>    ++test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
>     +
>    -+# When 'ls' has no <tree-ish>, the <path> must be quoted.
>    -+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
>    -+test_path_eol_quoted_fail &&
>    -+test_path_fail 'empty unquoted path' '' "Invalid dataref"
>    ++# When 'ls' has no <dataref>, the <path> must be quoted.
>    ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
>     +
>      ###
>      ### series T (ls)
> 2:  a2aca9f9e6 ! 2:  82a6f53c13 fast-import: directly use strbufs for paths
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       die("Missing space after SHA1: %s", command_buf.buf);
>       }
> 
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
> 
>       /* Git does not track empty, non-toplevel directories. */
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - static struct strbuf uq = STRBUF_INIT;
>     + static struct strbuf path = STRBUF_INIT;
> 
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>     - tree_content_remove(&b->branch_tree, p, NULL, 1);
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
>     + tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
>      }
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     + static struct strbuf dest = STRBUF_INIT;
>       struct tree_entry leaf;
> 
>    +- strbuf_reset(&s_uq);
>     - parse_path_space(&s_uq, p, &p, "source");
>    -- parse_path_eol(&d_uq, p, "dest");
>     - s = s_uq.buf;
>    -- d = d_uq.buf;
>    ++ strbuf_reset(&source);
>     + parse_path_space(&source, p, &p, "source");
>    +
>    + if (!p)
>    + die("Missing dest: %s", command_buf.buf);
>    +- strbuf_reset(&d_uq);
>    +- parse_path_eol(&d_uq, p, "dest");
>    +- d = d_uq.buf;
>    ++ strbuf_reset(&dest);
>     + parse_path_eol(&dest, p, "dest");
> 
>       memset(&leaf, 0, sizeof(leaf));
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       &leaf.versions[1].oid,
>       leaf.versions[1].mode,
>       leaf.tree);
>    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
>    +
>    + static void parse_ls(const char *p, struct branch *b)
>      {
>    - struct tree_entry *root = NULL;
>    - struct tree_entry leaf = {NULL};
>     - static struct strbuf uq = STRBUF_INIT;
>     + static struct strbuf path = STRBUF_INIT;
>    + struct tree_entry *root = NULL;
>    + struct tree_entry leaf = {NULL};
> 
>    - /* ls SP (<tree-ish> SP)? <path> */
>    - if (*p == '"') {
>     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>       root->versions[1].mode = S_IFDIR;
>       load_tree(root);
>       }
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>     - tree_content_get(root, p, &leaf, 1);
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
>     + tree_content_get(root, path.buf, &leaf, 1);
>       /*
> 3:  ecaf4883d1 < -:  ---------- fast-import: release unfreed strbufs
> -:  ---------- > 3:  893bbf5e73 fast-import: allow unquoted empty path for root
> 4:  058a38416a ! 4:  cb05a184e6 fast-import: remove dead strbuf
>    @@ Metadata
>      ## Commit message ##
>         fast-import: remove dead strbuf
> 
>    -    The strbuf in `note_change_n` has been unused since the function was
>    +    The strbuf in `note_change_n` is to copy the remainder of `p` before
>    +    potentially invalidating it when reading the next line. However, `p` is
>    +    not used after that point. It has been unused since the function was
>         created in a8dd2e7d2b (fast-import: Add support for importing commit
>         notes, 2009-10-09) and looks to be a fossil from adapting
>    -    `note_change_m`. Remove it.
>    +    `file_change_m`. Remove it.
> 
>         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> 
> 5:  a5e8df0759 < -:  ---------- fast-import: document C-style escapes for paths
> 6:  9792940ba9 < -:  ---------- fast-import: forbid escaped NUL in paths
> -:  ---------- > 5:  1f34b632d7 fast-import: improve documentation for unquoted paths
> -:  ---------- > 6:  82a4da68af fast-import: document C-style escapes for paths
> -:  ---------- > 7:  c087c6a860 fast-import: forbid escaped NUL in paths
> -:  ---------- > 8:  a503c55b83 fast-import: make comments more precise
> --
> 2.44.0




* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-07 21:19   ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-04-07 23:46     ` Eric Sunshine
  2024-04-08  6:25       ` Patrick Steinhardt
  0 siblings, 1 reply; 84+ messages in thread
From: Eric Sunshine @ 2024-04-07 23:46 UTC (permalink / raw)
  To: Thalia Archibald
  Cc: Patrick Steinhardt, git, Junio C Hamano, Elijah Newren,
	brian m. carlson, Jeff King

On Sun, Apr 7, 2024 at 5:19 PM Thalia Archibald <thalia@archibald.dev> wrote:
> On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
> >> fast-import has subtle differences in how it parses file paths between each
> >> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> >> which could lead to silent data corruption. A particularly bad case is when a
> >> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> >> supported), it would be treated as literal bytes instead of a quoted string.
> >>
> >> Bring path parsing into line with the documented behavior and improve
> >> documentation to fill in missing details.
> >
> > Thanks for the review, Patrick. I've made several changes, which I think address
> > your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> > whether I, you, or Junio add it, I've refrained from attaching your name to
> > these patches.
>
> Hello! Friendly ping here. It’s been a week since the last emails for this patch
> set, so I’d like to check in on the status.

Pinging is certainly the correct thing to do.

Regarding `Reviewed-by:`: that trailer doesn't mean that someone
merely read and commented on a patch. Instead, it's explicitly _given_
by a reviewer to indicate that the reviewer has thoroughly reviewed
the patch set and is confident that it is ready to be merged into the
project, at which point Junio typically adds the `Reviewed-by:`.

> As a new contributor to the project, I don’t yet have a full view of the
> contribution flow, aside from what I’ve read. I suspect the latency is because I
> may not have cc’d all the area experts. (When I sent v1, I used separate Cc
> lines in send-email --compose, but it dropped all but the last. After Patrick
> reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
> another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
> the maintainer. For anyone not a part of the earlier discussion, the latest
> version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
> If that’s not the problem, I’d be keen to know what I could do better.

Lack of response may be due to the series being overlooked, or it
could mean that nobody has any particular interest in the changes
(which is not to say that there is anything wrong with the changes),
or that people are simply busy elsewhere. It could also mean that this
reroll is good enough and reviewers have nothing else to add. So,
cc'ing potentially interested people makes sense.

> I have several more patch sets in the works, that I’ve held back on sending
> until this one is finished, in case I’ve been doing something wrong. I want to
> move forward. Thank you for your time.

If the additional patch sets are unrelated to this patch set, then I
don't see a reason to hold them back. Feel free to send them. Even if
they are related to this patch set, you may still want to send them.
After all, doing so may get the ball rolling again on this patch set.


* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-07 23:46     ` Eric Sunshine
@ 2024-04-08  6:25       ` Patrick Steinhardt
  2024-04-08  7:15         ` Thalia Archibald
  2024-04-08 14:52         ` Junio C Hamano
  0 siblings, 2 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-08  6:25 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Thalia Archibald, git, Junio C Hamano, Elijah Newren,
	brian m. carlson, Jeff King


On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
> On Sun, Apr 7, 2024 at 5:19 PM Thalia Archibald <thalia@archibald.dev> wrote:
> > On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
> > >> fast-import has subtle differences in how it parses file paths between each
> > >> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> > >> which could lead to silent data corruption. A particularly bad case is when a
> > >> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> > >> supported), it would be treated as literal bytes instead of a quoted string.
> > >>
> > >> Bring path parsing into line with the documented behavior and improve
> > >> documentation to fill in missing details.
> > >
> > > Thanks for the review, Patrick. I've made several changes, which I think address
> > > your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> > > whether I, you, or Junio add it, I've refrained from attaching your name to
> > > these patches.
> >
> > Hello! Friendly ping here. It’s been a week since the last emails for this patch
> > set, so I’d like to check in on the status.
> 
> Pinging is certainly the correct thing to do.
> 
> Regarding `Reviewed-by:`: that trailer doesn't mean that someone
> merely read and commented on a patch. Instead, it's explicitly _given_
> by a reviewer to indicate that the reviewer has thoroughly reviewed
> the patch set and is confident that it is ready to be merged into the
> project, at which point Junio typically adds the `Reviewed-by:`.
> 
> > As a new contributor to the project, I don’t yet have a full view of the
> > contribution flow, aside from what I’ve read. I suspect the latency is because I
> > may not have cc’d all the area experts. (When I sent v1, I used separate Cc
> > lines in send-email --compose, but it dropped all but the last. After Patrick
> > reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
> > another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
> > the maintainer. For anyone not a part of the earlier discussion, the latest
> > version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
> > If that’s not the problem, I’d be keen to know what I could do better.
> 
> Lack of response may be due to the series being overlooked, or it
> could mean that nobody has any particular interest in the changes
> (which is not to say that there is anything wrong with the changes),
> or that people are simply busy elsewhere. It could also mean that this
> reroll is good enough and reviewers have nothing else to add. So,
> cc'ing potentially interested people makes sense.

Yeah, for this patch series I think it's mostly a lack of interest.
Which is too bad, because it does address some real issues with
git-fast-import(1). Part of the problem is also that this area does not
really have an area expert at all -- if you run git-shortlog(1) on
"builtin/fast-import.c" for the last year, for example, you will see
that it didn't get much attention at all.

Anyway, I'm currently trying to make it a habit to pick up and review
random patch series that didn't get any attention at all every once in a
while, which is also why I reviewed your first version. I'm taking these
a bit slower though, also in the hope that some initial discussion may
motivate others to chime in, as well. Which may explain why I didn't yet
review your v2.

In any case I do have it in my backlog and will get to it sometime this
week.

> > I have several more patch sets in the works, that I’ve held back on sending
> > until this one is finished, in case I’ve been doing something wrong. I want to
> > move forward. Thank you for your time.
> 
> If the additional patch sets are unrelated to this patch set, then I
> don't see a reason to hold them back. Feel free to send them. Even if
> they are related to this patch set, you may still want to send them.
> After all, doing so may get the ball rolling again on this patch set.

Agreed. Especially given that this is your first contribution, the
quality of your patch series is quite high. So I don't see much of a
reason to hold back the other patch series in case they are unrelated.

Patrick



* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-08  6:25       ` Patrick Steinhardt
@ 2024-04-08  7:15         ` Thalia Archibald
  2024-04-08  9:07           ` Patrick Steinhardt
  2024-04-08 14:52         ` Junio C Hamano
  1 sibling, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-08  7:15 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Eric Sunshine, Junio C Hamano, Elijah Newren,
	brian m. carlson, Jeff King

On Apr 7, 2024, at 23:25, Patrick Steinhardt <ps@pks.im> wrote:
> On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
>> 
>> Lack of response may be due to the series being overlooked, or it
>> could mean that nobody has any particular interest in the changes
>> (which is not to say that there is anything wrong with the changes),
>> or that people are simply busy elsewhere. It could also mean that this
>> reroll is good enough and reviewers have nothing else to add. So,
>> cc'ing potentially interested people makes sense.
> 
> Yeah, for this patch series I think it's mostly a lack of interest.
> Which is too bad, because it does address some real issues with
> git-fast-import(1). Part of the problem is also that this area does not
> really have an area expert at all -- if you run git-shortlog(1) on
> "builtin/fast-import.c" for the last year, for example, you will see
> that it didn't get much attention at all.

Unfortunately, my upcoming patches will probably suffer the same fate, as
they're mostly fixing parsing issues in fast-import.

> Anyway, I'm currently trying to make it a habit to pick up and review
> random patch series that didn't get any attention at all every once in a
> while, which is also why I reviewed your first version. I'm taking these
> a bit slower though, also in the hope that some initial discussion may
> motivate others to chime in, as well. Which may explain why I didn't yet
> review your v2.
> 
> In any case I do have it in my backlog and will get to it sometime this
> week.

Thank you!

>>> I have several more patch sets in the works, that I’ve held back on sending
>>> until this one is finished, in case I’ve been doing something wrong. I want to
>>> move forward. Thank you for your time.
>> 
>> If the additional patch sets are unrelated to this patch set, then I
>> don't see a reason to hold them back. Feel free to send them. Even if
>> they are related to this patch set, you may still want to send them.
>> After all, doing so may get the ball rolling again on this patch set.
> 
> Agreed. Especially given that this is your first contribution, the
> quality of your patch series is quite high. So I don't see much of a
> reason to hold back the other patch series in case they are unrelated.

My effort comes from reimplementing fast-import parsing as a Rust library,
following the implementation, not just the documentation, so I’ve noticed many
mismatches between the concrete and abstract grammars. Perhaps it would save
reviewer time to submit those around the same time, so knowledge of fast-import
needs to be evicted and loaded from cache less.

Thalia


* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-08  7:15         ` Thalia Archibald
@ 2024-04-08  9:07           ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-08  9:07 UTC (permalink / raw)
  To: Thalia Archibald
  Cc: git, Eric Sunshine, Junio C Hamano, Elijah Newren,
	brian m. carlson, Jeff King


On Mon, Apr 08, 2024 at 07:15:35AM +0000, Thalia Archibald wrote:
> On Apr 7, 2024, at 23:25, Patrick Steinhardt <ps@pks.im> wrote:
> > On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
[snip]
> >>> I have several more patch sets in the works, that I’ve held back on sending
> >>> until this one is finished, in case I’ve been doing something wrong. I want to
> >>> move forward. Thank you for your time.
> >> 
> >> If the additional patch sets are unrelated to this patch set, then I
> >> don't see a reason to hold them back. Feel free to send them. Even if
> >> they are related to this patch set, you may still want to send them.
> >> After all, doing so may get the ball rolling again on this patch set.
> > 
> > Agreed. Especially given that this is your first contribution, the
> > quality of your patch series is quite high. So I don't see much of a
> > reason to hold back the other patch series in case they are unrelated.
> 
> My effort comes from reimplementing fast-import parsing as a Rust library,
> following the implementation, not just the documentation, so I’ve noticed many
> mismatches between the concrete and abstract grammars. Perhaps it would save
> reviewer time to submit those around the same time, so knowledge of fast-import
> needs to be evicted and loaded from cache less.

In this case I think it depends on whether or not these patch series
would conflict with each other. If they do it's preferable to land them
sequentially. If they don't conflict then it should be fine to send
separate patch series and parallelize the review to a certain degree.

Patrick



* Re: [PATCH v2 0/8] fast-import: tighten parsing of paths
  2024-04-08  6:25       ` Patrick Steinhardt
  2024-04-08  7:15         ` Thalia Archibald
@ 2024-04-08 14:52         ` Junio C Hamano
  1 sibling, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-08 14:52 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Eric Sunshine, Thalia Archibald, git, Elijah Newren,
	brian m. carlson, Jeff King

Patrick Steinhardt <ps@pks.im> writes:

>> Regarding `Reviewed-by:`: that trailer doesn't mean that someone
>> merely read and commented on a patch. Instead, it's explicitly _given_
>> by a reviewer to indicate that the reviewer has thoroughly reviewed
>> the patch set and is confident that it is ready to be merged into the
>> project, at which point Junio typically adds the `Reviewed-by:`.

Just to clarify, "adds" means "cuts the reviewed-by lines written by
the reviewers in their review messages and pastes them into the
commit message, either amending the commits after the fact or while
applying them".  I do _not_ judge from the sideline the quality of
reviews given by others and say "yeah, that is an adequate review,
so I'll forge _their_ reviewed-by trailer".

> Anyway, I'm currently trying to make it a habit to pick up and review
> random patch series that didn't get any attention at all every once in a
> while, ...

I have noticed your effort and appreciated it very much.  I wish
there were many more others who do the same.  I of course have been
playing that role for some time as the final catch-all reviewer, but
a single person does not scale.

Thanks.


* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-01  9:02   ` [PATCH v2 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-10  8:18       ` Chris Torek
  2024-04-10  9:12       ` Thalia Archibald
  0 siblings, 2 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-10  6:27 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren


On Mon, Apr 01, 2024 at 09:02:47AM +0000, Thalia Archibald wrote:
> Path parsing in fast-import is inconsistent and many unquoting errors
> are suppressed or not checked.
> 
> <path> appears in the grammar in these places:
> 
>     filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
>     filedelete ::= 'D' SP <path> LF
>     filecopy   ::= 'C' SP <path> SP <path> LF
>     filerename ::= 'R' SP <path> SP <path> LF
>     ls         ::= 'ls' SP <dataref> SP <path> LF
>     ls-commit  ::= 'ls' SP <path> LF
> 
> and fast-import.c parses them in five different ways:
> 
> 1. For filemodify and filedelete:
>    Try to unquote <path>. If it unquotes without errors, use the
>    unquoted version; otherwise, treat it as literal bytes to the end of
>    the line (including any number of SP).
> 2. For filecopy (source) and filerename (source):
>    Try to unquote <path>. If it unquotes without errors, use the
>    unquoted version; otherwise, treat it as literal bytes up to, but not
>    including, the next SP.
> 3. For filecopy (dest) and filerename (dest):
>    Like 1., but an unquoted empty string is forbidden.
> 4. For ls:
>    If <path> starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes to the end of the line
>    (including any number of SP).
> 5. For ls-commit:
>    Unquote <path> and report parse errors.
>    (It must start with `"` to disambiguate from ls.)
> 
> In the first three, any errors from trying to unquote a string are
> suppressed, so a quoted string that contains invalid escapes would be
> interpreted as literal bytes. For example, `"\xff"` would fail to
> unquote (because hex escapes are not supported), and it would instead be
> interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
> certainly not intended. Some front-ends erroneously use their language's
> standard quoting routine instead of matching Git's, which could silently
> introduce escapes that would be incorrectly parsed due to this and lead
> to data corruption.
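
To make the failure mode above concrete, here is a minimal sketch of the
lenient and the tightened behaviour side by side. It assumes a toy
unquoter (`toy_unquote`) that stands in for Git's unquote_c_style() and
only knows a handful of escapes; `parse_lenient` and `parse_strict` are
hypothetical names for illustration, not functions in fast-import.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for Git's unquote_c_style(): it only knows the
 * escapes \\, \", \n and \t. Returns 0 on success and -1 on a malformed
 * string or an unrecognized escape such as \x. Unlike the real function,
 * it also rejects anything after the closing quote.
 */
static int toy_unquote(const char *in, char *out, size_t outlen)
{
	size_t n = 0;

	if (!outlen || *in++ != '"')
		return -1;
	while (*in && *in != '"') {
		char c = *in++;

		if (c == '\\') {
			switch (*in++) {
			case '\\': c = '\\'; break;
			case '"':  c = '"';  break;
			case 'n':  c = '\n'; break;
			case 't':  c = '\t'; break;
			default:   return -1;	/* e.g. \x is not supported */
			}
		}
		if (n + 1 >= outlen)
			return -1;	/* output buffer too small */
		out[n++] = c;
	}
	if (*in != '"' || in[1] != '\0')
		return -1;	/* unterminated, or garbage after the quote */
	out[n] = '\0';
	return 0;
}

/* Lenient behaviour: if unquoting fails, keep the raw bytes as the path. */
static int parse_lenient(const char *p, char *out, size_t outlen)
{
	if (!toy_unquote(p, out, outlen))
		return 0;
	if (strlen(p) >= outlen)
		return -1;
	strcpy(out, p);	/* "\xff" in quotes silently becomes a literal path */
	return 0;
}

/* Tightened behaviour: a leading '"' commits to quoted parsing. */
static int parse_strict(const char *p, char *out, size_t outlen)
{
	if (*p == '"')
		return toy_unquote(p, out, outlen);
	if (strlen(p) >= outlen)
		return -1;
	strcpy(out, p);
	return 0;
}
```

With this sketch, the input `"\xff"` (quotes included) is accepted by
parse_lenient() as a six-byte literal path, while parse_strict() reports
an error, which is the distinction the commit message describes.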
> 
> The documentation states “To use a source path that contains SP the path
> must be quoted.”, so it is expected that some implementations depend on
> spaces being allowed in paths in the final position. Thus we have two
> documented ways to parse paths, so simplify the implementation to that.
> 
> Now we have:
> 
> 1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
>    filerename (dest), ls, and ls-commit:
> 
>    If <path> starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes to the end of the line
>    (including any number of SP).
> 
> 2. `parse_path_space` for filecopy (source) and filerename (source):
> 
>    If <path> starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes up to, but not including, the
>    next SP. It must be followed by SP.
> 
> There remain two special cases: The dest <path> in filecopy and filerename
> cannot be an unquoted empty string (this will be addressed subsequently)
> and <path> in ls-commit must be quoted to disambiguate it from ls.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c  | 102 ++++++++++-------
>  t/t9300-fast-import.sh | 251 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 309 insertions(+), 44 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 782bda007c..6f9048a037 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2258,10 +2258,54 @@ static uintmax_t parse_mark_ref_space(const char **p)
>  	return mark;
>  }
>  
> +/*
> + * Parse the path string into the strbuf. It may be quoted with escape sequences
> + * or unquoted without escape sequences. When unquoted, it may only contain a
> + * space if `include_spaces` is nonzero.
> + */
> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)

Let's break this overly long line, for example after `**endp,`.

> +{
> +	if (*p == '"') {
> +		if (unquote_c_style(sb, p, endp))
> +			die("Invalid %s: %s", field, command_buf.buf);
> +	} else {
> +		if (include_spaces)
> +			*endp = p + strlen(p);
> +		else
> +			*endp = strchr(p, ' ');
> +		strbuf_add(sb, p, *endp - p);

strchr(3P) may return a NULL pointer in case there is no space, which
would make us segfault here when dereferencing `*endp`. We should
probably add a testcase that would hit this edge case.
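
One possible shape for a NULL-safe version of the unquoted branch is
sketched below. `unquoted_path_end` is a hypothetical helper introduced
only for illustration; it is not part of the patch, and the eventual fix
may well look different.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the end of an unquoted path
 * starting at p. With include_spaces set, the path runs to the end of
 * the line; otherwise it stops at the first SP. If no SP is present,
 * fall back to the terminating NUL instead of returning NULL, so that
 * callers can safely compute `end - p` and inspect `*end`.
 */
static const char *unquoted_path_end(const char *p, int include_spaces)
{
	const char *end = include_spaces ? NULL : strchr(p, ' ');

	return end ? end : p + strlen(p);
}
```

With a helper like this, a caller that requires a trailing SP can still
detect its absence, because `*end` is then NUL rather than ' '.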

> +	}
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and complain if this is not the end of
> + * the string. It may contain spaces even when unquoted.
> + */
> +static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
> +{
> +	const char *end;
> +
> +	parse_path(sb, p, &end, 1, field);
> +	if (*end)
> +		die("Garbage after %s: %s", field, command_buf.buf);
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and ensure it is followed by a space.
> + * It may not contain spaces when unquoted. Update *endp to point to the first
> + * character after the space.
> + */
> +static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
> +{
> +	parse_path(sb, p, endp, 0, field);
> +	if (**endp != ' ')
> +		die("Missing space after %s: %s", field, command_buf.buf);
> +	(*endp)++;
> +}
> +
>  static void file_change_m(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  	struct object_entry *oe;
>  	struct object_id oid;
>  	uint16_t mode, inline_data = 0;
> @@ -2299,11 +2343,8 @@ static void file_change_m(const char *p, struct branch *b)
>  	}
>  
>  	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;
>  
>  	/* Git does not track empty, non-toplevel directories. */
>  	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
> @@ -2367,48 +2408,29 @@ static void file_change_m(const char *p, struct branch *b)
>  static void file_change_d(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  
>  	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;
>  	tree_content_remove(&b->branch_tree, p, NULL, 1);
>  }
>  
> -static void file_change_cr(const char *s, struct branch *b, int rename)
> +static void file_change_cr(const char *p, struct branch *b, int rename)
>  {
> -	const char *d;
> +	const char *s, *d;
>  	static struct strbuf s_uq = STRBUF_INIT;
>  	static struct strbuf d_uq = STRBUF_INIT;
> -	const char *endp;
>  	struct tree_entry leaf;
>  
>  	strbuf_reset(&s_uq);
> -	if (!unquote_c_style(&s_uq, s, &endp)) {
> -		if (*endp != ' ')
> -			die("Missing space after source: %s", command_buf.buf);
> -	} else {
> -		endp = strchr(s, ' ');
> -		if (!endp)
> -			die("Missing space after source: %s", command_buf.buf);
> -		strbuf_add(&s_uq, s, endp - s);
> -	}
> +	parse_path_space(&s_uq, p, &p, "source");
>  	s = s_uq.buf;
>  
> -	endp++;
> -	if (!*endp)
> +	if (!p)
>  		die("Missing dest: %s", command_buf.buf);

So this statement right now doesn't make a whole lot of sense because
`p` cannot ever be `NULL` -- we'd segfault before that. Once we update
`parse_path()` to handle this correctly it will work as expected though.

I was briefly wondering though whether we really want `parse_path()` to
set `p` to be a NULL pointer. If we didn't, we could retain the previous
behaviour here and instead check for `!*p`.
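
A rough sketch of that alternative, covering unquoted paths only:
`split_source` and `check_copy_args` are hypothetical names, and a fixed
buffer stands in for the strbuf. The point is that the cursor is never
NULL after the source has been parsed, so a missing dest shows up as
`!*p`.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the unquoted branch of parse_path_space():
 * copy the source path (up to the first SP) into buf and advance *p past
 * the separator. Returns 0 on success, -1 if no SP follows the path or
 * the path does not fit. Quoted paths are out of scope for this sketch.
 */
static int split_source(const char **p, char *buf, size_t buflen)
{
	const char *end = strchr(*p, ' ');
	size_t len;

	if (!end)
		return -1;	/* "Missing space after source" */
	len = (size_t)(end - *p);
	if (len >= buflen)
		return -1;	/* a real strbuf would grow instead */
	memcpy(buf, *p, len);
	buf[len] = '\0';
	*p = end + 1;		/* never NULL: points at dest or at NUL */
	return 0;
}

/* Returns 0 if both fields are present, -1 for a missing SP, -2 for a missing dest. */
static int check_copy_args(const char *line, char *src, size_t srclen)
{
	const char *p = line;

	if (split_source(&p, src, srclen))
		return -1;
	if (!*p)
		return -2;	/* "Missing dest": nothing after the SP */
	return 0;
}
```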

Patrick

> -	d = endp;
>  	strbuf_reset(&d_uq);
> -	if (!unquote_c_style(&d_uq, d, &endp)) {
> -		if (*endp)
> -			die("Garbage after dest in: %s", command_buf.buf);
> -		d = d_uq.buf;
> -	}
> +	parse_path_eol(&d_uq, p, "dest");
> +	d = d_uq.buf;
>  
>  	memset(&leaf, 0, sizeof(leaf));
>  	if (rename)
> @@ -3152,6 +3174,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
>  
>  static void parse_ls(const char *p, struct branch *b)
>  {
> +	static struct strbuf uq = STRBUF_INIT;
>  	struct tree_entry *root = NULL;
>  	struct tree_entry leaf = {NULL};
>  
> @@ -3168,16 +3191,9 @@ static void parse_ls(const char *p, struct branch *b)
>  			root->versions[1].mode = S_IFDIR;
>  		load_tree(root);
>  	}
> -	if (*p == '"') {
> -		static struct strbuf uq = STRBUF_INIT;
> -		const char *endp;
> -		strbuf_reset(&uq);
> -		if (unquote_c_style(&uq, p, &endp))
> -			die("Invalid path: %s", command_buf.buf);
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	strbuf_reset(&uq);
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;
>  	tree_content_get(root, p, &leaf, 1);
>  	/*
>  	 * A directory in preparation would have a sha1 of zero
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 60e30fed3c..0fb5612b07 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -2142,6 +2142,7 @@ test_expect_success 'Q: deny note on empty branch' '
>  	EOF
>  	test_must_fail git fast-import <input
>  '
> +
>  ###
>  ### series R (feature and option)
>  ###
> @@ -2790,7 +2791,7 @@ test_expect_success 'R: blob appears only once' '
>  '
>  
>  ###
> -### series S
> +### series S (mark and path parsing)
>  ###
>  #
>  # Make sure missing spaces and EOLs after mark references
> @@ -3060,6 +3061,254 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
>  	test_grep "space after tree-ish" err
>  '
>  
> +#
> +# Path parsing
> +#
> +# There are two sorts of ways a path can be parsed, depending on whether it is
> +# the last field on the line. Additionally, ls without a <dataref> has a special
> +# case. Test every occurrence of <path> in the grammar against every error case.
> +#
> +
> +#
> +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
> +# filerename (dest), and ls.
> +#
> +# commit :301 from root -- modify hello.c (for setup)
> +# commit :302 from :301 -- modify $path
> +# commit :303 from :302 -- delete $path
> +# commit :304 from :301 -- copy hello.c $path
> +# commit :305 from :301 -- rename hello.c $path
> +# ls :305 $path
> +#
> +test_path_eol_success () {
> +	local test="$1" path="$2" unquoted_path="$3"
> +	test_expect_success "S: paths at EOL with $test must work" '
> +		test_when_finished "git branch -D S-path-eol" &&
> +
> +		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		blob
> +		mark :402
> +		data <<BLOB
> +		hallo welt
> +		BLOB
> +
> +		commit refs/heads/S-path-eol
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 hello.c
> +
> +		commit refs/heads/S-path-eol
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filemodify
> +		COMMIT
> +		from :301
> +		M 100644 :402 '"$path"'
> +
> +		commit refs/heads/S-path-eol
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filedelete
> +		COMMIT
> +		from :302
> +		D '"$path"'
> +
> +		commit refs/heads/S-path-eol
> +		mark :304
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy dest
> +		COMMIT
> +		from :301
> +		C hello.c '"$path"'
> +
> +		commit refs/heads/S-path-eol
> +		mark :305
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename dest
> +		COMMIT
> +		from :301
> +		R hello.c '"$path"'
> +
> +		ls :305 '"$path"'
> +		EOF
> +
> +		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
> +		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
> +		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
> +		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&

I think it is more customary to format as follows:

	(
		printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
		printf "100644 blob $blob1\thello.c\n"
	) | sort >tree_m.exp &&

Same for other statements further down.

Also, there is no need to do `'"$unquoted_path"'` here. You should be
able to refer to `$unquoted_path` just fine even without unquoting again
because we use eval to execute the code block. In fact, it can even be
harmful as it is known to break shells under some circumstances. See
also 7c4449eb31 (t/README: document how to loop around test cases,
2024-03-22), which I think should apply in your case, too.

Patrick

> +		git ls-tree $commit_m | sort >tree_m.out &&
> +		test_cmp tree_m.exp tree_m.out &&
> +
> +		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
> +		git ls-tree $commit_d >tree_d.out &&
> +		test_cmp tree_d.exp tree_d.out &&
> +
> +		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out &&
> +
> +		test_cmp out tree_r.exp
> +	'
> +}
> +
> +test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
> +test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
> +
> +#
> +# Valid paths before a space: filecopy (source) and filerename (source).
> +#
> +# commit :301 from root -- modify $path (for setup)
> +# commit :302 from :301 -- copy $path hello2.c
> +# commit :303 from :301 -- rename $path hello2.c
> +#
> +test_path_space_success () {
> +	local test="$1" path="$2" unquoted_path="$3"
> +	test_expect_success "S: paths before space with $test must work" '
> +		test_when_finished "git branch -D S-path-space" &&
> +
> +		git fast-import --export-marks=marks.out <<-EOF 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/S-path-space
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 '"$path"'
> +
> +		commit refs/heads/S-path-space
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy source
> +		COMMIT
> +		from :301
> +		C '"$path"' hello2.c
> +
> +		commit refs/heads/S-path-space
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename source
> +		COMMIT
> +		from :301
> +		R '"$path"' hello2.c
> +
> +		EOF
> +
> +		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
> +		blob=$(grep :401 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out
> +	'
> +}
> +
> +test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
> +test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
> +
> +#
> +# Test a single commit change with an invalid path. Run it with all occurrences
> +# of <path> in the grammar against all error kinds.
> +#
> +test_path_fail () {
> +	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
> +	test_expect_success "S: $change with $what must fail" '
> +		test_must_fail git fast-import <<-EOF 2>err &&
> +		blob
> +		mark :1
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/S-path-fail
> +		mark :2
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit setup
> +		COMMIT
> +		M 100644 :1 hello.c
> +
> +		commit refs/heads/S-path-fail
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit with bad path
> +		COMMIT
> +		from :2
> +		'"$prefix$path$suffix"'
> +		EOF
> +
> +		test_grep '"'$err_grep'"' err
> +	'
> +}
> +
> +test_path_base_fail () {
> +	local change="$1" prefix="$2" field="$3" suffix="$4"
> +	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
> +	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
> +}
> +test_path_eol_quoted_fail () {
> +	local change="$1" prefix="$2" field="$3" suffix="$4"
> +	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
> +	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
> +	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
> +}
> +test_path_eol_fail () {
> +	local change="$1" prefix="$2" field="$3" suffix="$4"
> +	test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
> +}
> +test_path_space_fail () {
> +	local change="$1" prefix="$2" field="$3" suffix="$4"
> +	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
> +	test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
> +}
> +
> +test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
> +test_path_eol_fail   filedelete       'D '           path   ''
> +test_path_space_fail filecopy         'C '           source ' world.c'
> +test_path_eol_fail   filecopy         'C hello.c '   dest   ''
> +test_path_space_fail filerename       'R '           source ' world.c'
> +test_path_eol_fail   filerename       'R hello.c '   dest   ''
> +test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
> +
> +# When 'ls' has no <dataref>, the <path> must be quoted.
> +test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
> +
>  ###
>  ### series T (ls)
>  ###
> -- 
> 2.44.0
> 


* Re: [PATCH v2 2/8] fast-import: directly use strbufs for paths
  2024-04-01  9:03   ` [PATCH v2 2/8] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-10 10:07       ` Thalia Archibald
  0 siblings, 1 reply; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-10  6:27 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

On Mon, Apr 01, 2024 at 09:03:06AM +0000, Thalia Archibald wrote:
> Previously, one case would not write the path to the strbuf: when the
> path is unquoted and at the end of the string. It was essentially
> copy-on-write. However, with the logic simplification of the previous
> commit, this case was eliminated and the strbuf is always populated.
> 
> Directly use the strbufs now instead of an alias.
> 
> Since this already changes all the lines that use the strbufs, rename
> them from `uq` to be more descriptive. That they are unquoted is not
> their most important property, so name them after what they carry.
> 
> Additionally, `file_change_m` no longer needs to copy the path before
> reading inline data.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 64 ++++++++++++++++++-------------------------
>  1 file changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 6f9048a037..fad9324e59 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2305,7 +2305,
> 7 @@ static void parse_path_space(struct strbuf *sb, const char *p, const char **endp
>  
>  static void file_change_m(const char *p, struct branch *b)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;
>  	struct object_entry *oe;
>  	struct object_id oid;
>  	uint16_t mode, inline_data = 0;
> @@ -2342,13 +2342,12 @@ static void file_change_m(const char *p, struct branch *b)
>  			die("Missing space after SHA1: %s", command_buf.buf);
>  	}
>  
> -	strbuf_reset(&uq);
> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> +	strbuf_reset(&path);
> +	parse_path_eol(&path, p, "path");
>  
>  	/* Git does not track empty, non-toplevel directories. */
> -	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
> -		tree_content_remove(&b->branch_tree, p, NULL, 0);
> +	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
> +		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
>  		return;
>  	}
>  
> @@ -2369,10 +2368,6 @@ static void file_change_m(const char *p, str
> uct branch *b)
>  		if (S_ISDIR(mode))
>  			die("Directories cannot be specified 'inline': %s",
>  				command_buf.buf);
> -		if (p != uq.buf) {
> -			strbuf_addstr(&uq, p);
> -			p = uq.buf;
> -		}
>  		while (read_next_command() != EOF) {
>  			const char *v;
>  			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
> @@ -2398,55 +2393,51 @@ static void file_change_m(const char *p, struct branch *b)
>  				command_buf.buf);
>  	}
>  
> -	if (!*p) {
> +	if (!*path.buf) {
>  		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
>  		return;
>  	}
> -	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
> +	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
>  }
>  
>  static void file_change_d(const char *p, struct branch *b)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;
>  
> -	strbuf_reset(&uq);
> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> -	tree_content_remove(&b->branch_tree, p, NULL, 1);
> +	strbuf_reset(&path);
> +	parse_path_eol(&path, p
> , "path");

This looks weird. Did you manually edit the patch or is there some weird
character in here that breaks diff generation?

> +	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
>  }
>  
>  static void file_change_cr(const char *p, struct branch *b, int rename)
>  {
> -	const char *s, *d;
> -	static struct strbuf s_uq = STRBUF_INIT;
> -	static struct strbuf d_uq = STRBUF_INIT;
> +	static struct strbuf source = STRBUF_INIT;
> +	static struct strbuf dest = STRBUF_INIT;
>  	struct tree_entry leaf;
>  
> -	strbuf_reset(&s_uq);
> -	parse_path_space(&s_uq, p, &p, "source");
> -	s = s_uq.buf;
> +	strbuf_reset(&source);
> +	parse_path_space(&source, p, &p, "source");
>  
>  	if (!p)
>  		die("Missing dest: %s", command_buf.buf);
> -	strbuf_reset(&d_uq);
> -	parse_path_eol(&d_uq, p, "dest");
> -	d = d_uq.buf;
> +	strbuf_reset(&dest);
> +	parse_path_eol(&dest, p, "dest");
>  
>  	memset(&leaf, 0, sizeof(leaf));
>  	if (rename)
> -		tree_content_remove(&b->branch_tree, s, &leaf, 1);
> +		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
>  	else
> -		tree_content_get(&b->branch_tree, s, &leaf, 1);
> +		tree_content_get(&b-
> >branch_tree, source.buf, &leaf, 1);

Same here. Is your mail agent maybe wrapping lines?

>  	if (!leaf.versions[1].mode)
> -		die("Path %s not in branch", s);
> -	if (!*d) {	/* C "path/to/subdir" "" */
> +		die("Path %s not in branch", source.buf);
> +	if (!*dest.buf) {	/* C "path/to/subdir" "" */
>  		tree_content_replace(&b->branch_tree,
>  			&leaf.versions[1].oid,
>  			leaf.versions[1].mode,
>  			leaf.tree);
>  		return;
>  	}
> -	tree_content_set(&b->branch_tree, d,
> +	tree_content_set(&b->branch_tree, dest.buf,
>  		&leaf.versions[1].oid,
>  		leaf.versions[1].mode,
>  		leaf.tree);
> @@ -3174,7 +3165,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
>  
>  static void parse_ls(const char *p, struct branch *b)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
> +	static struct strbuf path = STRBUF_INIT;
>  	struct tree_entry *root = NULL;
>  	struct tree_entry leaf = {NULL};
>  
> @@ -3191,10 +3182,9 @@ static void parse_ls(const char *p, struct branch *b)
>  			root->versions[1].mode = S_IFDIR;
>  		load_tree(root);
>  	}
> -	s
> trbuf_reset(&uq);

And here.

Other than those formatting issues this patch looks fine to me.

Patrick

> -	parse_path_eol(&uq, p, "path");
> -	p = uq.buf;
> -	tree_content_get(root, p, &leaf, 1);
> +	strbuf_reset(&path);
> +	parse_path_eol(&path, p, "path");
> +	tree_content_get(root, path.buf, &leaf, 1);
>  	/*
>  	 * A directory in preparation would have a sha1 of zero
>  	 * until it is saved.  Save, for simplicity.
> @@ -3202,7 +3192,7 @@ static void parse_ls(const char *p, struct branch *b)
>  	if (S_ISDIR(leaf.versions[1].mode))
>  		store_tree(&leaf);
>  
> -	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
> +	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
>  	if (leaf.tree)
>  		release_tree_content_recursive(leaf.tree);
>  	if (!b || root != &b->branch_tree)
> -- 
> 2.44.0
> 


* Re: [PATCH v2 3/8] fast-import: allow unquoted empty path for root
  2024-04-01  9:03   ` [PATCH v2 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-10  6:27     ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-10  6:27 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

[-- Attachment #1: Type: text/plain, Size: 16533 bytes --]

On Mon, Apr 01, 2024 at 09:03:17AM +0000, Thalia Archibald wrote:
> Ever since filerename was added in f39a946a1f (Support wholesale
> directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
> (Teach fast-import to recursively copy files/directories, 2007-07-15),
> both have produced an error when the destination path is empty. Later,
> when support for targeting the root directory with an empty string was
> added in 2794ad5244 (fast-import: Allow filemodify to set the root,
> 2010-10-10), this had the effect of allowing the quoted empty string
> (`""`), but forbidding its unquoted variant (``). This seems to have
> been intended as simple data validation for parsing two paths, rather
> than a syntax restriction, because it was not extended to the other
> operations.
> 
> All other occurrences of paths (in filemodify, filedelete, the source of
> filecopy and filerename, and ls) allow both.
> 
> For most of this feature's lifetime, the documentation has not
> prescribed the use of quoted empty strings. In e5959106d6
> (Documentation/fast-import: put explanation of M 040000 <dataref> "" in
> context, 2011-01-15), its documentation was changed from “`<path>` may
> also be an empty string (`""`) to specify the root of the tree” to “The
> root of the tree can be represented by an empty string as `<path>`”.
> 
> Thus, we can assume that some front-ends have depended on this behavior.
> 
> Remove this restriction for the destination paths of filecopy and
> filerename and change tests targeting the root to test `""` and ``.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c  |   5 +-
>  t/t9300-fast-import.sh | 363 +++++++++++++++++++++--------------------
>  2 files changed, 191 insertions(+), 177 deletions(-)
> 
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index fad9324e59..58cc8d4ede 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2416,11 +2416,8 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
>  	struct tree_entry leaf;
>  
>  	strbuf_reset(&source);
> -	parse_path_space(&source, p, &p, "source");

Nit: the diff would be a bit easier to read if you retained the sequence
of `strbuf_reset()` and `parse_path_space()`.

> -
> -	if (!p)
> -		die("Missing dest: %s", command_buf.buf);

>  	strbuf_reset(&dest);

I also wonder why this actually makes a difference. As mentioned in a
preceding mail, `if (!p)` cannot really do anything because the only
case where `p` could be a `NULL` pointer is when strchr(3P) did not
find a subsequent space in `parse_path()`. And in that case we would
have segfaulted because we do dereference `p` afterwards.

> +	parse_path_space(&source, p, &p, "source");
>  	parse_path_eol(&dest, p, "dest");
>  
>  	memset(&leaf, 0, sizeof(leaf));
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 0fb5612b07..635b1b9af7 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -1059,30 +1059,33 @@ test_expect_success 'M: rename subdirectory to new subdirectory' '
>  	compare_diff_raw expect actual
>  '
>  
> -test_expect_success 'M: rename root to subdirectory' '
> -	cat >input <<-INPUT_END &&
> -	commit refs/heads/M4
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	rename root
> -	COMMIT
> +for root in '""' ''
> +do
> +	test_expect_success "M: rename root ($root) to subdirectory" '
> +		cat >input <<-INPUT_END &&
> +		commit refs/heads/M4
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		rename root
> +		COMMIT
>  
> -	from refs/heads/M2^0
> -	R "" sub
> +		from refs/heads/M2^0
> +		R '"$root"' sub

Same comment here: we should not do the `'"$root"'` dance, but can
instead just refer to the variable directly in the quoted block.

Patrick

> -	INPUT_END
> +		INPUT_END
>  
> -	cat >expect <<-EOF &&
> -	:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
> -	:100755 100755 $f4id $f4id R100	file4	sub/file4
> -	:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
> -	:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
> -	:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
> -	EOF
> -	git fast-import <input &&
> -	git diff-tree -M -r M4^ M4 >actual &&
> -	compare_diff_raw expect actual
> -'
> +		cat >expect <<-EOF &&
> +		:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
> +		:100755 100755 $f4id $f4id R100	file4	sub/file4
> +		:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
> +		:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
> +		:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
> +		EOF
> +		git fast-import <input &&
> +		git diff-tree -M -r M4^ M4 >actual &&
> +		compare_diff_raw expect actual
> +	'
> +done
>  
>  ###
>  ### series N
> @@ -1259,49 +1262,52 @@ test_expect_success PIPE 'N: empty directory reads as missing' '
>  	test_cmp expect actual
>  '
>  
> -test_expect_success 'N: copy root directory by tree hash' '
> -	cat >expect <<-EOF &&
> -	:100755 000000 $newf $zero D	file3/newf
> -	:100644 000000 $oldf $zero D	file3/oldf
> -	EOF
> -	root=$(git rev-parse refs/heads/branch^0^{tree}) &&
> -	cat >input <<-INPUT_END &&
> -	commit refs/heads/N6
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	copy root directory by tree hash
> -	COMMIT
> +for root in '""' ''
> +do
> +	test_expect_success "N: copy root ($root) by tree hash" '
> +		cat >expect <<-EOF &&
> +		:100755 000000 $newf $zero D	file3/newf
> +		:100644 000000 $oldf $zero D	file3/oldf
> +		EOF
> +		root_tree=$(git rev-parse refs/heads/branch^0^{tree}) &&
> +		cat >input <<-INPUT_END &&
> +		commit refs/heads/N6
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		copy root directory by tree hash
> +		COMMIT
>  
> -	from refs/heads/branch^0
> -	M 040000 $root ""
> -	INPUT_END
> -	git fast-import <input &&
> -	git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
> -	compare_diff_raw expect actual
> -'
> +		from refs/heads/branch^0
> +		M 040000 $root_tree '"$root"'
> +		INPUT_END
> +		git fast-import <input &&
> +		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
> +		compare_diff_raw expect actual
> +	'
>  
> -test_expect_success 'N: copy root by path' '
> -	cat >expect <<-EOF &&
> -	:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
> -	:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
> -	:100755 100755 $f4id $f4id C100	file4	oldroot/file4
> -	:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
> -	:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
> -	EOF
> -	cat >input <<-INPUT_END &&
> -	commit refs/heads/N-copy-root-path
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	copy root directory by (empty) path
> -	COMMIT
> +	test_expect_success "N: copy root ($root) by path" '
> +		cat >expect <<-EOF &&
> +		:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
> +		:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
> +		:100755 100755 $f4id $f4id C100	file4	oldroot/file4
> +		:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
> +		:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
> +		EOF
> +		cat >input <<-INPUT_END &&
> +		commit refs/heads/N-copy-root-path
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		copy root directory by (empty) path
> +		COMMIT
>  
> -	from refs/heads/branch^0
> -	C "" oldroot
> -	INPUT_END
> -	git fast-import <input &&
> -	git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
> -	compare_diff_raw expect actual
> -'
> +		from refs/heads/branch^0
> +		C '"$root"' oldroot
> +		INPUT_END
> +		git fast-import <input &&
> +		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
> +		compare_diff_raw expect actual
> +	'
> +done
>  
>  test_expect_success 'N: delete directory by copying' '
>  	cat >expect <<-\EOF &&
> @@ -1431,98 +1437,102 @@ test_expect_success 'N: reject foo/ syntax in ls argument' '
>  	INPUT_END
>  '
>  
> -test_expect_success 'N: copy to root by id and modify' '
> -	echo "hello, world" >expect.foo &&
> -	echo hello >expect.bar &&
> -	git fast-import <<-SETUP_END &&
> -	commit refs/heads/N7
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	hello, tree
> -	COMMIT
> +for root in '""' ''
> +do
> +	test_expect_success "N: copy to root ($root) by id and modify" '
> +		echo "hello, world" >expect.foo &&
> +		echo hello >expect.bar &&
> +		git fast-import <<-SETUP_END &&
> +		commit refs/heads/N7
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		hello, tree
> +		COMMIT
>  
> -	deleteall
> -	M 644 inline foo/bar
> -	data <<EOF
> -	hello
> -	EOF
> -	SETUP_END
> +		deleteall
> +		M 644 inline foo/bar
> +		data <<EOF
> +		hello
> +		EOF
> +		SETUP_END
>  
> -	tree=$(git rev-parse --verify N7:) &&
> -	git fast-import <<-INPUT_END &&
> -	commit refs/heads/N8
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	copy to root by id and modify
> -	COMMIT
> +		tree=$(git rev-parse --verify N7:) &&
> +		git fast-import <<-INPUT_END &&
> +		commit refs/heads/N8
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		copy to root by id and modify
> +		COMMIT
>  
> -	M 040000 $tree ""
> -	M 644 inline foo/foo
> -	data <<EOF
> -	hello, world
> -	EOF
> -	INPUT_END
> -	git show N8:foo/foo >actual.foo &&
> -	git show N8:foo/bar >actual.bar &&
> -	test_cmp expect.foo actual.foo &&
> -	test_cmp expect.bar actual.bar
> -'
> +		M 040000 $tree '"$root"'
> +		M 644 inline foo/foo
> +		data <<EOF
> +		hello, world
> +		EOF
> +		INPUT_END
> +		git show N8:foo/foo >actual.foo &&
> +		git show N8:foo/bar >actual.bar &&
> +		test_cmp expect.foo actual.foo &&
> +		test_cmp expect.bar actual.bar
> +	'
>  
> -test_expect_success 'N: extract subtree' '
> -	branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
> -	cat >input <<-INPUT_END &&
> -	commit refs/heads/N9
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	extract subtree branch:newdir
> -	COMMIT
> +	test_expect_success "N: extract subtree to the root ($root)" '
> +		branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
> +		cat >input <<-INPUT_END &&
> +		commit refs/heads/N9
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		extract subtree branch:newdir
> +		COMMIT
>  
> -	M 040000 $branch ""
> -	C "newdir" ""
> -	INPUT_END
> -	git fast-import <input &&
> -	git diff --exit-code branch:newdir N9
> -'
> +		M 040000 $branch '"$root"'
> +		C "newdir" '"$root"'
> +		INPUT_END
> +		git fast-import <input &&
> +		git diff --exit-code branch:newdir N9
> +	'
>  
> -test_expect_success 'N: modify subtree, extract it, and modify again' '
> -	echo hello >expect.baz &&
> -	echo hello, world >expect.qux &&
> -	git fast-import <<-SETUP_END &&
> -	commit refs/heads/N10
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	hello, tree
> -	COMMIT
> +	test_expect_success "N: modify subtree, extract it to the root ($root), and modify again" '
> +		echo hello >expect.baz &&
> +		echo hello, world >expect.qux &&
> +		git fast-import <<-SETUP_END &&
> +		commit refs/heads/N10
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		hello, tree
> +		COMMIT
>  
> -	deleteall
> -	M 644 inline foo/bar/baz
> -	data <<EOF
> -	hello
> -	EOF
> -	SETUP_END
> +		deleteall
> +		M 644 inline foo/bar/baz
> +		data <<EOF
> +		hello
> +		EOF
> +		SETUP_END
>  
> -	tree=$(git rev-parse --verify N10:) &&
> -	git fast-import <<-INPUT_END &&
> -	commit refs/heads/N11
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	copy to root by id and modify
> -	COMMIT
> +		tree=$(git rev-parse --verify N10:) &&
> +		git fast-import <<-INPUT_END &&
> +		commit refs/heads/N11
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		copy to root by id and modify
> +		COMMIT
>  
> -	M 040000 $tree ""
> -	M 100644 inline foo/bar/qux
> -	data <<EOF
> -	hello, world
> -	EOF
> -	R "foo" ""
> -	C "bar/qux" "bar/quux"
> -	INPUT_END
> -	git show N11:bar/baz >actual.baz &&
> -	git show N11:bar/qux >actual.qux &&
> -	git show N11:bar/quux >actual.quux &&
> -	test_cmp expect.baz actual.baz &&
> -	test_cmp expect.qux actual.qux &&
> -	test_cmp expect.qux actual.quux'
> +		M 040000 $tree '"$root"'
> +		M 100644 inline foo/bar/qux
> +		data <<EOF
> +		hello, world
> +		EOF
> +		R "foo" '"$root"'
> +		C "bar/qux" "bar/quux"
> +		INPUT_END
> +		git show N11:bar/baz >actual.baz &&
> +		git show N11:bar/qux >actual.qux &&
> +		git show N11:bar/quux >actual.quux &&
> +		test_cmp expect.baz actual.baz &&
> +		test_cmp expect.qux actual.qux &&
> +		test_cmp expect.qux actual.quux
> +	'
> +done
>  
>  ###
>  ### series O
> @@ -3067,6 +3077,7 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
>  # There are two sorts of ways a path can be parsed, depending on whether it is
>  # the last field on the line. Additionally, ls without a <dataref> has a special
>  # case. Test every occurrence of <path> in the grammar against every error case.
> +# Paths for the root (empty strings) are tested elsewhere.
>  #
>  
>  #
> @@ -3314,16 +3325,19 @@ test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
>  ###
>  # Setup is carried over from series S.
>  
> -test_expect_success 'T: ls root tree' '
> -	sed -e "s/Z\$//" >expect <<-EOF &&
> -	040000 tree $(git rev-parse S^{tree})	Z
> -	EOF
> -	sha1=$(git rev-parse --verify S) &&
> -	git fast-import --import-marks=marks <<-EOF >actual &&
> -	ls $sha1 ""
> -	EOF
> -	test_cmp expect actual
> -'
> +for root in '""' ''
> +do
> +	test_expect_success "T: ls root ($root) tree" '
> +		sed -e "s/Z\$//" >expect <<-EOF &&
> +		040000 tree $(git rev-parse S^{tree})	Z
> +		EOF
> +		sha1=$(git rev-parse --verify S) &&
> +		git fast-import --import-marks=marks <<-EOF >actual &&
> +		ls $sha1 $root
> +		EOF
> +		test_cmp expect actual
> +	'
> +done
>  
>  test_expect_success 'T: delete branch' '
>  	git branch to-delete &&
> @@ -3425,30 +3439,33 @@ test_expect_success 'U: validate directory delete result' '
>  	compare_diff_raw expect actual
>  '
>  
> -test_expect_success 'U: filedelete root succeeds' '
> -	cat >input <<-INPUT_END &&
> -	commit refs/heads/U
> -	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> -	data <<COMMIT
> -	must succeed
> -	COMMIT
> -	from refs/heads/U^0
> -	D ""
> +for root in '""' ''
> +do
> +	test_expect_success "U: filedelete root ($root) succeeds" '
> +		cat >input <<-INPUT_END &&
> +		commit refs/heads/U-delete-root
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		must succeed
> +		COMMIT
> +		from refs/heads/U^0
> +		D '"$root"'
>  
> -	INPUT_END
> +		INPUT_END
>  
> -	git fast-import <input
> -'
> +		git fast-import <input
> +	'
>  
> -test_expect_success 'U: validate root delete result' '
> -	cat >expect <<-EOF &&
> -	:100644 000000 $f7id $ZERO_OID D	hello.c
> -	EOF
> +	test_expect_success "U: validate root ($root) delete result" '
> +		cat >expect <<-EOF &&
> +		:100644 000000 $f7id $ZERO_OID D	hello.c
> +		EOF
>  
> -	git diff-tree -M -r U^1 U >actual &&
> +		git diff-tree -M -r U U-delete-root >actual &&
>  
> -	compare_diff_raw expect actual
> -'
> +		compare_diff_raw expect actual
> +	'
> +done
>  
>  ###
>  ### series V (checkpoint)
> -- 
> 2.44.0
> 


* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  6:27     ` Patrick Steinhardt
@ 2024-04-10  8:18       ` Chris Torek
  2024-04-10  8:44         ` Thalia Archibald
  2024-04-10  9:12       ` Thalia Archibald
  1 sibling, 1 reply; 84+ messages in thread
From: Chris Torek @ 2024-04-10  8:18 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Thalia Archibald, git, Elijah Newren

On Tue, Apr 9, 2024 at 11:30 PM Patrick Steinhardt <ps@pks.im> wrote:
> > +             if (include_spaces)
> > +                     *endp = p + strlen(p);
> > +             else
> > +                     *endp = strchr(p, ' ');
> > +             strbuf_add(sb, p, *endp - p);
>
> strchr(3P) may return a NULL pointer in case there is no space, which
> would make us segfault here when dereferencing `*endp`. We should
> probably add a testcase that would hit this edge case.

Note that you can do:

    *endp = p + strcspn(p, " ");

(though `strcspn` is a fundamentally harder operation since it
takes a string argument). Everything depends on whether you
want to test for an explicit "there was no space at all" case of
course; performance considerations are secondary.
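
To make the difference concrete, here is a minimal sketch (not code from
the patch; `field_len` and `field_has_space` are hypothetical names).
`strchr(p, ' ')` returns NULL when there is no space, while
`strcspn(p, " ")` returns the length of the whole string in that case, so
`p + strcspn(p, " ")` always points inside the buffer:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: length of the leading field of `p`, up to but not
 * including the first space, or the whole string when there is no space.
 */
static size_t field_len(const char *p)
{
	return strcspn(p, " ");
}

/*
 * Hypothetical helper: 1 if the field was terminated by a space,
 * 0 if it ran to the end of the string.
 */
static int field_has_space(const char *p)
{
	return p[field_len(p)] == ' ';
}
```

A caller that wants the explicit "there was no space at all" case can
still get it by looking at the byte the result points to, as
`field_has_space` does.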

Chris

* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  8:18       ` Chris Torek
@ 2024-04-10  8:44         ` Thalia Archibald
  2024-04-10  8:51           ` Chris Torek
  0 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  8:44 UTC (permalink / raw)
  To: Chris Torek; +Cc: Patrick Steinhardt, git, Elijah Newren

On Apr 10, 2024, at 01:18, Chris Torek <chris.torek@gmail.com> wrote:
> On Tue, Apr 9, 2024 at 11:30 PM Patrick Steinhardt <ps@pks.im> wrote:
>>> +             if (include_spaces)
>>> +                     *endp = p + strlen(p);
>>> +             else
>>> +                     *endp = strchr(p, ' ');
>>> +             strbuf_add(sb, p, *endp - p);
>> 
>> strchr(3P) may return a NULL pointer in case there is no space, which
>> would make us segfault here when dereferencing `*endp`. We should
>> probably add a testcase that would hit this edge case.
> 
> Note that you can do:
> 
>    *endp = p + strcspn(p, " ");
> 
> (though `strcspn` is a fundamentally harder operation since it
> takes a string argument). Everything depends on whether you
> want to test for an explicit "there was no space at all" case of
> course; performance considerations are secondary.

I thought strchr returned a pointer to the terminating NUL byte if the needle
was not found. Turns out it does return NULL in that case, as you say. strchrnul
does what I want here and I’ve replaced it with that.

I’ve added a test covering this case.

Thalia



* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  8:44         ` Thalia Archibald
@ 2024-04-10  8:51           ` Chris Torek
  2024-04-10  9:14             ` Thalia Archibald
  2024-04-10  9:16             ` Thalia Archibald
  0 siblings, 2 replies; 84+ messages in thread
From: Chris Torek @ 2024-04-10  8:51 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: Patrick Steinhardt, git, Elijah Newren

On Wed, Apr 10, 2024 at 1:47 AM Thalia Archibald <thalia@archibald.dev> wrote:
> strchrnul does what I want here and I’ve replaced it with that.

`strchrnul` is a GNU extension (found on a lot of systems, but not
part of C90 or C99).
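
Where `strchrnul` cannot be assumed, the same behaviour takes only a few
lines of standard C. A minimal sketch, with the hypothetical name
`find_char_or_end` chosen so that it does not clash with a libc that
already provides `strchrnul`:

```c
#include <string.h>

/*
 * Hypothetical portable stand-in for strchrnul(): return a pointer to the
 * first occurrence of `c` in `s`, or to the terminating NUL byte when `c`
 * does not occur. Unlike strchr(), the result is never NULL.
 */
static const char *find_char_or_end(const char *s, int c)
{
	while (*s && *s != (char)c)
		s++;
	return s;
}
```

With this, `find_char_or_end(p, ' ') - p` is always a valid length for the
leading field, whether or not a space follows it.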

Chris

* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  6:27     ` Patrick Steinhardt
  2024-04-10  8:18       ` Chris Torek
@ 2024-04-10  9:12       ` Thalia Archibald
  1 sibling, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Elijah Newren

(Sorry for re-sending)

On Apr 9, 2024, at 23:27, Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Apr 01, 2024 at 09:02:47AM +0000, Thalia Archibald wrote:
>> 
>> - if (!*endp)
>> + if (!p)
>> die("Missing dest: %s", command_buf.buf);
> 
> So this statement right now doesn't make a whole lot of sense because
> `p` cannot ever be `NULL` -- we'd segfault before that. Once we update
> `parse_path()` to handle this correctly it will work as expected though.
> 
> I was briefly wondering though whether we really want `parse_path()` to
> set `p` to be a NULL pointer. If we didn't, we could retain the previous
> behaviour here and instead check for `!*p`.

Good catch. There should be a deref there.

This mistake was because I originally planned to not allow unquoted empty
strings and had factored that condition into parse_path. After your round 1
feedback, I changed my mind after reanalysis. The condition you see here is
supposed to match the previous behavior and is removed in patch 3/8. There was
no test before my series exercising this branch and my test for it is added in
3/8, so it wasn't caught in this intermediate version.

>> + ( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
>> +   printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
> 
> Also, there is no need to do `'"$unquoted_path"'` here. You should be
> able to refer to `$unquoted_path` just fine even without unquoting again
> because we use eval to execute the code block. In fact, it can even be
> harmful as it is known to break shells under some circumstances. See
> also 7c4449eb31 (t/README: document how to loop around test cases,
> 2024-03-22), which I think should apply in your case, too.

I agree it makes it less finicky. The one upside to string splicing is that when
a test fails, the substitutions are visible in the dump of the shell script. I
found that useful while debugging. The titles can uniquely identify the
$prefix/$path/$suffix values when looking in the source, since they're all
1-to-1. Considering the downsides, I've switched to plain substitutions. 

Thalia

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  8:51           ` Chris Torek
@ 2024-04-10  9:14             ` Thalia Archibald
  2024-04-10  9:42               ` Patrick Steinhardt
  2024-04-10  9:16             ` Thalia Archibald
  1 sibling, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:14 UTC (permalink / raw)
  To: Chris Torek; +Cc: Patrick Steinhardt, git, Elijah Newren

On Apr 10, 2024, at 01:51, Chris Torek <chris.torek@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 1:47 AM Thalia Archibald <thalia@archibald.dev> wrote:
>> strchrnul does what I want here and I’ve replaced it with that.
> 
> `strchrnul` is a GNU extension (found on a lot of systems, but not
> part of C90 or C99).

I can’t speak to Git standards, but it seems broadly used in Git, including
three times already in fast-import:

$ rg --count-matches --sort=path strchrnul
add-patch.c:1
advice.c:1
apply.c:2
archive.c:1
attr.c:1
builtin/am.c:1
builtin/fast-export.c:5
builtin/fast-import.c:4
builtin/stash.c:1
cache-tree.c:2
commit.c:5
compat/mingw.c:1
compat/terminal.c:1
config.c:1
diff.c:1
fmt-merge-msg.c:2
fsck.c:2
git-compat-util.h:3
git.c:1
gpg-interface.c:8
graph.c:1
help.c:1
http.c:2
ident.c:2
log-tree.c:1
mailmap.c:1
match-trees.c:1
notes.c:1
object-file.c:2
parse-options.c:2
path.c:1
pretty.c:2
ref-filter.c:5
refs/debug.c:1
remote-curl.c:1
remote.c:1
run-command.c:1
scalar.c:1
sequencer.c:5
strbuf.c:1
trailer.c:1
transport-helper.c:1
utf8.c:1
ws.c:1
wt-status.c:1

Thalia

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  8:51           ` Chris Torek
  2024-04-10  9:14             ` Thalia Archibald
@ 2024-04-10  9:16             ` Thalia Archibald
  1 sibling, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:16 UTC (permalink / raw)
  To: Chris Torek; +Cc: Patrick Steinhardt, git, Elijah Newren

On Apr 10, 2024, at 02:14, Thalia Archibald <thalia@archibald.dev> wrote:
> On Apr 10, 2024, at 01:51, Chris Torek <chris.torek@gmail.com> wrote:
>> On Wed, Apr 10, 2024 at 1:47 AM Thalia Archibald <thalia@archibald.dev> wrote:
>>> strchrnul does what I want here and I’ve replaced it with that.
>> 
>> `strchrnul` is a GNU extension (found on a lot of systems, but not
>> part of C90 or C99).
> 
> I can’t speak to Git standards, but it seems broadly used in Git, including
> three times already in fast-import.

… and that would be because it is supplied when unavailable:

git-compat-util.h

#ifndef HAVE_STRCHRNUL
#define strchrnul gitstrchrnul
static inline char *gitstrchrnul(const char *s, int c)
{
	while (*s && *s != c)
		s++;
	return (char *)s;
}
#endif

Thalia

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 1/8] fast-import: tighten path unquoting
  2024-04-10  9:14             ` Thalia Archibald
@ 2024-04-10  9:42               ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-10  9:42 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: Chris Torek, git, Elijah Newren

On Wed, Apr 10, 2024 at 09:14:16AM +0000, Thalia Archibald wrote:
> On Apr 10, 2024, at 01:51, Chris Torek <chris.torek@gmail.com> wrote:
> > On Wed, Apr 10, 2024 at 1:47 AM Thalia Archibald <thalia@archibald.dev> wrote:
> >> strchrnul does what I want here and I’ve replaced it with that.
> > 
> > `strchrnul` is a GNU extension (found on a lot of systems, but not
> > part of C90 or C99).
> 
> I can’t speak to Git standards, but it seems broadly used in Git, including
> three times already in fast-import:

It's fine to use `strchrnul()` in Git. In case libc doesn't provide it
we have a fallback implementation in "git-compat-util.h".

Patrick

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH v3 0/8] fast-import: tighten parsing of paths
  2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
                     ` (8 preceding siblings ...)
  2024-04-07 21:19   ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-04-10  9:54   ` Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 1/8] fast-import: tighten path unquoting Thalia Archibald
                       ` (8 more replies)
  9 siblings, 9 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:54 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case is when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
>
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.

Updated to address review comments. Thanks, Patrick!

Changes since v2:
* Fix NUL overrun by replacing `strchr(p, ' ')` with `strchrnul(p, ' ')` in
  patch 1/8
* Fix "Missing dest" error condition in patch 1/8
* Test missing space after unquoted path
* Substitute shell parameters in test_expect_success call, instead of with
  string splicing
* Reformat (-subshells
* Rewrap long lines in `parse_path` and `parse_path_space`

Hopefully, this series sends without any rewrapped lines. I use Proton Mail via
Proton Mail Bridge and Apple Mail. I have no idea how to control this, or if I
even can, and see no relevant-looking settings in any of the three. In v2 and
now v3, I only manually modified the cover letter after using format-patch, not
any of the others.

Thalia


Thalia Archibald (8):
  fast-import: tighten path unquoting
  fast-import: directly use strbufs for paths
  fast-import: allow unquoted empty path for root
  fast-import: remove dead strbuf
  fast-import: improve documentation for unquoted paths
  fast-import: document C-style escapes for paths
  fast-import: forbid escaped NUL in paths
  fast-import: make comments more precise

 Documentation/git-fast-import.txt |  30 +-
 builtin/fast-import.c             | 158 ++++----
 t/t9300-fast-import.sh            | 624 +++++++++++++++++++++---------
 3 files changed, 550 insertions(+), 262 deletions(-)

Range-diff against v2:
1:  e790bdf714 ! 1:  d9ab0c6a75 fast-import: tighten path unquoting
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     + * or unquoted without escape sequences. When unquoted, it may only contain a
     + * space if `include_spaces` is nonzero.
     + */
    -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
    ++static void parse_path(struct strbuf *sb, const char *p, const char **endp,
    ++		int include_spaces, const char *field)
     +{
     +	if (*p == '"') {
     +		if (unquote_c_style(sb, p, endp))
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     +		if (include_spaces)
     +			*endp = p + strlen(p);
     +		else
    -+			*endp = strchr(p, ' ');
    ++			*endp = strchrnul(p, ' ');
     +		strbuf_add(sb, p, *endp - p);
     +	}
     +}
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     + * It may not contain spaces when unquoted. Update *endp to point to the first
     + * character after the space.
     + */
    -+static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
    ++static void parse_path_space(struct strbuf *sb, const char *p,
    ++		const char **endp, const char *field)
     +{
     +	parse_path(sb, p, endp, 0, field);
     +	if (**endp != ' ')
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      
     -	endp++;
     -	if (!*endp)
    -+	if (!p)
    ++	if (!*p)
      		die("Missing dest: %s", command_buf.buf);
     -
     -	d = endp;
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filemodify
     +		COMMIT
     +		from :301
    -+		M 100644 :402 '"$path"'
    ++		M 100644 :402 $path
     +
     +		commit refs/heads/S-path-eol
     +		mark :303
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filedelete
     +		COMMIT
     +		from :302
    -+		D '"$path"'
    ++		D $path
     +
     +		commit refs/heads/S-path-eol
     +		mark :304
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filecopy dest
     +		COMMIT
     +		from :301
    -+		C hello.c '"$path"'
    ++		C hello.c $path
     +
     +		commit refs/heads/S-path-eol
     +		mark :305
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filerename dest
     +		COMMIT
     +		from :301
    -+		R hello.c '"$path"'
    ++		R hello.c $path
     +
    -+		ls :305 '"$path"'
    ++		ls :305 $path
     +		EOF
     +
     +		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
     +		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
     +
    -+		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
    -+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
    ++		(
    ++			printf "100644 blob $blob2\t$unquoted_path\n" &&
    ++			printf "100644 blob $blob1\thello.c\n"
    ++		) | sort >tree_m.exp &&
     +		git ls-tree $commit_m | sort >tree_m.out &&
     +		test_cmp tree_m.exp tree_m.out &&
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		git ls-tree $commit_d >tree_d.out &&
     +		test_cmp tree_d.exp tree_d.out &&
     +
    -+		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
    -+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
    ++		(
    ++			printf "100644 blob $blob1\t$unquoted_path\n" &&
    ++			printf "100644 blob $blob1\thello.c\n"
    ++		) | sort >tree_c.exp &&
     +		git ls-tree $commit_c | sort >tree_c.out &&
     +		test_cmp tree_c.exp tree_c.out &&
     +
    -+		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
    ++		printf "100644 blob $blob1\t$unquoted_path\n" >tree_r.exp &&
     +		git ls-tree $commit_r >tree_r.out &&
     +		test_cmp tree_r.exp tree_r.out &&
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		data <<COMMIT
     +		initial commit
     +		COMMIT
    -+		M 100644 :401 '"$path"'
    ++		M 100644 :401 $path
     +
     +		commit refs/heads/S-path-space
     +		mark :302
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filecopy source
     +		COMMIT
     +		from :301
    -+		C '"$path"' hello2.c
    ++		C $path hello2.c
     +
     +		commit refs/heads/S-path-space
     +		mark :303
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit filerename source
     +		COMMIT
     +		from :301
    -+		R '"$path"' hello2.c
    ++		R $path hello2.c
     +
     +		EOF
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
     +		blob=$(grep :401 marks.out | cut -d\  -f2) &&
     +
    -+		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
    -+		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
    ++		(
    ++			printf "100644 blob $blob\t$unquoted_path\n" &&
    ++			printf "100644 blob $blob\thello2.c\n"
    ++		) | sort >tree_c.exp &&
     +		git ls-tree $commit_c | sort >tree_c.out &&
     +		test_cmp tree_c.exp tree_c.out &&
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		commit with bad path
     +		COMMIT
     +		from :2
    -+		'"$prefix$path$suffix"'
    ++		$prefix$path$suffix
     +		EOF
     +
    -+		test_grep '"'$err_grep'"' err
    ++		test_grep "$err_grep" err
     +	'
     +}
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
     +}
     +test_path_eol_quoted_fail () {
    -+	local change="$1" prefix="$2" field="$3" suffix="$4"
    -+	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    -+	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
    -+	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
    ++	local change="$1" prefix="$2" field="$3"
    ++	test_path_base_fail "$change" "$prefix" "$field" ''
    ++	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"' 'x' "Garbage after $field"
    ++	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c"' ' ' "Garbage after $field"
     +}
     +test_path_eol_fail () {
    -+	local change="$1" prefix="$2" field="$3" suffix="$4"
    -+	test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
    ++	local change="$1" prefix="$2" field="$3"
    ++	test_path_eol_quoted_fail "$change" "$prefix" "$field"
     +}
     +test_path_space_fail () {
    -+	local change="$1" prefix="$2" field="$3" suffix="$4"
    -+	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    -+	test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
    ++	local change="$1" prefix="$2" field="$3"
    ++	test_path_base_fail "$change" "$prefix" "$field" ' world.c'
    ++	test_path_fail "$change" "missing space after quoted $field"   "$prefix" '"hello.c"' 'x world.c' "Missing space after $field"
    ++	test_path_fail "$change" "missing space after unquoted $field" "$prefix" 'hello.c'   ''          "Missing space after $field"
     +}
     +
    -+test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
    -+test_path_eol_fail   filedelete       'D '           path   ''
    -+test_path_space_fail filecopy         'C '           source ' world.c'
    -+test_path_eol_fail   filecopy         'C hello.c '   dest   ''
    -+test_path_space_fail filerename       'R '           source ' world.c'
    -+test_path_eol_fail   filerename       'R hello.c '   dest   ''
    -+test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
    ++test_path_eol_fail   filemodify       'M 100644 :1 ' path
    ++test_path_eol_fail   filedelete       'D '           path
    ++test_path_space_fail filecopy         'C '           source
    ++test_path_eol_fail   filecopy         'C hello.c '   dest
    ++test_path_space_fail filerename       'R '           source
    ++test_path_eol_fail   filerename       'R hello.c '   dest
    ++test_path_eol_fail   'ls (in commit)' 'ls :2 '       path
     +
     +# When 'ls' has no <dataref>, the <path> must be quoted.
    -+test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
    ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
     +
      ###
      ### series T (ls)
2:  82a6f53c13 ! 2:  696ca27bb7 fast-import: directly use strbufs for paths
    @@ Commit message
         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
     
      ## builtin/fast-import.c ##
    -@@ builtin/fast-import.c: static void parse_path_space(struct strbuf *sb, const char *p, const char **endp
    +@@ builtin/fast-import.c: static void parse_path_space(struct strbuf *sb, const char *p,
      
      static void file_change_m(const char *p, struct branch *b)
      {
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     +	strbuf_reset(&source);
     +	parse_path_space(&source, p, &p, "source");
      
    - 	if (!p)
    + 	if (!*p)
      		die("Missing dest: %s", command_buf.buf);
     -	strbuf_reset(&d_uq);
     -	parse_path_eol(&d_uq, p, "dest");
3:  893bbf5e73 ! 3:  39879d0a66 fast-import: allow unquoted empty path for root
    @@ Commit message
     
      ## builtin/fast-import.c ##
     @@ builtin/fast-import.c: static void file_change_cr(const char *p, struct branch *b, int rename)
    - 	struct tree_entry leaf;
      
      	strbuf_reset(&source);
    --	parse_path_space(&source, p, &p, "source");
    + 	parse_path_space(&source, p, &p, "source");
     -
    --	if (!p)
    +-	if (!*p)
     -		die("Missing dest: %s", command_buf.buf);
      	strbuf_reset(&dest);
    -+	parse_path_space(&source, p, &p, "source");
      	parse_path_eol(&dest, p, "dest");
      
    - 	memset(&leaf, 0, sizeof(leaf));
     
      ## t/t9300-fast-import.sh ##
     @@ t/t9300-fast-import.sh: test_expect_success 'M: rename subdirectory to new subdirectory' '
    @@ t/t9300-fast-import.sh: test_expect_success 'M: rename subdirectory to new subdi
     -	from refs/heads/M2^0
     -	R "" sub
     +		from refs/heads/M2^0
    -+		R '"$root"' sub
    ++		R $root sub
      
     -	INPUT_END
     +		INPUT_END
    @@ t/t9300-fast-import.sh: test_expect_success PIPE 'N: empty directory reads as mi
     -	compare_diff_raw expect actual
     -'
     +		from refs/heads/branch^0
    -+		M 040000 $root_tree '"$root"'
    ++		M 040000 $root_tree $root
     +		INPUT_END
     +		git fast-import <input &&
     +		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
    @@ t/t9300-fast-import.sh: test_expect_success PIPE 'N: empty directory reads as mi
     -	compare_diff_raw expect actual
     -'
     +		from refs/heads/branch^0
    -+		C '"$root"' oldroot
    ++		C $root oldroot
     +		INPUT_END
     +		git fast-import <input &&
     +		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
    @@ t/t9300-fast-import.sh: test_expect_success 'N: reject foo/ syntax in ls argumen
     -	test_cmp expect.foo actual.foo &&
     -	test_cmp expect.bar actual.bar
     -'
    -+		M 040000 $tree '"$root"'
    ++		M 040000 $tree $root
     +		M 644 inline foo/foo
     +		data <<EOF
     +		hello, world
    @@ t/t9300-fast-import.sh: test_expect_success 'N: reject foo/ syntax in ls argumen
     -	git fast-import <input &&
     -	git diff --exit-code branch:newdir N9
     -'
    -+		M 040000 $branch '"$root"'
    -+		C "newdir" '"$root"'
    ++		M 040000 $branch $root
    ++		C "newdir" $root
     +		INPUT_END
     +		git fast-import <input &&
     +		git diff --exit-code branch:newdir N9
    @@ t/t9300-fast-import.sh: test_expect_success 'N: reject foo/ syntax in ls argumen
     -	test_cmp expect.baz actual.baz &&
     -	test_cmp expect.qux actual.qux &&
     -	test_cmp expect.qux actual.quux'
    -+		M 040000 $tree '"$root"'
    ++		M 040000 $tree $root
     +		M 100644 inline foo/bar/qux
     +		data <<EOF
     +		hello, world
     +		EOF
    -+		R "foo" '"$root"'
    ++		R "foo" $root
     +		C "bar/qux" "bar/quux"
     +		INPUT_END
     +		git show N11:bar/baz >actual.baz &&
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
      #
      
      #
    -@@ t/t9300-fast-import.sh: test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
    +@@ t/t9300-fast-import.sh: test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
      ###
      # Setup is carried over from series S.
      
    @@ t/t9300-fast-import.sh: test_expect_success 'U: validate directory delete result
     +		must succeed
     +		COMMIT
     +		from refs/heads/U^0
    -+		D '"$root"'
    ++		D $root
      
     -	INPUT_END
     +		INPUT_END
4:  cb05a184e6 = 4:  1cef05e59a fast-import: remove dead strbuf
5:  1f34b632d7 = 5:  2e78690023 fast-import: improve documentation for unquoted paths
6:  82a4da68af = 6:  1b07ddffe0 fast-import: document C-style escapes for paths
7:  c087c6a860 ! 7:  dc67464b6a fast-import: forbid escaped NUL in paths
    @@ Documentation/git-fast-import.txt: and must be in canonical form. That is it mus
      `filedelete`
     
      ## builtin/fast-import.c ##
    -@@ builtin/fast-import.c: static void parse_path(struct strbuf *sb, const char *p, const char **endp, int
    +@@ builtin/fast-import.c: static void parse_path(struct strbuf *sb, const char *p, const char **endp,
      	if (*p == '"') {
      		if (unquote_c_style(sb, p, endp))
      			die("Invalid %s: %s", field, command_buf.buf);
    @@ t/t9300-fast-import.sh: test_path_base_fail () {
     +	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
      }
      test_path_eol_quoted_fail () {
    - 	local change="$1" prefix="$2" field="$3" suffix="$4"
    + 	local change="$1" prefix="$2" field="$3"
8:  a503c55b83 = 8:  5e02d887bc fast-import: make comments more precise
-- 
2.44.0



^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH v3 1/8] fast-import: tighten path unquoting
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
@ 2024-04-10  9:55     ` Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 2/8] fast-import: directly use strbufs for paths Thalia Archibald
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:55 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed or not checked.

<path> appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes to the end of
   the line (including any number of SP).
2. For filecopy (source) and filerename (source):
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes up to, but not
   including, the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is forbidden.
4. For ls:
   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).
5. For ls-commit:
   Unquote <path> and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine instead of matching Git's, which could silently
introduce escapes that would be incorrectly parsed due to this and lead
to data corruption.
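
As an illustration only, the difference between suppressing and reporting unquoting errors can be sketched as follows. This is not Git's `unquote_c_style`: `toy_unquote`, `copy_literal`, `parse_lenient` and `parse_strict` are hypothetical names, the toy unquoter understands only a handful of escapes, and it does not check what follows the closing quote.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy unquoter (hypothetical): understands only \\, \", \n and \t.
 * Writes the result to `out` and returns its length, or -1 on a missing
 * quote, an unrecognized escape, or a buffer that is too small.
 */
static int toy_unquote(const char *in, char *out, size_t outsz)
{
	size_t n = 0;

	assert(in != NULL && out != NULL);
	if (outsz == 0 || *in++ != '"')
		return -1;
	while (*in && *in != '"') {
		char c = *in++;

		if (c == '\\') {
			switch (*in++) {
			case '\\': c = '\\'; break;
			case '"':  c = '"';  break;
			case 'n':  c = '\n'; break;
			case 't':  c = '\t'; break;
			default:   return -1; /* e.g. \x is not recognized */
			}
		}
		if (n + 1 >= outsz)
			return -1;
		out[n++] = c;
	}
	if (*in != '"')
		return -1; /* unterminated string */
	out[n] = '\0';
	return (int)n;
}

/* Copy `in` verbatim; return its length, or -1 if it does not fit. */
static int copy_literal(const char *in, char *out, size_t outsz)
{
	size_t len = strlen(in);

	if (len >= outsz)
		return -1;
	memcpy(out, in, len + 1);
	return (int)len;
}

/* Lenient: an unquoting failure silently falls back to literal bytes. */
static int parse_lenient(const char *in, char *out, size_t outsz)
{
	int n = toy_unquote(in, out, outsz);

	return n >= 0 ? n : copy_literal(in, out, outsz);
}

/* Strict: a leading '"' commits to quoted parsing; failure is an error. */
static int parse_strict(const char *in, char *out, size_t outsz)
{
	if (*in == '"')
		return toy_unquote(in, out, outsz);
	return copy_literal(in, out, outsz);
}
```

Given the six-byte input `"\xff"`, `parse_lenient` stores all six bytes unchanged, quotes included, while `parse_strict` returns -1 so that the caller can report the error.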

The documentation states “To use a source path that contains SP the path
must be quoted.”, so it is expected that some implementations depend on
spaces being allowed in paths in the final position. Thus we have two
documented ways to parse paths, so simplify the implementation to that.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).

2. `parse_path_space` for filecopy (source) and filerename (source):

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes up to, but not including, the
   next SP. It must be followed by SP.

There remain two special cases: The dest <path> in filecopy and
filerename cannot be an unquoted empty string (this will be addressed
subsequently) and <path> in ls-commit must be quoted to disambiguate it
from ls.
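
For unquoted paths, the two modes can be sketched as below. This is a simplified illustration rather than the code in this patch: `split_unquoted_cr` is a hypothetical helper, it uses fixed-size buffers instead of strbufs, and it does not handle quoted paths at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split the argument of a copy/rename command whose paths are both
 * unquoted, e.g. "hello.c my file.c". The source stops at the first SP;
 * the dest runs to the end of the line and may itself contain SP.
 * Returns 0 on success, or -1 if the separating space is missing or a
 * buffer is too small.
 */
static int split_unquoted_cr(const char *arg,
		char *src, size_t srcsz, char *dst, size_t dstsz)
{
	const char *sp;
	size_t srclen, dstlen;

	assert(arg != NULL && src != NULL && dst != NULL);
	sp = strchr(arg, ' ');
	if (!sp)
		return -1; /* corresponds to "Missing space after source" */
	srclen = (size_t)(sp - arg);
	dstlen = strlen(sp + 1);
	if (srclen >= srcsz || dstlen >= dstsz)
		return -1;
	memcpy(src, arg, srclen);
	src[srclen] = '\0';
	memcpy(dst, sp + 1, dstlen + 1);
	return 0;
}
```

With the argument `hello.c my file.c`, the source becomes `hello.c` and the dest becomes `my file.c`. A source path containing a space cannot be expressed this way, which is why it has to be quoted.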

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  | 104 ++++++++++-------
 t/t9300-fast-import.sh | 258 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 318 insertions(+), 44 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 782bda007c..ce9231afe6 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2258,10 +2258,56 @@ static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. It may be quoted with escape sequences
+ * or unquoted without escape sequences. When unquoted, it may only contain a
+ * space if `include_spaces` is nonzero.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp,
+		int include_spaces, const char *field)
+{
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		if (include_spaces)
+			*endp = p + strlen(p);
+		else
+			*endp = strchrnul(p, ' ');
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. It may contain spaces even when unquoted.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * It may not contain spaces when unquoted. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p,
+		const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2299,11 +2345,8 @@ static void file_change_m(const char *p, struct branch *b)
 	}
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2367,48 +2410,29 @@ static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
 	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
 	s = s_uq.buf;
 
-	endp++;
-	if (!*endp)
+	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
 	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	parse_path_eol(&d_uq, p, "dest");
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3152,6 +3176,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
+	static struct strbuf uq = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3168,16 +3193,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	strbuf_reset(&uq);
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 60e30fed3c..de2f1304e8 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2142,6 +2142,7 @@ test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2790,7 +2791,7 @@ test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3060,6 +3061,261 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <dataref> has a special
+# case. Test every occurrence of <path> in the grammar against every error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c (for setup)
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		test_when_finished "git branch -D S-path-eol" &&
+
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/S-path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/S-path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 $path
+
+		commit refs/heads/S-path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D $path
+
+		commit refs/heads/S-path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c $path
+
+		commit refs/heads/S-path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c $path
+
+		ls :305 $path
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob2\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		(
+			printf "100644 blob $blob1\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t$unquoted_path\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path (for setup)
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		test_when_finished "git branch -D S-path-space" &&
+
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 $path
+
+		commit refs/heads/S-path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C $path hello2.c
+
+		commit refs/heads/S-path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R $path hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob\t$unquoted_path\n" &&
+			printf "100644 blob $blob\thello2.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		$prefix$path$suffix
+		EOF
+
+		test_grep "$err_grep" err
+	'
+}
+
+test_path_base_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
+	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ''
+	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"' 'x' "Garbage after $field"
+	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c"' ' ' "Garbage after $field"
+}
+test_path_eol_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_eol_quoted_fail "$change" "$prefix" "$field"
+}
+test_path_space_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ' world.c'
+	test_path_fail "$change" "missing space after quoted $field"   "$prefix" '"hello.c"' 'x world.c' "Missing space after $field"
+	test_path_fail "$change" "missing space after unquoted $field" "$prefix" 'hello.c'   ''          "Missing space after $field"
+}
+
+test_path_eol_fail   filemodify       'M 100644 :1 ' path
+test_path_eol_fail   filedelete       'D '           path
+test_path_space_fail filecopy         'C '           source
+test_path_eol_fail   filecopy         'C hello.c '   dest
+test_path_space_fail filerename       'R '           source
+test_path_eol_fail   filerename       'R hello.c '   dest
+test_path_eol_fail   'ls (in commit)' 'ls :2 '       path
+
+# When 'ls' has no <dataref>, the <path> must be quoted.
+test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
+
 ###
 ### series T (ls)
 ###
-- 
2.44.0




* [PATCH v3 2/8] fast-import: directly use strbufs for paths
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-10  9:55     ` Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:55 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.

Directly use the strbufs now instead of an alias.

Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.

Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 64 ++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index ce9231afe6..8f6312fbaf 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2307,7 +2307,7 @@ static void parse_path_space(struct strbuf *sb, const char *p,
 
 static void file_change_m(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2344,13 +2344,12 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
 
 	/* Git does not track empty, non-toplevel directories. */
-	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
-		tree_content_remove(&b->branch_tree, p, NULL, 0);
+	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
+		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
 		return;
 	}
 
@@ -2371,10 +2370,6 @@ static void file_change_m(const char *p, struct branch *b)
 		if (S_ISDIR(mode))
 			die("Directories cannot be specified 'inline': %s",
 				command_buf.buf);
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		while (read_next_command() != EOF) {
 			const char *v;
 			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
@@ -2400,55 +2395,51 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*p) {
+	if (!*path.buf) {
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
 		return;
 	}
-	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
+	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
 }
 
 static void file_change_d(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_remove(&b->branch_tree, p, NULL, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *s, *d;
-	static struct strbuf s_uq = STRBUF_INIT;
-	static struct strbuf d_uq = STRBUF_INIT;
+	static struct strbuf source = STRBUF_INIT;
+	static struct strbuf dest = STRBUF_INIT;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	parse_path_space(&s_uq, p, &p, "source");
-	s = s_uq.buf;
+	strbuf_reset(&source);
+	parse_path_space(&source, p, &p, "source");
 
 	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-	strbuf_reset(&d_uq);
-	parse_path_eol(&d_uq, p, "dest");
-	d = d_uq.buf;
+	strbuf_reset(&dest);
+	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
-		tree_content_remove(&b->branch_tree, s, &leaf, 1);
+		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
 	else
-		tree_content_get(&b->branch_tree, s, &leaf, 1);
+		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
-		die("Path %s not in branch", s);
-	if (!*d) {	/* C "path/to/subdir" "" */
+		die("Path %s not in branch", source.buf);
+	if (!*dest.buf) {	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
 		return;
 	}
-	tree_content_set(&b->branch_tree, d,
+	tree_content_set(&b->branch_tree, dest.buf,
 		&leaf.versions[1].oid,
 		leaf.versions[1].mode,
 		leaf.tree);
@@ -3176,7 +3167,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3193,10 +3184,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_get(root, p, &leaf, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_get(root, path.buf, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
 	 * until it is saved.  Save, for simplicity.
@@ -3204,7 +3194,7 @@ static void parse_ls(const char *p, struct branch *b)
 	if (S_ISDIR(leaf.versions[1].mode))
 		store_tree(&leaf);
 
-	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
+	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
 	if (leaf.tree)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
-- 
2.44.0




* [PATCH v3 3/8] fast-import: allow unquoted empty path for root
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 1/8] fast-import: tighten path unquoting Thalia Archibald
  2024-04-10  9:55     ` [PATCH v3 2/8] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-04-10  9:55     ` Thalia Archibald
  2024-04-11 19:59       ` Junio C Hamano
  2024-04-10  9:55     ` [PATCH v3 4/8] fast-import: remove dead strbuf Thalia Archibald
                       ` (5 subsequent siblings)
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:55 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

Ever since filerename was added in f39a946a1f (Support wholesale
directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
(Teach fast-import to recursively copy files/directories, 2007-07-15),
both have produced an error when the destination path is empty. Later,
when support for targeting the root directory with an empty string was
added in 2794ad5244 (fast-import: Allow filemodify to set the root,
2010-10-10), this had the effect of allowing the quoted empty string
(`""`), but forbidding its unquoted variant (``). This seems to have
been intended as simple data validation for parsing two paths, rather
than a syntax restriction, because it was not extended to the other
operations.

All other occurrences of paths (in filemodify, filedelete, the source of
filecopy and filerename, and ls) allow both.

For most of this feature's lifetime, the documentation has not
prescribed the use of quoted empty strings. In e5959106d6
(Documentation/fast-import: put explanation of M 040000 <dataref> "" in
context, 2011-01-15), its documentation was changed from “`<path>` may
also be an empty string (`""`) to specify the root of the tree” to “The
root of the tree can be represented by an empty string as `<path>`”.

Thus, we can assume that some front-ends have depended on this behavior.

Remove this restriction for the destination paths of filecopy and
filerename and change tests targeting the root to test `""` and ``.
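
As an illustration only (this is not the code in this patch), the
resulting rule for a path at the end of a line could be sketched in C
as below. The names `sketch_parse_path_eol` and `sketch_path_is_root`
are hypothetical, and the unquoting is deliberately reduced to three
escapes (`\\`, `\"`, `\n`), whereas fast-import's real unquoting
accepts more. The point is only that `""` and `` both yield the empty
path, which names the root.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified, hypothetical stand-in for an end-of-line path parser.
 * Unquoted paths are copied verbatim; quoted paths are unquoted with a
 * reduced escape set. Returns 0 on success, -1 on malformed input or
 * if the path does not fit in `out`.
 */
static int sketch_parse_path_eol(const char *p, char *out, size_t outsz)
{
	size_t n = 0;

	assert(p && out && outsz > 0);
	if (*p != '"') {
		/* Unquoted: literal bytes up to the end of the line. */
		for (; *p; p++) {
			if (n + 1 >= outsz)
				return -1;
			out[n++] = *p;
		}
		out[n] = '\0';
		return 0;
	}
	p++;	/* skip the opening quote */
	while (*p && *p != '"') {
		char c = *p++;
		if (c == '\\') {
			switch (*p++) {
			case '\\': c = '\\'; break;
			case '"': c = '"'; break;
			case 'n': c = '\n'; break;
			default: return -1;	/* escape not handled by this sketch */
			}
		}
		if (n + 1 >= outsz)
			return -1;
		out[n++] = c;
	}
	if (*p != '"')
		return -1;	/* unclosed quote */
	if (p[1])
		return -1;	/* garbage after the closing quote */
	out[n] = '\0';
	return 0;
}

/* Returns 1 if `p` names the root (empty path), 0 if not, -1 if malformed. */
static int sketch_path_is_root(const char *p)
{
	char buf[4096];

	if (sketch_parse_path_eol(p, buf, sizeof(buf)) < 0)
		return -1;
	return buf[0] == '\0';
}
```

With this sketch, a destination written as `""` and one left empty are
indistinguishable after parsing, which is the behavior the tests now
exercise for both spellings.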

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  |   3 -
 t/t9300-fast-import.sh | 363 +++++++++++++++++++++--------------------
 2 files changed, 190 insertions(+), 176 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 8f6312fbaf..0da7e8a5a5 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2419,9 +2419,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 	strbuf_reset(&source);
 	parse_path_space(&source, p, &p, "source");
-
-	if (!*p)
-		die("Missing dest: %s", command_buf.buf);
 	strbuf_reset(&dest);
 	parse_path_eol(&dest, p, "dest");
 
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index de2f1304e8..13f98e6688 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1059,30 +1059,33 @@ test_expect_success 'M: rename subdirectory to new subdirectory' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'M: rename root to subdirectory' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/M4
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	rename root
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "M: rename root ($root) to subdirectory" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/M4
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		rename root
+		COMMIT
 
-	from refs/heads/M2^0
-	R "" sub
+		from refs/heads/M2^0
+		R $root sub
 
-	INPUT_END
+		INPUT_END
 
-	cat >expect <<-EOF &&
-	:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
-	:100755 100755 $f4id $f4id R100	file4	sub/file4
-	:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
-	:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
-	:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
-	EOF
-	git fast-import <input &&
-	git diff-tree -M -r M4^ M4 >actual &&
-	compare_diff_raw expect actual
-'
+		cat >expect <<-EOF &&
+		:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
+		:100755 100755 $f4id $f4id R100	file4	sub/file4
+		:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
+		:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
+		:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
+		EOF
+		git fast-import <input &&
+		git diff-tree -M -r M4^ M4 >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series N
@@ -1259,49 +1262,52 @@ test_expect_success PIPE 'N: empty directory reads as missing' '
 	test_cmp expect actual
 '
 
-test_expect_success 'N: copy root directory by tree hash' '
-	cat >expect <<-EOF &&
-	:100755 000000 $newf $zero D	file3/newf
-	:100644 000000 $oldf $zero D	file3/oldf
-	EOF
-	root=$(git rev-parse refs/heads/branch^0^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N6
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by tree hash
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy root ($root) by tree hash" '
+		cat >expect <<-EOF &&
+		:100755 000000 $newf $zero D	file3/newf
+		:100644 000000 $oldf $zero D	file3/oldf
+		EOF
+		root_tree=$(git rev-parse refs/heads/branch^0^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N6
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by tree hash
+		COMMIT
 
-	from refs/heads/branch^0
-	M 040000 $root ""
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		M 040000 $root_tree $root
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
+		compare_diff_raw expect actual
+	'
 
-test_expect_success 'N: copy root by path' '
-	cat >expect <<-EOF &&
-	:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
-	:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
-	:100755 100755 $f4id $f4id C100	file4	oldroot/file4
-	:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
-	:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
-	EOF
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N-copy-root-path
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by (empty) path
-	COMMIT
+	test_expect_success "N: copy root ($root) by path" '
+		cat >expect <<-EOF &&
+		:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
+		:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
+		:100755 100755 $f4id $f4id C100	file4	oldroot/file4
+		:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
+		:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
+		EOF
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N-copy-root-path
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by (empty) path
+		COMMIT
 
-	from refs/heads/branch^0
-	C "" oldroot
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		C $root oldroot
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 test_expect_success 'N: delete directory by copying' '
 	cat >expect <<-\EOF &&
@@ -1431,98 +1437,102 @@ test_expect_success 'N: reject foo/ syntax in ls argument' '
 	INPUT_END
 '
 
-test_expect_success 'N: copy to root by id and modify' '
-	echo "hello, world" >expect.foo &&
-	echo hello >expect.bar &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N7
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy to root ($root) by id and modify" '
+		echo "hello, world" >expect.foo &&
+		echo hello >expect.bar &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N7
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N7:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N8
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N7:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N8
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 644 inline foo/foo
-	data <<EOF
-	hello, world
-	EOF
-	INPUT_END
-	git show N8:foo/foo >actual.foo &&
-	git show N8:foo/bar >actual.bar &&
-	test_cmp expect.foo actual.foo &&
-	test_cmp expect.bar actual.bar
-'
+		M 040000 $tree $root
+		M 644 inline foo/foo
+		data <<EOF
+		hello, world
+		EOF
+		INPUT_END
+		git show N8:foo/foo >actual.foo &&
+		git show N8:foo/bar >actual.bar &&
+		test_cmp expect.foo actual.foo &&
+		test_cmp expect.bar actual.bar
+	'
 
-test_expect_success 'N: extract subtree' '
-	branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N9
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	extract subtree branch:newdir
-	COMMIT
+	test_expect_success "N: extract subtree to the root ($root)" '
+		branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N9
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		extract subtree branch:newdir
+		COMMIT
 
-	M 040000 $branch ""
-	C "newdir" ""
-	INPUT_END
-	git fast-import <input &&
-	git diff --exit-code branch:newdir N9
-'
+		M 040000 $branch $root
+		C "newdir" $root
+		INPUT_END
+		git fast-import <input &&
+		git diff --exit-code branch:newdir N9
+	'
 
-test_expect_success 'N: modify subtree, extract it, and modify again' '
-	echo hello >expect.baz &&
-	echo hello, world >expect.qux &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N10
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+	test_expect_success "N: modify subtree, extract it to the root ($root), and modify again" '
+		echo hello >expect.baz &&
+		echo hello, world >expect.qux &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N10
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar/baz
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar/baz
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N10:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N11
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N10:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N11
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 100644 inline foo/bar/qux
-	data <<EOF
-	hello, world
-	EOF
-	R "foo" ""
-	C "bar/qux" "bar/quux"
-	INPUT_END
-	git show N11:bar/baz >actual.baz &&
-	git show N11:bar/qux >actual.qux &&
-	git show N11:bar/quux >actual.quux &&
-	test_cmp expect.baz actual.baz &&
-	test_cmp expect.qux actual.qux &&
-	test_cmp expect.qux actual.quux'
+		M 040000 $tree $root
+		M 100644 inline foo/bar/qux
+		data <<EOF
+		hello, world
+		EOF
+		R "foo" $root
+		C "bar/qux" "bar/quux"
+		INPUT_END
+		git show N11:bar/baz >actual.baz &&
+		git show N11:bar/qux >actual.qux &&
+		git show N11:bar/quux >actual.quux &&
+		test_cmp expect.baz actual.baz &&
+		test_cmp expect.qux actual.qux &&
+		test_cmp expect.qux actual.quux
+	'
+done
 
 ###
 ### series O
@@ -3067,6 +3077,7 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 # There are two sorts of ways a path can be parsed, depending on whether it is
 # the last field on the line. Additionally, ls without a <dataref> has a special
 # case. Test every occurrence of <path> in the grammar against every error case.
+# Paths for the root (empty strings) are tested elsewhere.
 #
 
 #
@@ -3321,16 +3332,19 @@ test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
 ###
 # Setup is carried over from series S.
 
-test_expect_success 'T: ls root tree' '
-	sed -e "s/Z\$//" >expect <<-EOF &&
-	040000 tree $(git rev-parse S^{tree})	Z
-	EOF
-	sha1=$(git rev-parse --verify S) &&
-	git fast-import --import-marks=marks <<-EOF >actual &&
-	ls $sha1 ""
-	EOF
-	test_cmp expect actual
-'
+for root in '""' ''
+do
+	test_expect_success "T: ls root ($root) tree" '
+		sed -e "s/Z\$//" >expect <<-EOF &&
+		040000 tree $(git rev-parse S^{tree})	Z
+		EOF
+		sha1=$(git rev-parse --verify S) &&
+		git fast-import --import-marks=marks <<-EOF >actual &&
+		ls $sha1 $root
+		EOF
+		test_cmp expect actual
+	'
+done
 
 test_expect_success 'T: delete branch' '
 	git branch to-delete &&
@@ -3432,30 +3446,33 @@ test_expect_success 'U: validate directory delete result' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'U: filedelete root succeeds' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/U
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	must succeed
-	COMMIT
-	from refs/heads/U^0
-	D ""
+for root in '""' ''
+do
+	test_expect_success "U: filedelete root ($root) succeeds" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/U-delete-root
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		must succeed
+		COMMIT
+		from refs/heads/U^0
+		D $root
 
-	INPUT_END
+		INPUT_END
 
-	git fast-import <input
-'
+		git fast-import <input
+	'
 
-test_expect_success 'U: validate root delete result' '
-	cat >expect <<-EOF &&
-	:100644 000000 $f7id $ZERO_OID D	hello.c
-	EOF
+	test_expect_success "U: validate root ($root) delete result" '
+		cat >expect <<-EOF &&
+		:100644 000000 $f7id $ZERO_OID D	hello.c
+		EOF
 
-	git diff-tree -M -r U^1 U >actual &&
+		git diff-tree -M -r U U-delete-root >actual &&
 
-	compare_diff_raw expect actual
-'
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series V (checkpoint)
-- 
2.44.0




* [PATCH v3 4/8] fast-import: remove dead strbuf
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (2 preceding siblings ...)
  2024-04-10  9:55     ` [PATCH v3 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-10  9:55     ` Thalia Archibald
  2024-04-11 19:53       ` Junio C Hamano
  2024-04-10  9:55     ` [PATCH v3 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
                       ` (4 subsequent siblings)
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:55 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

The strbuf in `note_change_n` exists to copy the remainder of `p` before
reading the next line potentially invalidates it. However, `p` is
not used after that point. It has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0da7e8a5a5..7a398dc975 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2444,7 +2444,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
 {
-	static struct strbuf uq = STRBUF_INIT;
 	struct object_entry *oe;
 	struct branch *s;
 	struct object_id oid, commit_oid;
@@ -2509,10 +2508,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
 		die("Invalid ref name or SHA1 expression: %s", p);
 
 	if (inline_data) {
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		read_next_command();
 		parse_and_store_blob(&last_blob, &oid, 0);
 	} else if (oe) {
-- 
2.44.0




* [PATCH v3 5/8] fast-import: improve documentation for unquoted paths
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (3 preceding siblings ...)
  2024-04-10  9:55     ` [PATCH v3 4/8] fast-import: remove dead strbuf Thalia Archibald
@ 2024-04-10  9:55     ` Thalia Archibald
  2024-04-11 19:51       ` Junio C Hamano
  2024-04-10  9:56     ` [PATCH v3 6/8] fast-import: document C-style escapes for paths Thalia Archibald
                       ` (3 subsequent siblings)
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:55 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

The documentation describes what cannot be in an unquoted path, but not
what an unquoted path is. Reframe it as a definition of unquoted paths.
The requirement that a path not start with `"` is the core element that
implies the rest.

The restriction that the source paths of filecopy and filerename cannot
contain SP is only stated in their respective sections. Restate it in
the <path> section.
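
For front-end authors, the documented rules can be summarized by a
small predicate. The following is a sketch in C under the rules stated
above; `path_needs_quoting` and its parameter are hypothetical names
and are not part of fast-import.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a front-end must emit `path` as
 * a C-style quoted string. `is_copy_or_rename_source` is true for the
 * source <path> of filecopy or filerename, which is terminated by SP
 * rather than by the end of the line.
 */
static bool path_needs_quoting(const char *path, bool is_copy_or_rename_source)
{
	assert(path != NULL);
	if (path[0] == '"')
		return true;	/* would be parsed as a quoted string */
	if (strchr(path, '\n'))
		return true;	/* LF would end the command line */
	if (is_copy_or_rename_source && strchr(path, ' '))
		return true;	/* SP would end the source path early */
	return false;
}
```

Quoting is accepted in all cases, so a front-end may also simply quote
every path; the predicate only identifies where quoting is mandatory.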

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..f26d7a8571 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
 In both formats `<path>` is the complete path of the file to be added
 (if not already existing) or modified (if already existing).
 
-A `<path>` string must use UNIX-style directory separators (forward
-slash `/`), may contain any byte other than `LF`, and must not
-start with double quote (`"`).
+A `<path>` can be written as unquoted bytes or a C-style quoted string:
 
-A path can use C-style string quoting; this is accepted in all cases
-and mandatory if the filename starts with double quote or contains
-`LF`. In C-style quoting, the complete name should be surrounded with
+When a `<path>` does not start with double quote (`"`), it is an
+unquoted string and is parsed as literal bytes without any escape
+sequences. However, if the filename contains `LF` or starts with double
+quote, it must be written as a quoted string. Additionally, the source
+`<path>` in `filecopy` or `filerename` must be quoted if it contains SP.
+
+A `<path>` can use C-style string quoting; this is accepted in all cases
+and mandatory in the cases where the filename cannot be represented as
+an unquoted string. In C-style quoting, the complete name should be surrounded with
 double quotes, and any `LF`, backslash, or double quote characters
 must be escaped by preceding them with a backslash (e.g.,
 `"path/with\n, \\ and \" in it"`).
 
-The value of `<path>` must be in canonical form. That is it must not:
+A `<path>` must use UNIX-style directory separators (forward slash `/`)
+and must be in canonical form. That is it must not:
 
 * contain an empty directory component (e.g. `foo//bar` is invalid),
 * end with a directory separator (e.g. `foo/` is invalid),
-- 
2.44.0




* [PATCH v3 6/8] fast-import: document C-style escapes for paths
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (4 preceding siblings ...)
  2024-04-10  9:55     ` [PATCH v3 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
@ 2024-04-10  9:56     ` Thalia Archibald
  2024-04-10 18:28       ` Junio C Hamano
  2024-04-10  9:56     ` [PATCH v3 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
                       ` (2 subsequent siblings)
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:56 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

Simply saying “C-style” string quoting is imprecise, as only a subset of
C escapes are supported. Document the exact escapes.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 12 ++++++++----
 t/t9300-fast-import.sh            | 10 ++++++----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index f26d7a8571..db53b50268 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -640,10 +640,14 @@ quote, it must be written as a quoted string. Additionally, the source
 
 A `<path>` can use C-style string quoting; this is accepted in all cases
 and mandatory in the cases where the filename cannot be represented as
-an unquoted string. In C-style quoting, the complete name should be surrounded with
-double quotes, and any `LF`, backslash, or double quote characters
-must be escaped by preceding them with a backslash (e.g.,
-`"path/with\n, \\ and \" in it"`).
+an unquoted string. In C-style quoting, the complete filename is
+surrounded with double quote (`"`) and certain characters must be
+escaped by preceding them with a backslash: `LF` is written as `\n`,
+backslash as `\\`, and double quote as `\"`. Some characters may may
+optionally be written with escape sequences: `\a` for bell, `\b` for
+backspace, `\f` for form feed, `\n` for line feed, `\r` for carriage
+return, `\t` for horizontal tab, and `\v` for vertical tab. Any byte can
+be written with 3-digit octal codes (e.g., `\033`).
 
 A `<path>` must use UNIX-style directory separators (forward slash `/`)
 and must be in canonical form. That is it must not:
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 13f98e6688..5cde8f8d01 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3189,8 +3189,9 @@ test_path_eol_success () {
 	'
 }
 
-test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
-test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '
+test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
 
 #
 # Valid paths before a space: filecopy (source) and filerename (source).
@@ -3256,8 +3257,9 @@ test_path_space_success () {
 	'
 }
 
-test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
-test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
+test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'
 
 #
 # Test a single commit change with an invalid path. Run it with all occurrences
-- 
2.44.0




* [PATCH v3 7/8] fast-import: forbid escaped NUL in paths
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (5 preceding siblings ...)
  2024-04-10  9:56     ` [PATCH v3 6/8] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-04-10  9:56     ` Thalia Archibald
  2024-04-10 18:51       ` Junio C Hamano
  2024-04-10  9:56     ` [PATCH v3 8/8] fast-import: make comments more precise Thalia Archibald
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:56 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.

When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.

fast-import still has other issues with NUL, but those will be addressed
later.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 1 +
 builtin/fast-import.c             | 2 ++
 t/t9300-fast-import.sh            | 1 +
 3 files changed, 4 insertions(+)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index db53b50268..edda30f90c 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -660,6 +660,7 @@ and must be in canonical form. That is it must not:
 
 The root of the tree can be represented by an empty string as `<path>`.
 
+`<path>` cannot contain NUL, either literally or escaped as `\000`.
 It is recommended that `<path>` always be encoded using UTF-8.
 
 `filedelete`
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 7a398dc975..98096b6fa7 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2269,6 +2269,8 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp,
 	if (*p == '"') {
 		if (unquote_c_style(sb, p, endp))
 			die("Invalid %s: %s", field, command_buf.buf);
+		if (strlen(sb->buf) != sb->len)
+			die("NUL in %s: %s", field, command_buf.buf);
 	} else {
 		if (include_spaces)
 			*endp = p + strlen(p);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 5cde8f8d01..1e68426852 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3300,6 +3300,7 @@ test_path_base_fail () {
 	local change="$1" prefix="$2" field="$3" suffix="$4"
 	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
 	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
 }
 test_path_eol_quoted_fail () {
 	local change="$1" prefix="$2" field="$3"
-- 
2.44.0




* [PATCH v3 8/8] fast-import: make comments more precise
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (6 preceding siblings ...)
  2024-04-10  9:56     ` [PATCH v3 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-10  9:56     ` Thalia Archibald
  2024-04-10 19:21       ` Junio C Hamano
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10  9:56 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Chris Torek, Elijah Newren, Thalia Archibald

The former is somewhat imprecise. The latter became out of sync with the
behavior in e814c39c2f (fast-import: refactor parsing of spaces,
2014-06-18).

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 98096b6fa7..fd23a00150 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2210,7 +2210,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2243,8 +2243,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
-- 
2.44.0




* Re: [PATCH v2 2/8] fast-import: directly use strbufs for paths
  2024-04-10  6:27     ` Patrick Steinhardt
@ 2024-04-10 10:07       ` Thalia Archibald
  2024-04-10 10:18         ` Patrick Steinhardt
  0 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10 10:07 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Elijah Newren

On Apr 9, 2024, at 23:27, Patrick Steinhardt <ps@pks.im> wrote:
> On Mon, Apr 01, 2024 at 09:03:06AM +0000, Thalia Archibald wrote:
>> 
>> + parse_path_eol(&path, p
>> , "path");
> 
> This looks weird. Did you manually edit the patch or is there some weird
> character in here that breaks diff generation?
> 
>> + tree_content_get(&b-
>>> branch_tree, source.buf, &leaf, 1);
> 
> Same here. Is your mail agent maybe wrapping lines?
> 
>> - s
>> trbuf_reset(&uq);
> 
> And here.
> 
> Other than those formatting issues this patch looks fine to me.

I’m not able to reproduce these rewrapping issues anywhere I view this email: in
my outbox, inbox, or the archive. I think it’s on your end.

https://lore.kernel.org/git/82a6f53c1326a420348eb70461f5929340a930d3.1711960552.git.thalia@archibald.dev/

Thalia


* Re: [PATCH v2 2/8] fast-import: directly use strbufs for paths
  2024-04-10 10:07       ` Thalia Archibald
@ 2024-04-10 10:18         ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-10 10:18 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Elijah Newren

On Wed, Apr 10, 2024 at 10:07:13AM +0000, Thalia Archibald wrote:
> On Apr 9, 2024, at 23:27, Patrick Steinhardt <ps@pks.im> wrote:
> > On Mon, Apr 01, 2024 at 09:03:06AM +0000, Thalia Archibald wrote:
> >> 
> >> + parse_path_eol(&path, p
> >> , "path");
> > 
> > This looks weird. Did you manually edit the patch or is there some weird
> > character in here that breaks diff generation?
> > 
> >> + tree_content_get(&b-
> >>> branch_tree, source.buf, &leaf, 1);
> > 
> > Same here. Is your mail agent maybe wrapping lines?
> > 
> >> - s
> >> trbuf_reset(&uq);
> > 
> > And here.
> > 
> > Other than those formatting issues this patch looks fine to me.
> 
> I’m not able to reproduce these rewrapping issues anywhere I view this email: in
> my outbox, inbox, or the archive. I think it’s on your end.
> 
> https://lore.kernel.org/git/82a6f53c1326a420348eb70461f5929340a930d3.1711960552.git.thalia@archibald.dev/

Could be that this is happening because the mails you sent to me are
actually encrypted.

Patrick


* Re: [PATCH v3 6/8] fast-import: document C-style escapes for paths
  2024-04-10  9:56     ` [PATCH v3 6/8] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-04-10 18:28       ` Junio C Hamano
  2024-04-10 22:50         ` Thalia Archibald
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2024-04-10 18:28 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> +an unquoted string. In C-style quoting, the complete filename is
> +surrounded with double quote (`"`) and certain characters must be
> +escaped by preceding them with a backslash: `LF` is written as `\n`,
> +backslash as `\\`, and double quote as `\"`. Some characters may may

"may may"?

> +optionally be written with escape sequences: `\a` for bell, `\b` for
> +backspace, `\f` for form feed, `\n` for line feed, `\r` for carriage
> +return, `\t` for horizontal tab, and `\v` for vertical tab. Any byte can
> +be written with 3-digit octal codes (e.g., `\033`).

Separating the escaped characters into two classes (mandatory LF and
BackSlash, and others) is probably a good idea to clarify the
description.  Nicely done.
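
To make the documented escape set concrete, here is a minimal sketch of
a decoder for it. It is not Git's unquote_c_style(); the name
unquote_sketch and its buffer-based interface are hypothetical, and it
assumes that an octal escape is exactly three digits with a first digit
of 0 to 3, and that any other escape (such as `\x`) is rejected.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not Git's unquote_c_style(): decode a C-style
 * quoted string into out (capacity cap, including the terminator).
 * Returns the decoded length, or -1 if the input is malformed or does
 * not fit.
 */
static int unquote_sketch(const char *in, char *out, size_t cap)
{
	size_t n = 0;

	if (*in++ != '"')
		return -1; /* a quoted string must start with a double quote */
	for (;;) {
		int ch = (unsigned char)*in++;

		if (ch == '\0')
			return -1; /* unclosed quote */
		if (ch == '"')
			break; /* closing quote */
		if (ch == '\\') {
			ch = (unsigned char)*in++;
			switch (ch) {
			case 'a': ch = '\a'; break;
			case 'b': ch = '\b'; break;
			case 'f': ch = '\f'; break;
			case 'n': ch = '\n'; break;
			case 'r': ch = '\r'; break;
			case 't': ch = '\t'; break;
			case 'v': ch = '\v'; break;
			case '\\':
			case '"':
				break; /* the character stands for itself */
			case '0': case '1': case '2': case '3': {
				/* assumed form: exactly three octal digits */
				int d1 = (unsigned char)in[0], d2;

				if (d1 < '0' || d1 > '7')
					return -1;
				d2 = (unsigned char)in[1];
				if (d2 < '0' || d2 > '7')
					return -1;
				ch = ((ch - '0') << 6) | ((d1 - '0') << 3) | (d2 - '0');
				in += 2;
				break;
			}
			default:
				return -1; /* unknown escape, e.g. \x */
			}
		}
		if (n + 1 >= cap)
			return -1; /* no room for this byte plus the terminator */
		out[n++] = (char)ch;
	}
	out[n] = '\0';
	return (int)n;
}
```

With this sketch, the test input `"\150\151\056\143"` from the patch
decodes to the four bytes of hi.c, while `"hello\xff"` and an unclosed
`"hello.c` are both rejected.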

>  A `<path>` must use UNIX-style directory separators (forward slash `/`)
>  and must be in canonical form. That is it must not:
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 13f98e6688..5cde8f8d01 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -3189,8 +3189,9 @@ test_path_eol_success () {
>  	'
>  }
>  
> -test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
> -test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
> +test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
> +test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '

It is annoying to see these changes (and the same for the next
hunk).  Would it make a lot of damage to existing lines in this file
if we just say "do not align with extra spaces in between strings"?
If so, that is a good reason to keep doing things this way, but if I
recall correctly, these test_path_eol/space_success are what this
series added to the file, so if we stop such alignment from the get-go,
it may be alright.

> +test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
>  
>  #
>  # Valid paths before a space: filecopy (source) and filerename (source).
> @@ -3256,8 +3257,9 @@ test_path_space_success () {
>  	'
>  }
>  
> -test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
> -test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
> +test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
> +test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
> +test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'
>  
>  #
>  # Test a single commit change with an invalid path. Run it with all occurrences

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 7/8] fast-import: forbid escaped NUL in paths
  2024-04-10  9:56     ` [PATCH v3 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-10 18:51       ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-10 18:51 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> NUL cannot appear in paths. Even disregarding filesystem path
> limitations, the tree object format delimits with NUL, so such a path
> cannot be encoded by Git.
>
> When a quoted path is unquoted, it could possibly contain NUL from
> "\000". Forbid it so it isn't truncated.
>
> fast-import still has other issues with NUL, but those will be addressed
> later.

Later meaning outside the series, as 8/8 is not about that?  Not a
complaint, and if the way I interpreted is correct, then there is no
need to update the above statement.  Just double-checking.

> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  Documentation/git-fast-import.txt | 1 +
>  builtin/fast-import.c             | 2 ++
>  t/t9300-fast-import.sh            | 1 +
>  3 files changed, 4 insertions(+)
>
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index db53b50268..edda30f90c 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -660,6 +660,7 @@ and must be in canonical form. That is it must not:
>  
>  The root of the tree can be represented by an empty string as `<path>`.
>  
> +`<path>` cannot contain NUL, either literally or escaped as `\000`.

OK.

> +		if (strlen(sb->buf) != sb->len)
> +			die("NUL in %s: %s", field, command_buf.buf);

Nice.  !memchr(sb->buf, ch, sb->len) would be more general solution
if we were looking for ch that is not NUL, but for checking NUL,
what you wrote is the most natural to read.
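
For readers following along, the two checks can be compared in a small
standalone sketch. The helper names below are hypothetical and plain
buffers stand in for struct strbuf; the only assumption is that buf
holds len bytes followed by a terminating NUL, as a strbuf does.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: strlen() stops at the first NUL, so it comes up
 * short of len exactly when one of the len bytes is a NUL.
 */
static int has_embedded_nul(const char *buf, size_t len)
{
	return strlen(buf) != len;
}

/* Hypothetical helper: the memchr()-based form works for any byte ch. */
static int has_byte(const char *buf, size_t len, int ch)
{
	return memchr(buf, ch, len) != NULL;
}
```

For a buffer unquoted from `"hello\000"` (six bytes, the last being
NUL), both helpers report the NUL; for plain hello (five bytes),
neither does.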

>  	} else {
>  		if (include_spaces)
>  			*endp = p + strlen(p);
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 5cde8f8d01..1e68426852 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -3300,6 +3300,7 @@ test_path_base_fail () {
>  	local change="$1" prefix="$2" field="$3" suffix="$4"
>  	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
>  	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
> +	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
>  }
>  test_path_eol_quoted_fail () {
>  	local change="$1" prefix="$2" field="$3"


* Re: [PATCH v3 8/8] fast-import: make comments more precise
  2024-04-10  9:56     ` [PATCH v3 8/8] fast-import: make comments more precise Thalia Archibald
@ 2024-04-10 19:21       ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-10 19:21 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> The former is somewhat imprecise. The latter became out of sync with the
> behavior in e814c39c2f (fast-import: refactor parsing of spaces,
> 2014-06-18).
>
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Thanks for being careful.  Looking good.

>
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 98096b6fa7..fd23a00150 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2210,7 +2210,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
>   *
>   *   idnum ::= ':' bigint;
>   *
> - * Return the first character after the value in *endptr.
> + * Update *endptr to point to the first character after the value.
>   *
>   * Complain if the following character is not what is expected,
>   * either a space or end of the string.
> @@ -2243,8 +2243,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
>  }
>  
>  /*
> - * Parse the mark reference, demanding a trailing space.  Return a
> - * pointer to the space.
> + * Parse the mark reference, demanding a trailing space. Update *p to
> + * point to the first character after the space.
>   */
>  static uintmax_t parse_mark_ref_space(const char **p)
>  {


* Re: [PATCH v3 6/8] fast-import: document C-style escapes for paths
  2024-04-10 18:28       ` Junio C Hamano
@ 2024-04-10 22:50         ` Thalia Archibald
  2024-04-11  5:32           ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-10 22:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

On Apr 10, 2024, at 11:28, Junio C Hamano <gitster@pobox.com> wrote:
> Thalia Archibald <thalia@archibald.dev> writes:
>> 
>> -test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
>> -test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
>> +test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
>> +test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '
>> +test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
> 
> It is annoying to see these changes (and the same for the next
> hunk).  Would it make a lot of damage to existing lines in this file
> if we just say "do not align with extra spaces in between strings"?
> If so, that is a good reason to keep doing things this way, but if I
> recall correctly, these test_path_eol/space_success are what this
> series added to the file, so if we stop such alignment from the get-go,
> it may be alright.
> 
>> -test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
>> -test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
>> +test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
>> +test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
>> +test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'

Is it a style problem, that you prefer parameters to not be aligned? I
think it reads nicer this way, especially because there are quotes
within quotes and spaces at the starts and ends of strings, which can
lead to reinterpreting the boundaries of the strings on a less-careful
read through. They’re like a table of tests. But ultimately, it should
be the Git style that prevails not mine, so if that’s it, I’ll change
it.

Or I could preemptively align them according to the final alignment in
this series. I expect there wouldn't be many changes to these tests
later, so it should be stable.

I expected more pushback with 3/8, where 9 tests were indented to place
them inside loops in order to test them with multiple values for root,
so it seems not to be purely about whitespace changes in diffs.

In any case, it’s not a big deal and I’m happy to go with your
direction.

Thalia


* Re: [PATCH v3 6/8] fast-import: document C-style escapes for paths
  2024-04-10 22:50         ` Thalia Archibald
@ 2024-04-11  5:32           ` Junio C Hamano
  2024-04-11  9:14             ` Patrick Steinhardt
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2024-04-11  5:32 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> I expected more pushback with 3/8, where 9 tests were indented to place
> them inside loops in order to test them with multiple values for root,
> so it seems not to be purely about whitespace changes in diffs.

Well, if I read it, I may have (or not have) comments on the step,
but because Patrick started from front, I was reading backwards from
the end of the series, and I didn't reach 3/8 ;-)




* Re: [PATCH v3 6/8] fast-import: document C-style escapes for paths
  2024-04-11  5:32           ` Junio C Hamano
@ 2024-04-11  9:14             ` Patrick Steinhardt
  0 siblings, 0 replies; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-11  9:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thalia Archibald, git, Chris Torek, Elijah Newren

On Wed, Apr 10, 2024 at 10:32:06PM -0700, Junio C Hamano wrote:
> Thalia Archibald <thalia@archibald.dev> writes:
> 
> > I expected more pushback with 3/8, where 9 tests were indented to place
> > them inside loops in order to test them with multiple values for root,
> > so it seems not to be purely about whitespace changes in diffs.
> 
> Well, if I read it, I may have (or not have) comments on the step,
> but because Patrick started from front, I was reading backwards from
> the end of the series, and I didn't reach 3/8 ;-)

I wasn't all that happy with that conversion indeed. But I also couldn't
really think of a nicer way to handle this. While we could've just not
reindented the tests at all, I kind of doubt that this would be the
wisest decision.

So I just didn't complain :)

Patrick


* Re: [PATCH v3 5/8] fast-import: improve documentation for unquoted paths
  2024-04-10  9:55     ` [PATCH v3 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
@ 2024-04-11 19:51       ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-11 19:51 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> It describes what cannot be in an unquoted path, but not what it is.
> Reframe it as a definition of unquoted paths. The requirement that it
> not start with `"` is the core element that implies the rest.

The other is that a path with LF in it cannot be written unquoted,
which should be treated the same way as ones that begin with dq in
this explanation, I think.

> The restriction that the source paths of filecopy and filerename cannot
> contain SP is only stated in their respective sections. Restate it in
> the <path> section.

Elsewhere later in the series we clarify that NUL cannot appear in a
path, so the above looks perfect at this point in the series.

> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  Documentation/git-fast-import.txt | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index b2607366b9..f26d7a8571 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -630,18 +630,23 @@ in octal.  Git only supports the following modes:
>  In both formats `<path>` is the complete path of the file to be added
>  (if not already existing) or modified (if already existing).
>  
> -A `<path>` string must use UNIX-style directory separators (forward
> -slash `/`), may contain any byte other than `LF`, and must not
> -start with double quote (`"`).
>  
> -A path can use C-style string quoting; this is accepted in all cases
> -and mandatory if the filename starts with double quote or contains
> -`LF`. In C-style quoting, the complete name should be surrounded with

> +A `<path>` can be written as unquoted bytes or a C-style quoted string:

If it is followed by two-bullet-point enumeration (one for unquoted,
the other for quoted), then sentence ending in a colon here is
perfectly fine, but we are not doing so in the next two paragraphs,
so let's give it a normal full-stop (period).  And make it a
paragraph on its own, which you did correctly.

> +When a `<path>` does not start with double quote (`"`), it is an
> +unquoted string and is parsed as literal bytes without any escape
> +sequences. However, if the filename contains `LF` or starts with double
> +quote, it must be written as a quoted string. Additionally, the source
> +`<path>` in `filecopy` or `filerename` must be quoted if it contains SP.

As the description for <path> in filecopy and filerename refers back
to this description, this "Additionally" is a good clarification to
have here.  Nicely done.
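
The three conditions in that paragraph can be summarized from a
front-end's point of view as a small predicate. This is only a sketch:
path_needs_quoting and its flag argument are hypothetical names, not
anything in fast-import, and it covers only the rules stated in the
quoted text.

```c
#include <string.h>

/*
 * Hypothetical front-end helper: decide whether a path has to be
 * written as a C-style quoted string. is_copy_or_rename_source is
 * nonzero for the source <path> of filecopy or filerename.
 */
static int path_needs_quoting(const char *path, int is_copy_or_rename_source)
{
	if (path[0] == '"')
		return 1; /* would otherwise be parsed as a quoted string */
	if (strchr(path, '\n'))
		return 1; /* LF ends the command line */
	if (is_copy_or_rename_source && strchr(path, ' '))
		return 1; /* SP separates source from destination */
	return 0;
}
```

Under these rules ` hello world.c ` can stay unquoted as the last field
of a command, but has to be quoted when it is the source of a filecopy
or filerename.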

> +A `<path>` can use C-style string quoting; this is accepted in all cases
> +and mandatory in the cases where the filename cannot be represented as
> +an unquoted string. In C-style quoting, the complete name should be surrounded with

I somehow think the early part (before "In C-style quoting") is
redundant and unnecessary, as we started the section with "as
unquoted bytes or a C-style quoted string."  As the previous
paragraph about unquoted path begins with "When a <path> does not
start with a double quote", I would have expected this paragraph to
begin like so:

	When a `<path>` begins with a double quote (`"`), it is a
	C-style quoted string, where the complete name is enclosed
	in a pair of double quotes, and ...

>  double quotes, and any `LF`, backslash, or double quote characters
>  must be escaped by preceding them with a backslash (e.g.,
>  `"path/with\n, \\ and \" in it"`).
>  
> -The value of `<path>` must be in canonical form. That is it must not:
> +A `<path>` must use UNIX-style directory separators (forward slash `/`)
> +and must be in canonical form. That is it must not:
>  
>  * contain an empty directory component (e.g. `foo//bar` is invalid),
>  * end with a directory separator (e.g. `foo/` is invalid),

Thanks.


* Re: [PATCH v3 4/8] fast-import: remove dead strbuf
  2024-04-10  9:55     ` [PATCH v3 4/8] fast-import: remove dead strbuf Thalia Archibald
@ 2024-04-11 19:53       ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-11 19:53 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> The strbuf in `note_change_n` is to copy the remainder of `p` before
> potentially invalidating it when reading the next line. However, `p` is
> not used after that point. It has been unused since the function was
> created in a8dd2e7d2b (fast-import: Add support for importing commit
> notes, 2009-10-09) and looks to be a fossil from adapting
> `file_change_m`. Remove it.
>
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  builtin/fast-import.c | 5 -----
>  1 file changed, 5 deletions(-)

Losing code that is not used is always good ;-)


> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 0da7e8a5a5..7a398dc975 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2444,7 +2444,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
>  
>  static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
>  {
> -	static struct strbuf uq = STRBUF_INIT;
>  	struct object_entry *oe;
>  	struct branch *s;
>  	struct object_id oid, commit_oid;
> @@ -2509,10 +2508,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
>  		die("Invalid ref name or SHA1 expression: %s", p);
>  
>  	if (inline_data) {
> -		if (p != uq.buf) {
> -			strbuf_addstr(&uq, p);
> -			p = uq.buf;
> -		}
>  		read_next_command();
>  		parse_and_store_blob(&last_blob, &oid, 0);
>  	} else if (oe) {


* Re: [PATCH v3 3/8] fast-import: allow unquoted empty path for root
  2024-04-10  9:55     ` [PATCH v3 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-11 19:59       ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-11 19:59 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> For most of this feature's lifetime, the documentation has not
> prescribed the use of quoted empty strings. In e5959106d6
> (Documentation/fast-import: put explanation of M 040000 <dataref> "" in
> context, 2011-01-15), its documentation was changed from “`<path>` may
> also be an empty string (`""`) to specify the root of the tree” to “The
> root of the tree can be represented by an empty string as `<path>`”.
>
> Thus, we can assume that some front-ends have depended on this behavior.

If I were writing this, I would say "must" instead of "can", as
otherwise it would probably be a good idea if we could tighten it.

Of course no need to reroll just to update this.

> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index de2f1304e8..13f98e6688 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh

The changes needed to test both a C_quoted empty string and an
unquoted empty string are so small (when viewed with "show -w")
and pleasant.  Nicely done.


* [PATCH v4 0/8] fast-import: tighten parsing of paths
  2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
                       ` (7 preceding siblings ...)
  2024-04-10  9:56     ` [PATCH v3 8/8] fast-import: make comments more precise Thalia Archibald
@ 2024-04-12  8:01     ` Thalia Archibald
  2024-04-12  8:02       ` [PATCH v4 1/8] fast-import: tighten path unquoting Thalia Archibald
                         ` (8 more replies)
  8 siblings, 9 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:01 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case is when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
>
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.

Updated to address Junio's review comments. Thanks!

This round needed no code changes, so it is probably ready if the
documentation changes look good.

Changes since v3:
* Reword documentation paragraphs on path quoting.
* Fix minor typo in commit message.

Thalia


Thalia Archibald (8):
  fast-import: tighten path unquoting
  fast-import: directly use strbufs for paths
  fast-import: allow unquoted empty path for root
  fast-import: remove dead strbuf
  fast-import: improve documentation for path quoting
  fast-import: document C-style escapes for paths
  fast-import: forbid escaped NUL in paths
  fast-import: make comments more precise

 Documentation/git-fast-import.txt |  31 +-
 builtin/fast-import.c             | 158 ++++----
 t/t9300-fast-import.sh            | 624 +++++++++++++++++++++---------
 3 files changed, 551 insertions(+), 262 deletions(-)

Range-diff against v3:
1:  d9ab0c6a75 = 1:  d6ea8aca46 fast-import: tighten path unquoting
2:  696ca27bb7 = 2:  9499f34aae fast-import: directly use strbufs for paths
3:  39879d0a66 ! 3:  9b1e6b80f5 fast-import: allow unquoted empty path for root
    @@ Commit message
         also be an empty string (`""`) to specify the root of the tree” to “The
         root of the tree can be represented by an empty string as `<path>`”.
     
    -    Thus, we can assume that some front-ends have depended on this behavior.
    +    Thus, we should assume that some front-ends have depended on this
    +    behavior.
     
         Remove this restriction for the destination paths of filecopy and
         filerename and change tests targeting the root to test `""` and ``.
4:  1cef05e59a = 4:  1a2b0dc616 fast-import: remove dead strbuf
5:  2e78690023 ! 5:  fb0d870d53 fast-import: improve documentation for unquoted paths
    @@ Metadata
     Author: Thalia Archibald <thalia@archibald.dev>
     
      ## Commit message ##
    -    fast-import: improve documentation for unquoted paths
    +    fast-import: improve documentation for path quoting
     
    -    It describes what cannot be in an unquoted path, but not what it is.
    -    Reframe it as a definition of unquoted paths. The requirement that it
    -    not start with `"` is the core element that implies the rest.
    +    It describes what characters cannot be in an unquoted path, but not
    +    their semantics. Reframe it as a definition of unquoted paths. From the
    +    perspective of the parser, whether it starts with `"` is what defines
    +    whether it will parse it as quoted or unquoted.
    +
    +    The restrictions on characters in unquoted paths (with starting-", LF,
    +    and spaces) are explained in the quoted paragraph. Move it to the
    +    unquoted paragraph and reword.
     
         The restriction that the source paths of filecopy and filerename cannot
         contain SP is only stated in their respective sections. Restate it in
    @@ Documentation/git-fast-import.txt: in octal.  Git only supports the following mo
     -A `<path>` string must use UNIX-style directory separators (forward
     -slash `/`), may contain any byte other than `LF`, and must not
     -start with double quote (`"`).
    -+A `<path>` can be written as unquoted bytes or a C-style quoted string:
    ++A `<path>` can be written as unquoted bytes or a C-style quoted string.
      
     -A path can use C-style string quoting; this is accepted in all cases
     -and mandatory if the filename starts with double quote or contains
     -`LF`. In C-style quoting, the complete name should be surrounded with
    -+When a `<path>` does not start with double quote (`"`), it is an
    +-double quotes, and any `LF`, backslash, or double quote characters
    +-must be escaped by preceding them with a backslash (e.g.,
    +-`"path/with\n, \\ and \" in it"`).
    ++When a `<path>` does not start with a double quote (`"`), it is an
     +unquoted string and is parsed as literal bytes without any escape
     +sequences. However, if the filename contains `LF` or starts with double
    -+quote, it must be written as a quoted string. Additionally, the source
    -+`<path>` in `filecopy` or `filerename` must be quoted if it contains SP.
    -+
    -+A `<path>` can use C-style string quoting; this is accepted in all cases
    -+and mandatory in the cases where the filename cannot be represented as
    -+an unquoted string. In C-style quoting, the complete name should be surrounded with
    - double quotes, and any `LF`, backslash, or double quote characters
    - must be escaped by preceding them with a backslash (e.g.,
    - `"path/with\n, \\ and \" in it"`).
    ++quote, it cannot be represented as an unquoted string and must be
    ++quoted. Additionally, the source `<path>` in `filecopy` or `filerename`
    ++must be quoted if it contains SP.
      
     -The value of `<path>` must be in canonical form. That is it must not:
    ++When a `<path>` starts with a double quote (`"`), it is a C-style quoted
    ++string, where the complete filename is enclosed in a pair of double
    ++quotes and escape sequences are used. Certain characters must be escaped
    ++by preceding them with a backslash: `LF` is written as `\n`, backslash
    ++as `\\`, and double quote as `\"`. All filenames can be represented as
    ++quoted strings.
    ++
     +A `<path>` must use UNIX-style directory separators (forward slash `/`)
    -+and must be in canonical form. That is it must not:
    ++and its value must be in canonical form. That is it must not:
      
      * contain an empty directory component (e.g. `foo//bar` is invalid),
      * end with a directory separator (e.g. `foo/` is invalid),
6:  1b07ddffe0 ! 6:  4b6017ded8 fast-import: document C-style escapes for paths
    @@ Commit message
         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
     
      ## Documentation/git-fast-import.txt ##
    -@@ Documentation/git-fast-import.txt: quote, it must be written as a quoted string. Additionally, the source
    - 
    - A `<path>` can use C-style string quoting; this is accepted in all cases
    - and mandatory in the cases where the filename cannot be represented as
    --an unquoted string. In C-style quoting, the complete name should be surrounded with
    --double quotes, and any `LF`, backslash, or double quote characters
    --must be escaped by preceding them with a backslash (e.g.,
    --`"path/with\n, \\ and \" in it"`).
    -+an unquoted string. In C-style quoting, the complete filename is
    -+surrounded with double quote (`"`) and certain characters must be
    -+escaped by preceding them with a backslash: `LF` is written as `\n`,
    -+backslash as `\\`, and double quote as `\"`. Some characters may may
    -+optionally be written with escape sequences: `\a` for bell, `\b` for
    -+backspace, `\f` for form feed, `\n` for line feed, `\r` for carriage
    -+return, `\t` for horizontal tab, and `\v` for vertical tab. Any byte can
    -+be written with 3-digit octal codes (e.g., `\033`).
    +@@ Documentation/git-fast-import.txt: When a `<path>` starts with a double quote (`"`), it is a C-style quoted
    + string, where the complete filename is enclosed in a pair of double
    + quotes and escape sequences are used. Certain characters must be escaped
    + by preceding them with a backslash: `LF` is written as `\n`, backslash
    +-as `\\`, and double quote as `\"`. All filenames can be represented as
    ++as `\\`, and double quote as `\"`. Some characters may optionally be
    ++written with escape sequences: `\a` for bell, `\b` for backspace, `\f`
    ++for form feed, `\n` for line feed, `\r` for carriage return, `\t` for
    ++horizontal tab, and `\v` for vertical tab. Any byte can be written with
    ++3-digit octal codes (e.g., `\033`). All filenames can be represented as
    + quoted strings.
      
      A `<path>` must use UNIX-style directory separators (forward slash `/`)
    - and must be in canonical form. That is it must not:
     
      ## t/t9300-fast-import.sh ##
     @@ t/t9300-fast-import.sh: test_path_eol_success () {
7:  dc67464b6a = 7:  5b464f4b01 fast-import: forbid escaped NUL in paths
8:  5e02d887bc = 8:  6eb66fce45 fast-import: make comments more precise
-- 
2.44.0



* [PATCH v4 1/8] fast-import: tighten path unquoting
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-04-12  8:02       ` Thalia Archibald
  2024-04-12 16:34         ` Junio C Hamano
  2024-04-12  8:03       ` [PATCH v4 2/8] fast-import: directly use strbufs for paths Thalia Archibald
                         ` (7 subsequent siblings)
  8 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:02 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed or not checked.

<path> appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes to the end of
   the line (including any number of SP).
2. For filecopy (source) and filerename (source):
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes up to, but not
   including, the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is forbidden.
4. For ls:
   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).
5. For ls-commit:
   Unquote <path> and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine instead of matching Git's, which could silently
introduce escapes that would be incorrectly parsed due to this and lead
to data corruption.
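
As a rough illustration of what a conforming front-end could emit instead,
here is a minimal sketch of a quoting helper. It is not part of Git or of
this series: the name `quote_path` and its signature are hypothetical. It
assumes only the escapes this series documents (`\"`, `\\`, `\n`, and
3-digit octal) and never produces hex escapes. It also rejects NUL bytes,
on the assumption that a later patch in this series forbids escaped NUL in
paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical front-end helper (not part of Git): write the `len` bytes
 * at `path` into `out` as a C-style quoted string suitable for a <path>
 * in a fast-import stream.
 *
 * Returns the number of bytes written (excluding the trailing NUL), or -1
 * if `out` is too small or the path contains a NUL byte.
 */
static int quote_path(const unsigned char *path, size_t len,
		      char *out, size_t out_size)
{
	size_t n = 0;
	size_t i;

	assert(path && out);

	/* Worst case: 4 output bytes per input byte, two quotes, one NUL. */
	if (out_size < len * 4 + 3)
		return -1;

	out[n++] = '"';
	for (i = 0; i < len; i++) {
		unsigned char c = path[i];

		if (c == '\0')
			return -1; /* assumption: NUL is not allowed in paths */
		if (c == '"' || c == '\\') {
			out[n++] = '\\';
			out[n++] = (char)c;
		} else if (c == '\n') {
			out[n++] = '\\';
			out[n++] = 'n';
		} else if (c < 0x20 || c >= 0x7f) {
			/* 3-digit octal escape; never a hex escape such as \xff */
			n += (size_t)sprintf(out + n, "\\%03o", (unsigned)c);
		} else {
			out[n++] = (char)c;
		}
	}
	out[n++] = '"';
	out[n] = '\0';
	return (int)n;
}
```

With this sketch, the single byte 0xff is written as `"\377"` rather than
`"\xff"`. Always quoting is a deliberate simplification: quoted strings are
accepted wherever <path> appears, so the front-end does not need to know
whether the path is followed by SP or by the end of the line.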

The documentation states “To use a source path that contains SP the path
must be quoted.”, so it is expected that some implementations depend on
spaces being allowed in paths in the final position. Thus we have two
documented ways to parse paths, so simplify the implementation to match
them.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).

2. `parse_path_space` for filecopy (source) and filerename (source):

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes up to, but not including, the
   next SP. It must be followed by SP.

There remain two special cases: the dest <path> in filecopy and
filerename cannot be an unquoted empty string (this will be addressed
subsequently), and <path> in ls-commit must be quoted to disambiguate it
from ls.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  | 104 ++++++++++-------
 t/t9300-fast-import.sh | 258 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 318 insertions(+), 44 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 782bda007c..ce9231afe6 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2258,10 +2258,56 @@ static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. It may be quoted with escape sequences
+ * or unquoted without escape sequences. When unquoted, it may only contain a
+ * space if `include_spaces` is nonzero.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp,
+		int include_spaces, const char *field)
+{
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		if (include_spaces)
+			*endp = p + strlen(p);
+		else
+			*endp = strchrnul(p, ' ');
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. It may contain spaces even when unquoted.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * It may not contain spaces when unquoted. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p,
+		const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2299,11 +2345,8 @@ static void file_change_m(const char *p, struct branch *b)
 	}
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2367,48 +2410,29 @@ static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
 	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
 	s = s_uq.buf;
 
-	endp++;
-	if (!*endp)
+	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
 	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	parse_path_eol(&d_uq, p, "dest");
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3152,6 +3176,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
+	static struct strbuf uq = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3168,16 +3193,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	strbuf_reset(&uq);
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 60e30fed3c..de2f1304e8 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2142,6 +2142,7 @@ test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2790,7 +2791,7 @@ test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3060,6 +3061,261 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <dataref> has a special
+# case. Test every occurrence of <path> in the grammar against every error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c (for setup)
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		test_when_finished "git branch -D S-path-eol" &&
+
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/S-path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/S-path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 $path
+
+		commit refs/heads/S-path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D $path
+
+		commit refs/heads/S-path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c $path
+
+		commit refs/heads/S-path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c $path
+
+		ls :305 $path
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob2\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		(
+			printf "100644 blob $blob1\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t$unquoted_path\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path (for setup)
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		test_when_finished "git branch -D S-path-space" &&
+
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 $path
+
+		commit refs/heads/S-path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C $path hello2.c
+
+		commit refs/heads/S-path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R $path hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob\t$unquoted_path\n" &&
+			printf "100644 blob $blob\thello2.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		$prefix$path$suffix
+		EOF
+
+		test_grep "$err_grep" err
+	'
+}
+
+test_path_base_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
+	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ''
+	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"' 'x' "Garbage after $field"
+	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c"' ' ' "Garbage after $field"
+}
+test_path_eol_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_eol_quoted_fail "$change" "$prefix" "$field"
+}
+test_path_space_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ' world.c'
+	test_path_fail "$change" "missing space after quoted $field"   "$prefix" '"hello.c"' 'x world.c' "Missing space after $field"
+	test_path_fail "$change" "missing space after unquoted $field" "$prefix" 'hello.c'   ''          "Missing space after $field"
+}
+
+test_path_eol_fail   filemodify       'M 100644 :1 ' path
+test_path_eol_fail   filedelete       'D '           path
+test_path_space_fail filecopy         'C '           source
+test_path_eol_fail   filecopy         'C hello.c '   dest
+test_path_space_fail filerename       'R '           source
+test_path_eol_fail   filerename       'R hello.c '   dest
+test_path_eol_fail   'ls (in commit)' 'ls :2 '       path
+
+# When 'ls' has no <dataref>, the <path> must be quoted.
+test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
+
 ###
 ### series T (ls)
 ###
-- 
2.44.0




* [PATCH v4 2/8] fast-import: directly use strbufs for paths
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-12  8:02       ` [PATCH v4 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
                         ` (6 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.

Directly use the strbufs now instead of an alias.

Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.

Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 64 ++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index ce9231afe6..8f6312fbaf 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2307,7 +2307,7 @@ static void parse_path_space(struct strbuf *sb, const char *p,
 
 static void file_change_m(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2344,13 +2344,12 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
 
 	/* Git does not track empty, non-toplevel directories. */
-	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
-		tree_content_remove(&b->branch_tree, p, NULL, 0);
+	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
+		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
 		return;
 	}
 
@@ -2371,10 +2370,6 @@ static void file_change_m(const char *p, struct branch *b)
 		if (S_ISDIR(mode))
 			die("Directories cannot be specified 'inline': %s",
 				command_buf.buf);
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		while (read_next_command() != EOF) {
 			const char *v;
 			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
@@ -2400,55 +2395,51 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*p) {
+	if (!*path.buf) {
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
 		return;
 	}
-	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
+	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
 }
 
 static void file_change_d(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_remove(&b->branch_tree, p, NULL, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *s, *d;
-	static struct strbuf s_uq = STRBUF_INIT;
-	static struct strbuf d_uq = STRBUF_INIT;
+	static struct strbuf source = STRBUF_INIT;
+	static struct strbuf dest = STRBUF_INIT;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	parse_path_space(&s_uq, p, &p, "source");
-	s = s_uq.buf;
+	strbuf_reset(&source);
+	parse_path_space(&source, p, &p, "source");
 
 	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-	strbuf_reset(&d_uq);
-	parse_path_eol(&d_uq, p, "dest");
-	d = d_uq.buf;
+	strbuf_reset(&dest);
+	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
-		tree_content_remove(&b->branch_tree, s, &leaf, 1);
+		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
 	else
-		tree_content_get(&b->branch_tree, s, &leaf, 1);
+		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
-		die("Path %s not in branch", s);
-	if (!*d) {	/* C "path/to/subdir" "" */
+		die("Path %s not in branch", source.buf);
+	if (!*dest.buf) {	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
 		return;
 	}
-	tree_content_set(&b->branch_tree, d,
+	tree_content_set(&b->branch_tree, dest.buf,
 		&leaf.versions[1].oid,
 		leaf.versions[1].mode,
 		leaf.tree);
@@ -3176,7 +3167,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3193,10 +3184,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_get(root, p, &leaf, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_get(root, path.buf, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
 	 * until it is saved.  Save, for simplicity.
@@ -3204,7 +3194,7 @@ static void parse_ls(const char *p, struct branch *b)
 	if (S_ISDIR(leaf.versions[1].mode))
 		store_tree(&leaf);
 
-	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
+	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
 	if (leaf.tree)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
-- 
2.44.0




* [PATCH v4 3/8] fast-import: allow unquoted empty path for root
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-12  8:02       ` [PATCH v4 1/8] fast-import: tighten path unquoting Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 2/8] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 4/8] fast-import: remove dead strbuf Thalia Archibald
                         ` (5 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Ever since filerename was added in f39a946a1f (Support wholesale
directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
(Teach fast-import to recursively copy files/directories, 2007-07-15),
both have produced an error when the destination path is empty. Later,
when support for targeting the root directory with an empty string was
added in 2794ad5244 (fast-import: Allow filemodify to set the root,
2010-10-10), this had the effect of allowing the quoted empty string
(`""`), but forbidding its unquoted variant (``). This seems to have
been intended as simple data validation for parsing two paths, rather
than a syntax restriction, because it was not extended to the other
operations.

All other occurrences of paths (in filemodify, filedelete, the source of
filecopy and filerename, and ls) allow both.

For most of this feature's lifetime, the documentation has not
prescribed the use of quoted empty strings. In e5959106d6
(Documentation/fast-import: put explanation of M 040000 <dataref> "" in
context, 2011-01-15), its documentation was changed from “`<path>` may
also be an empty string (`""`) to specify the root of the tree” to “The
root of the tree can be represented by an empty string as `<path>`”.

Thus, we should assume that some front-ends have depended on this
behavior.

Remove this restriction for the destination paths of filecopy and
filerename and change tests targeting the root to test `""` and ``.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  |   3 -
 t/t9300-fast-import.sh | 363 +++++++++++++++++++++--------------------
 2 files changed, 190 insertions(+), 176 deletions(-)
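
For readers following along, the behavior described above can be sketched
roughly as follows. This is a simplified stand-in and not the code in
builtin/fast-import.c: `split_unquoted_paths` is a hypothetical helper,
quoted paths are deliberately ignored, and the fixed-size source buffer
is an assumption made to keep the example short. The point is only that
an empty destination after the separating space is accepted and means
the root of the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: split the arguments of an
 * unquoted filecopy/filerename command of the form "<source> SP <dest>".
 *
 * The source ends at the first space. Everything after that space is the
 * destination, which may be empty (the root of the tree).
 *
 * Returns 0 on success, -1 if there is no space after the source or if
 * the source does not fit in `src`.
 */
static int split_unquoted_paths(const char *args, char *src, size_t src_cap,
				const char **dest)
{
	const char *sp;
	size_t len;

	assert(args && src && dest);
	sp = strchr(args, ' ');
	if (!sp)
		return -1;	/* missing space after source */
	len = (size_t)(sp - args);
	if (len >= src_cap)
		return -1;	/* source too long for the caller's buffer */
	memcpy(src, args, len);
	src[len] = '\0';
	*dest = sp + 1;		/* points at "" when the destination is the root */
	return 0;
}
```

With the arguments of `R  sub` (an empty unquoted source), the source is
empty and the destination is `sub`; with the arguments of `C newdir `
(an empty unquoted destination), the destination is the empty string.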

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 8f6312fbaf..0da7e8a5a5 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2419,9 +2419,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 	strbuf_reset(&source);
 	parse_path_space(&source, p, &p, "source");
-
-	if (!*p)
-		die("Missing dest: %s", command_buf.buf);
 	strbuf_reset(&dest);
 	parse_path_eol(&dest, p, "dest");
 
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index de2f1304e8..13f98e6688 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1059,30 +1059,33 @@ test_expect_success 'M: rename subdirectory to new subdirectory' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'M: rename root to subdirectory' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/M4
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	rename root
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "M: rename root ($root) to subdirectory" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/M4
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		rename root
+		COMMIT
 
-	from refs/heads/M2^0
-	R "" sub
+		from refs/heads/M2^0
+		R $root sub
 
-	INPUT_END
+		INPUT_END
 
-	cat >expect <<-EOF &&
-	:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
-	:100755 100755 $f4id $f4id R100	file4	sub/file4
-	:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
-	:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
-	:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
-	EOF
-	git fast-import <input &&
-	git diff-tree -M -r M4^ M4 >actual &&
-	compare_diff_raw expect actual
-'
+		cat >expect <<-EOF &&
+		:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
+		:100755 100755 $f4id $f4id R100	file4	sub/file4
+		:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
+		:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
+		:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
+		EOF
+		git fast-import <input &&
+		git diff-tree -M -r M4^ M4 >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series N
@@ -1259,49 +1262,52 @@ test_expect_success PIPE 'N: empty directory reads as missing' '
 	test_cmp expect actual
 '
 
-test_expect_success 'N: copy root directory by tree hash' '
-	cat >expect <<-EOF &&
-	:100755 000000 $newf $zero D	file3/newf
-	:100644 000000 $oldf $zero D	file3/oldf
-	EOF
-	root=$(git rev-parse refs/heads/branch^0^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N6
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by tree hash
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy root ($root) by tree hash" '
+		cat >expect <<-EOF &&
+		:100755 000000 $newf $zero D	file3/newf
+		:100644 000000 $oldf $zero D	file3/oldf
+		EOF
+		root_tree=$(git rev-parse refs/heads/branch^0^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N6
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by tree hash
+		COMMIT
 
-	from refs/heads/branch^0
-	M 040000 $root ""
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		M 040000 $root_tree $root
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
+		compare_diff_raw expect actual
+	'
 
-test_expect_success 'N: copy root by path' '
-	cat >expect <<-EOF &&
-	:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
-	:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
-	:100755 100755 $f4id $f4id C100	file4	oldroot/file4
-	:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
-	:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
-	EOF
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N-copy-root-path
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by (empty) path
-	COMMIT
+	test_expect_success "N: copy root ($root) by path" '
+		cat >expect <<-EOF &&
+		:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
+		:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
+		:100755 100755 $f4id $f4id C100	file4	oldroot/file4
+		:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
+		:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
+		EOF
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N-copy-root-path
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by (empty) path
+		COMMIT
 
-	from refs/heads/branch^0
-	C "" oldroot
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		C $root oldroot
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 test_expect_success 'N: delete directory by copying' '
 	cat >expect <<-\EOF &&
@@ -1431,98 +1437,102 @@ test_expect_success 'N: reject foo/ syntax in ls argument' '
 	INPUT_END
 '
 
-test_expect_success 'N: copy to root by id and modify' '
-	echo "hello, world" >expect.foo &&
-	echo hello >expect.bar &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N7
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy to root ($root) by id and modify" '
+		echo "hello, world" >expect.foo &&
+		echo hello >expect.bar &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N7
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N7:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N8
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N7:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N8
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 644 inline foo/foo
-	data <<EOF
-	hello, world
-	EOF
-	INPUT_END
-	git show N8:foo/foo >actual.foo &&
-	git show N8:foo/bar >actual.bar &&
-	test_cmp expect.foo actual.foo &&
-	test_cmp expect.bar actual.bar
-'
+		M 040000 $tree $root
+		M 644 inline foo/foo
+		data <<EOF
+		hello, world
+		EOF
+		INPUT_END
+		git show N8:foo/foo >actual.foo &&
+		git show N8:foo/bar >actual.bar &&
+		test_cmp expect.foo actual.foo &&
+		test_cmp expect.bar actual.bar
+	'
 
-test_expect_success 'N: extract subtree' '
-	branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N9
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	extract subtree branch:newdir
-	COMMIT
+	test_expect_success "N: extract subtree to the root ($root)" '
+		branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N9
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		extract subtree branch:newdir
+		COMMIT
 
-	M 040000 $branch ""
-	C "newdir" ""
-	INPUT_END
-	git fast-import <input &&
-	git diff --exit-code branch:newdir N9
-'
+		M 040000 $branch $root
+		C "newdir" $root
+		INPUT_END
+		git fast-import <input &&
+		git diff --exit-code branch:newdir N9
+	'
 
-test_expect_success 'N: modify subtree, extract it, and modify again' '
-	echo hello >expect.baz &&
-	echo hello, world >expect.qux &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N10
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+	test_expect_success "N: modify subtree, extract it to the root ($root), and modify again" '
+		echo hello >expect.baz &&
+		echo hello, world >expect.qux &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N10
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar/baz
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar/baz
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N10:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N11
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N10:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N11
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 100644 inline foo/bar/qux
-	data <<EOF
-	hello, world
-	EOF
-	R "foo" ""
-	C "bar/qux" "bar/quux"
-	INPUT_END
-	git show N11:bar/baz >actual.baz &&
-	git show N11:bar/qux >actual.qux &&
-	git show N11:bar/quux >actual.quux &&
-	test_cmp expect.baz actual.baz &&
-	test_cmp expect.qux actual.qux &&
-	test_cmp expect.qux actual.quux'
+		M 040000 $tree $root
+		M 100644 inline foo/bar/qux
+		data <<EOF
+		hello, world
+		EOF
+		R "foo" $root
+		C "bar/qux" "bar/quux"
+		INPUT_END
+		git show N11:bar/baz >actual.baz &&
+		git show N11:bar/qux >actual.qux &&
+		git show N11:bar/quux >actual.quux &&
+		test_cmp expect.baz actual.baz &&
+		test_cmp expect.qux actual.qux &&
+		test_cmp expect.qux actual.quux
+	'
+done
 
 ###
 ### series O
@@ -3067,6 +3077,7 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 # There are two sorts of ways a path can be parsed, depending on whether it is
 # the last field on the line. Additionally, ls without a <dataref> has a special
 # case. Test every occurrence of <path> in the grammar against every error case.
+# Paths for the root (empty strings) are tested elsewhere.
 #
 
 #
@@ -3321,16 +3332,19 @@ test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
 ###
 # Setup is carried over from series S.
 
-test_expect_success 'T: ls root tree' '
-	sed -e "s/Z\$//" >expect <<-EOF &&
-	040000 tree $(git rev-parse S^{tree})	Z
-	EOF
-	sha1=$(git rev-parse --verify S) &&
-	git fast-import --import-marks=marks <<-EOF >actual &&
-	ls $sha1 ""
-	EOF
-	test_cmp expect actual
-'
+for root in '""' ''
+do
+	test_expect_success "T: ls root ($root) tree" '
+		sed -e "s/Z\$//" >expect <<-EOF &&
+		040000 tree $(git rev-parse S^{tree})	Z
+		EOF
+		sha1=$(git rev-parse --verify S) &&
+		git fast-import --import-marks=marks <<-EOF >actual &&
+		ls $sha1 $root
+		EOF
+		test_cmp expect actual
+	'
+done
 
 test_expect_success 'T: delete branch' '
 	git branch to-delete &&
@@ -3432,30 +3446,33 @@ test_expect_success 'U: validate directory delete result' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'U: filedelete root succeeds' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/U
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	must succeed
-	COMMIT
-	from refs/heads/U^0
-	D ""
+for root in '""' ''
+do
+	test_expect_success "U: filedelete root ($root) succeeds" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/U-delete-root
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		must succeed
+		COMMIT
+		from refs/heads/U^0
+		D $root
 
-	INPUT_END
+		INPUT_END
 
-	git fast-import <input
-'
+		git fast-import <input
+	'
 
-test_expect_success 'U: validate root delete result' '
-	cat >expect <<-EOF &&
-	:100644 000000 $f7id $ZERO_OID D	hello.c
-	EOF
+	test_expect_success "U: validate root ($root) delete result" '
+		cat >expect <<-EOF &&
+		:100644 000000 $f7id $ZERO_OID D	hello.c
+		EOF
 
-	git diff-tree -M -r U^1 U >actual &&
+		git diff-tree -M -r U U-delete-root >actual &&
 
-	compare_diff_raw expect actual
-'
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series V (checkpoint)
-- 
2.44.0




* [PATCH v4 4/8] fast-import: remove dead strbuf
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (2 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 5/8] fast-import: improve documentation for path quoting Thalia Archibald
                         ` (4 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The strbuf in `note_change_n` is meant to copy the remainder of `p` before
potentially invalidating it when reading the next line. However, `p` is
not used after that point. It has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0da7e8a5a5..7a398dc975 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2444,7 +2444,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
 {
-	static struct strbuf uq = STRBUF_INIT;
 	struct object_entry *oe;
 	struct branch *s;
 	struct object_id oid, commit_oid;
@@ -2509,10 +2508,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
 		die("Invalid ref name or SHA1 expression: %s", p);
 
 	if (inline_data) {
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		read_next_command();
 		parse_and_store_blob(&last_blob, &oid, 0);
 	} else if (oe) {
-- 
2.44.0




* [PATCH v4 5/8] fast-import: improve documentation for path quoting
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (3 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 4/8] fast-import: remove dead strbuf Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 6/8] fast-import: document C-style escapes for paths Thalia Archibald
                         ` (3 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The documentation describes which characters cannot appear in an
unquoted path, but not their semantics. Reframe it as a definition of
unquoted paths. From the perspective of the parser, whether a path
starts with `"` is what determines whether it is parsed as quoted or
unquoted.

The restrictions on characters in unquoted paths (a leading `"`, LF,
and spaces) are explained in the quoted paragraph. Move them to the
unquoted paragraph and reword.

The restriction that the source paths of filecopy and filerename cannot
contain SP is only stated in their respective sections. Restate it in
the <path> section.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)
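
From a front-end's point of view, the rules described above amount to a
small predicate. The sketch below is only an illustration of the
documented rules, not code from Git: `path_needs_quoting` and its
`before_space` parameter are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: decide whether a front-end
 * must emit `path` as a C-style quoted string.
 *
 * `before_space` is nonzero for the source <path> of filecopy or
 * filerename, where an unquoted path ends at the first SP.
 */
static int path_needs_quoting(const char *path, int before_space)
{
	assert(path != NULL);
	if (path[0] == '"')
		return 1;	/* would be parsed as a quoted string */
	if (strchr(path, '\n'))
		return 1;	/* LF ends the command line */
	if (before_space && strchr(path, ' '))
		return 1;	/* SP would end the source path early */
	return 0;
}
```

Note that a double quote in the middle of a path does not force quoting;
only a leading one does, because that is what the parser looks at.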

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..1882758b8a 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -630,18 +630,24 @@ in octal.  Git only supports the following modes:
 In both formats `<path>` is the complete path of the file to be added
 (if not already existing) or modified (if already existing).
 
-A `<path>` string must use UNIX-style directory separators (forward
-slash `/`), may contain any byte other than `LF`, and must not
-start with double quote (`"`).
+A `<path>` can be written as unquoted bytes or a C-style quoted string.
 
-A path can use C-style string quoting; this is accepted in all cases
-and mandatory if the filename starts with double quote or contains
-`LF`. In C-style quoting, the complete name should be surrounded with
-double quotes, and any `LF`, backslash, or double quote characters
-must be escaped by preceding them with a backslash (e.g.,
-`"path/with\n, \\ and \" in it"`).
+When a `<path>` does not start with a double quote (`"`), it is an
+unquoted string and is parsed as literal bytes without any escape
+sequences. However, if the filename contains `LF` or starts with double
+quote, it cannot be represented as an unquoted string and must be
+quoted. Additionally, the source `<path>` in `filecopy` or `filerename`
+must be quoted if it contains SP.
 
-The value of `<path>` must be in canonical form. That is it must not:
+When a `<path>` starts with a double quote (`"`), it is a C-style quoted
+string, where the complete filename is enclosed in a pair of double
+quotes and escape sequences are used. Certain characters must be escaped
+by preceding them with a backslash: `LF` is written as `\n`, backslash
+as `\\`, and double quote as `\"`. All filenames can be represented as
+quoted strings.
+
+A `<path>` must use UNIX-style directory separators (forward slash `/`)
+and its value must be in canonical form. That is it must not:
 
 * contain an empty directory component (e.g. `foo//bar` is invalid),
 * end with a directory separator (e.g. `foo/` is invalid),
-- 
2.44.0




* [PATCH v4 6/8] fast-import: document C-style escapes for paths
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (4 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 5/8] fast-import: improve documentation for path quoting Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
                         ` (2 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Simply saying “C-style” string quoting is imprecise, as only a subset of
C escapes is supported. Document the exact escapes.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt |  6 +++++-
 t/t9300-fast-import.sh            | 10 ++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)
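
To make the documented escape set concrete, here is a rough decoder for
it. This is a simplified sketch and not Git's `unquote_c_style`:
`unquote_sketch`, `simple_escape` and `is_octal` are hypothetical names,
the caller-supplied output buffer is an assumption, and rejecting octal
values above 0377 is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte for a single-letter escape, or -1 if unsupported. */
static int simple_escape(char c)
{
	switch (c) {
	case 'a': return '\a';
	case 'b': return '\b';
	case 'f': return '\f';
	case 'n': return '\n';
	case 'r': return '\r';
	case 't': return '\t';
	case 'v': return '\v';
	case '\\': return '\\';
	case '"': return '"';
	default: return -1;
	}
}

static int is_octal(char c)
{
	return c >= '0' && c <= '7';
}

/*
 * Hypothetical, simplified unquoter, for illustration only.
 * Decodes a string that starts with '"' into `out` (at most `cap` bytes,
 * not NUL-terminated) and stores the decoded length in `*len`.
 * Returns 0 on success, -1 on malformed input or if `out` is too small.
 */
static int unquote_sketch(const char *in, unsigned char *out, size_t cap,
			  size_t *len)
{
	size_t n = 0;

	assert(in && out && len);
	if (*in++ != '"')
		return -1;
	while (*in && *in != '"') {
		int byte = (unsigned char)*in++;

		if (byte == '\\') {
			if (is_octal(in[0])) {
				/* exactly three octal digits are required */
				if (!is_octal(in[1]) || !is_octal(in[2]))
					return -1;
				byte = (in[0] - '0') * 64 +
				       (in[1] - '0') * 8 +
				       (in[2] - '0');
				if (byte > 0377)
					return -1;
				in += 3;
			} else {
				byte = simple_escape(*in);
				if (byte < 0)
					return -1;	/* e.g. "\x" is not supported */
				in++;
			}
		}
		if (n == cap)
			return -1;
		out[n++] = (unsigned char)byte;
	}
	if (*in != '"')
		return -1;	/* unclosed quote */
	*len = n;
	return 0;
}
```

Given the input used in the new tests, `"\150\151\056\143"`, the octal
values 0150, 0151, 056 and 0143 are the bytes `h`, `i`, `.` and `c`, so
the decoded path is `hi.c`.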

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 1882758b8a..c6082c3b49 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -643,7 +643,11 @@ When a `<path>` starts with a double quote (`"`), it is a C-style quoted
 string, where the complete filename is enclosed in a pair of double
 quotes and escape sequences are used. Certain characters must be escaped
 by preceding them with a backslash: `LF` is written as `\n`, backslash
-as `\\`, and double quote as `\"`. All filenames can be represented as
+as `\\`, and double quote as `\"`. Some characters may optionally be
+written with escape sequences: `\a` for bell, `\b` for backspace, `\f`
+for form feed, `\n` for line feed, `\r` for carriage return, `\t` for
+horizontal tab, and `\v` for vertical tab. Any byte can be written with
+3-digit octal codes (e.g., `\033`). All filenames can be represented as
 quoted strings.
 
 A `<path>` must use UNIX-style directory separators (forward slash `/`)
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 13f98e6688..5cde8f8d01 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3189,8 +3189,9 @@ test_path_eol_success () {
 	'
 }
 
-test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
-test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '
+test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
 
 #
 # Valid paths before a space: filecopy (source) and filerename (source).
@@ -3256,8 +3257,9 @@ test_path_space_success () {
 	'
 }
 
-test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
-test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
+test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'
 
 #
 # Test a single commit change with an invalid path. Run it with all occurrences
-- 
2.44.0




* [PATCH v4 7/8] fast-import: forbid escaped NUL in paths
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (5 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 6/8] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-12  8:03       ` [PATCH v4 8/8] fast-import: make comments more precise Thalia Archibald
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.

When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.

fast-import still has other issues with NUL, but those will be addressed
later.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt | 1 +
 builtin/fast-import.c             | 2 ++
 t/t9300-fast-import.sh            | 1 +
 3 files changed, 4 insertions(+)
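
The idea behind the check can be sketched in isolation. The helper below
is a hypothetical illustration, not part of the patch: it assumes, as
with a strbuf, that the buffer holds `len` bytes of data followed by a
terminating NUL at `buf[len]`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: report whether a buffer of
 * `len` data bytes (with a terminating NUL at buf[len]) contains an
 * embedded NUL.
 *
 * strlen() stops at the first NUL, so it is shorter than `len` exactly
 * when a NUL occurs before the end of the data.
 */
static int has_embedded_nul(const char *buf, size_t len)
{
	assert(buf != NULL);
	return strlen(buf) != len;
}
```

Without such a check, any later code that treats the buffer as a C
string would silently see only the part before the first NUL.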

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index c6082c3b49..8b6dde45f1 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -661,6 +661,7 @@ and its value must be in canonical form. That is it must not:
 
 The root of the tree can be represented by an empty string as `<path>`.
 
+`<path>` cannot contain NUL, either literally or escaped as `\000`.
 It is recommended that `<path>` always be encoded using UTF-8.
 
 `filedelete`
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 7a398dc975..98096b6fa7 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2269,6 +2269,8 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp,
 	if (*p == '"') {
 		if (unquote_c_style(sb, p, endp))
 			die("Invalid %s: %s", field, command_buf.buf);
+		if (strlen(sb->buf) != sb->len)
+			die("NUL in %s: %s", field, command_buf.buf);
 	} else {
 		if (include_spaces)
 			*endp = p + strlen(p);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 5cde8f8d01..1e68426852 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3300,6 +3300,7 @@ test_path_base_fail () {
 	local change="$1" prefix="$2" field="$3" suffix="$4"
 	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
 	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
 }
 test_path_eol_quoted_fail () {
 	local change="$1" prefix="$2" field="$3"
-- 
2.44.0




* [PATCH v4 8/8] fast-import: make comments more precise
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (6 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-12  8:03       ` Thalia Archibald
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-12  8:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The comment about `*endptr` is somewhat imprecise. The comment on
`parse_mark_ref_space` fell out of sync with the behavior in e814c39c2f
(fast-import: refactor parsing of spaces, 2014-06-18).

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 98096b6fa7..fd23a00150 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2210,7 +2210,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2243,8 +2243,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
-- 
2.44.0




* Re: [PATCH v4 1/8] fast-import: tighten path unquoting
  2024-04-12  8:02       ` [PATCH v4 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-12 16:34         ` Junio C Hamano
  2024-04-13  0:07           ` Thalia Archibald
  0 siblings, 1 reply; 84+ messages in thread
From: Junio C Hamano @ 2024-04-12 16:34 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 782bda007c..ce9231afe6 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2258,10 +2258,56 @@ static uintmax_t parse_mark_ref_space(const char **p)
>  	return mark;
>  }
>  
> +/*
> + * Parse the path string into the strbuf. It may be quoted with escape sequences
> + * or unquoted without escape sequences. When unquoted, it may only contain a
> + * space if `include_spaces` is nonzero.
> + */

It took me three reads to understand the last sentence.  It would
have been easier if it were written as "it may contain a space only
if ...".  I'd also have named it "allow_unquoted_spaces"---it is not like
this function includes extra spaces on top of whatever was given.

> +static void parse_path(struct strbuf *sb, const char *p, const char **endp,
> +		int include_spaces, const char *field)
> +{
> +	if (*p == '"') {
> +		if (unquote_c_style(sb, p, endp))
> +			die("Invalid %s: %s", field, command_buf.buf);
> +	} else {
> +		if (include_spaces)
> +			*endp = p + strlen(p);
> +		else
> +			*endp = strchrnul(p, ' ');
> +		strbuf_add(sb, p, *endp - p);
> +	}
> +}

A very straightforward implementation.  Makes sense.

> +/*
> + * Parse the path string into the strbuf, and complain if this is not the end of
> + * the string. It may contain spaces even when unquoted.
> + */
> +static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
> +{
> +	const char *end;
> +
> +	parse_path(sb, p, &end, 1, field);
> +	if (*end)
> +		die("Garbage after %s: %s", field, command_buf.buf);
> +}

OK.

> +/*
> + * Parse the path string into the strbuf, and ensure it is followed by a space.
> + * It may not contain spaces when unquoted. Update *endp to point to the first
> + * character after the space.
> + */
> +static void parse_path_space(struct strbuf *sb, const char *p,
> +		const char **endp, const char *field)
> +{
> +	parse_path(sb, p, endp, 0, field);
> +	if (**endp != ' ')
> +		die("Missing space after %s: %s", field, command_buf.buf);
> +	(*endp)++;
> +}

OK.

The updated callers that use the above helper functions do read a
lot more easily, while filling the gaps in the original
implementation.  Very nicely done.

Thanks.



* Re: [PATCH v4 1/8] fast-import: tighten path unquoting
  2024-04-12 16:34         ` Junio C Hamano
@ 2024-04-13  0:07           ` Thalia Archibald
  2024-04-13 18:33             ` Junio C Hamano
  0 siblings, 1 reply; 84+ messages in thread
From: Thalia Archibald @ 2024-04-13  0:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

On Apr 12, 2024, at 09:34, Junio C Hamano <gitster@pobox.com> wrote:
> Thalia Archibald <thalia@archibald.dev> writes:
>> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
>> index 782bda007c..ce9231afe6 100644
>> --- a/builtin/fast-import.c
>> +++ b/builtin/fast-import.c
>> @@ -2258,10 +2258,56 @@ static uintmax_t parse_mark_ref_space(const char **p)
>> return mark;
>> }
>> 
>> +/*
>> + * Parse the path string into the strbuf. It may be quoted with escape sequences
>> + * or unquoted without escape sequences. When unquoted, it may only contain a
>> + * space if `include_spaces` is nonzero.
>> + */
> 
> It took me three reads to understand the last sentence.  It would
> have been easier if it were written as "it may contain a space only
> if ...".  I'd also named it "allow_unquoted_spaces"---it is not like
> this function includes extra spaces on top of whatever was given.

Patrick commented on this earlier too:

> On Mar 28, 2024, at 01:21, Patrick Steinhardt <ps@pks.im> wrote:
>> 
>> On Fri, Mar 22, 2024 at 12:03:18AM +0000, Thalia Archibald wrote:
>>> +/*
>>> + * Parse the path string into the strbuf. It may be quoted with escape sequences
>>> + * or unquoted without escape sequences. When unquoted, it may only contain a
>>> + * space if `allow_spaces` is nonzero.
>>> + */
>>> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
>>> +{
>>> + strbuf_reset(sb);
>>> + if (*p == '"') {
>>> + if (unquote_c_style(sb, p, endp))
>>> + die("Invalid %s: %s", field, command_buf.buf);
>>> + } else {
>>> + if (allow_spaces)
>>> + *endp = p + strlen(p);
>> 
>> I wonder whether `stop_at_unquoted_space` might be more telling. It's
>> not like we disallow spaces here, it's that we treat them as the
>> separator to the next field.
> 
> I agree, but I’d rather something shorter, so I’ve changed it to `include_spaces`.

With all that in mind, I think Patrick is right that the best way to
think of this is that space functions as a field separator, conditional
on this flag. In practice, that leads to restrictions on whether you
can write paths that contain spaces without quotes.

As to naming, `allow_spaces` and `include_spaces` are problematic for
the reasons you both have pointed out. I think `stop_at_unquoted_space`
is problematic, because that’s not where it stops when quoted, but
rather at the close quote. I think that `include_unquoted_spaces` is
good, because it describes that spaces are included in this field when
it is an unquoted string. `allow_unquoted_spaces` implies that it’s an
error to have a space, but no such error is raised here.

How’s this change? I’ve reworded the relevant sentence and specified any
“it”s and replaced the “when unquoted, …” qualifier with “unquoted
strings may …” to reduce ambiguity.

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index fd23a00150..2070c78c56 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2259,12 +2259,13 @@ static uintmax_t parse_mark_ref_space(const char **p)
}

/*
- * Parse the path string into the strbuf. It may be quoted with escape sequences
- * or unquoted without escape sequences. When unquoted, it may only contain a
- * space if `include_spaces` is nonzero.
+ * Parse the path string into the strbuf. The path can either be quoted with
+ * escape sequences or unquoted without escape sequences. Unquoted strings may
+ * contain spaces only if `include_unquoted_spaces` is nonzero; otherwise, it
+ * stops parsing at the first space.
 */
static void parse_path(struct strbuf *sb, const char *p, const char **endp,
-		int include_spaces, const char *field)
+		int include_unquoted_spaces, const char *field)
{
	if (*p == '"') {
		if (unquote_c_style(sb, p, endp))
@@ -2272,7 +2273,7 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp,
		if (strlen(sb->buf) != sb->len)
			die("NUL in %s: %s", field, command_buf.buf);
	} else {
-		if (include_spaces)
+		if (include_unquoted_spaces)
			*endp = p + strlen(p);
		else
			*endp = strchrnul(p, ' ');
@@ -2282,7 +2283,7 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp,

/*
 * Parse the path string into the strbuf, and complain if this is not the end of
- * the string. It may contain spaces even when unquoted.
+ * the string. Unquoted strings may contain spaces.
 */
static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
{
@@ -2295,7 +2296,7 @@ static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)

/*
 * Parse the path string into the strbuf, and ensure it is followed by a space.
- * It may not contain spaces when unquoted. Update *endp to point to the first
+ * Unquoted strings may not contain spaces. Update *endp to point to the first
 * character after the space.
 */
static void parse_path_space(struct strbuf *sb, const char *p,


Thalia


* Re: [PATCH v4 1/8] fast-import: tighten path unquoting
  2024-04-13  0:07           ` Thalia Archibald
@ 2024-04-13 18:33             ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-13 18:33 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Patrick Steinhardt, Chris Torek, Elijah Newren

Thalia Archibald <thalia@archibald.dev> writes:

> As to naming, `allow_spaces` and `include_spaces` are problematic for
> the reasons you both have pointed out. I think `stop_at_unquoted_space`
> is problematic, because that’s not where it stops when quoted, but
> rather at the close quote. I think that `include_unquoted_spaces` is
> good, because it describes that spaces are included in this field when
> it is an unquoted string. `allow_unquoted_spaces` implies that its an
> error to have a space, but no such error is raised here.

OK, so the bit tells the function if we are dealing with the last
field on the input line, because the unquoted side needs to know when
to stop reading the path.

	static void parse_path(... int is_last_field ...)
	{
		if (*p == '"') {
			... handling of a quoted path is unchanged
			... regardless of where on the line it appears
		} else {
			/*
			 * unless we know this is the last field,
			 * an unquoted SP is the end of this field.
			 */
			*endp = is_last_field 
                              ? p + strlen(p)
			      : strchrnul(p, ' ');
		}
	}

Another way to look at it is that we are telling the function if a
space in an unquoted path is part of the path or not.

The distinction matters only if we require, in some record type, a
path field that is the last on the line to be quoted when it has a
SP in it, in which case, "is_last_field" is a wrong name, and we
need to say something like space_is_end_of_field_if_unquoted (or we
can reverse the polarity to say unquoted_space_is_part_of_the_path,
include_unquoted_space, etc.).  But if not, I find that "we normally
stop at SP when unquoted but the last field is a special case"
somewhat more natural.  I do not feel too strongly, though.

Thanks.




* [PATCH v5 0/8] fast-import: tighten parsing of paths
  2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
                         ` (7 preceding siblings ...)
  2024-04-12  8:03       ` [PATCH v4 8/8] fast-import: make comments more precise Thalia Archibald
@ 2024-04-14  1:11       ` Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 1/8] fast-import: tighten path unquoting Thalia Archibald
                           ` (8 more replies)
  8 siblings, 9 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:11 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case is when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
>
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.

Changes since v4:
* Refine C comments and parameter name.

Thalia


Thalia Archibald (8):
  fast-import: tighten path unquoting
  fast-import: directly use strbufs for paths
  fast-import: allow unquoted empty path for root
  fast-import: remove dead strbuf
  fast-import: improve documentation for path quoting
  fast-import: document C-style escapes for paths
  fast-import: forbid escaped NUL in paths
  fast-import: make comments more precise

 Documentation/git-fast-import.txt |  31 +-
 builtin/fast-import.c             | 162 ++++----
 t/t9300-fast-import.sh            | 624 +++++++++++++++++++++---------
 3 files changed, 555 insertions(+), 262 deletions(-)

Range-diff against v4:
1:  d6ea8aca46 ! 1:  2c18fe5fe9 fast-import: tighten path unquoting
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
      }
      
     +/*
    -+ * Parse the path string into the strbuf. It may be quoted with escape sequences
    -+ * or unquoted without escape sequences. When unquoted, it may only contain a
    -+ * space if `include_spaces` is nonzero.
    ++ * Parse the path string into the strbuf. The path can either be quoted with
    ++ * escape sequences or unquoted without escape sequences. Unquoted strings may
    ++ * contain spaces only if `is_last_field` is nonzero; otherwise, it stops
    ++ * parsing at the first space.
     + */
     +static void parse_path(struct strbuf *sb, const char *p, const char **endp,
    -+		int include_spaces, const char *field)
    ++		int is_last_field, const char *field)
     +{
     +	if (*p == '"') {
     +		if (unquote_c_style(sb, p, endp))
     +			die("Invalid %s: %s", field, command_buf.buf);
     +	} else {
    -+		if (include_spaces)
    -+			*endp = p + strlen(p);
    -+		else
    -+			*endp = strchrnul(p, ' ');
    ++		/*
    ++		 * Unless we are parsing the last field of a line,
    ++		 * SP is the end of this field.
    ++		 */
    ++		*endp = is_last_field
    ++			? p + strlen(p)
    ++			: strchrnul(p, ' ');
     +		strbuf_add(sb, p, *endp - p);
     +	}
     +}
     +
     +/*
     + * Parse the path string into the strbuf, and complain if this is not the end of
    -+ * the string. It may contain spaces even when unquoted.
    ++ * the string. Unquoted strings may contain spaces.
     + */
     +static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
     +{
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     +
     +/*
     + * Parse the path string into the strbuf, and ensure it is followed by a space.
    -+ * It may not contain spaces when unquoted. Update *endp to point to the first
    ++ * Unquoted strings may not contain spaces. Update *endp to point to the first
     + * character after the space.
     + */
     +static void parse_path_space(struct strbuf *sb, const char *p,
2:  9499f34aae = 2:  4e9f3aa52c fast-import: directly use strbufs for paths
3:  9b1e6b80f5 = 3:  cae5764cec fast-import: allow unquoted empty path for root
4:  1a2b0dc616 = 4:  96ff70895a fast-import: remove dead strbuf
5:  fb0d870d53 = 5:  e1a1b0395d fast-import: improve documentation for path quoting
6:  4b6017ded8 = 6:  08e6fb37be fast-import: document C-style escapes for paths
7:  5b464f4b01 = 7:  a01d0a1b25 fast-import: forbid escaped NUL in paths
8:  6eb66fce45 = 8:  65d7896e39 fast-import: make comments more precise
-- 
2.44.0




* [PATCH v5 1/8] fast-import: tighten path unquoting
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
@ 2024-04-14  1:11         ` Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 2/8] fast-import: directly use strbufs for paths Thalia Archibald
                           ` (7 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:11 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed or not checked.

<path> appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes to the end of
   the line (including any number of SP).
2. For filecopy (source) and filerename (source):
   Try to unquote <path>. If it unquotes without errors, use the
   unquoted version; otherwise, treat it as literal bytes up to, but not
   including, the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is forbidden.
4. For ls:
   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).
5. For ls-commit:
   Unquote <path> and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine instead of matching Git's, which could silently
introduce escapes that would be incorrectly parsed due to this and lead
to data corruption.
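
To make the failure mode concrete, here is a rough standalone sketch. It
is not code from this series: `toy_unquote` is a hypothetical, much
reduced stand-in for Git's unquote_c_style() that only knows `\\` and
`\"`, and `lenient_path` / `strict_path` are illustrative names for the
old fallback and the tightened behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy stand-in for Git's unquote_c_style(), for illustration only.
 * It understands just \\ and \" and rejects every other escape,
 * including \x hex escapes. Returns 0 on success, -1 on error.
 */
static int toy_unquote(char *out, size_t outsz, const char *in)
{
	size_t n = 0;

	assert(out != NULL && in != NULL && outsz > 0);
	if (*in++ != '"')
		return -1; /* not a quoted string at all */
	while (*in && *in != '"') {
		char c = *in++;
		if (c == '\\') {
			c = *in++;
			if (c != '\\' && c != '"')
				return -1; /* unknown escape, e.g. \x */
		}
		if (n + 1 >= outsz)
			return -1; /* output buffer too small */
		out[n++] = c;
	}
	if (*in != '"')
		return -1; /* missing closing quote */
	out[n] = '\0';
	return 0;
}

/* Old behavior (sketch): on any unquoting error, keep the raw bytes. */
static const char *lenient_path(char *buf, size_t bufsz, const char *in)
{
	return toy_unquote(buf, bufsz, in) ? in : buf;
}

/*
 * Tightened behavior (sketch): a path that starts with '"' must
 * unquote cleanly; NULL stands in for die().
 */
static const char *strict_path(char *buf, size_t bufsz, const char *in)
{
	if (*in != '"')
		return in;
	return toy_unquote(buf, bufsz, in) ? NULL : buf;
}
```

With the input `"\xff"`, `lenient_path` hands back the six raw bytes,
quotes included, while `strict_path` reports an error instead of
inventing a file name.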

The documentation states “To use a source path that contains SP the path
must be quoted.”, so it is expected that some implementations depend on
spaces being allowed in paths in the final position. Thus we have two
documented ways to parse paths, so simplify the implementation to that.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes to the end of the line
   (including any number of SP).

2. `parse_path_space` for filecopy (source) and filerename (source):

   If <path> starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes up to, but not including, the
   next SP. It must be followed by SP.

There remain two special cases: The dest <path> in filecopy and filerename
cannot be an unquoted empty string (this will be addressed subsequently)
and <path> in ls-commit must be quoted to disambiguate it from ls.
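
For the unquoted case, the difference between the two entry points comes
down to where the field ends. A minimal sketch, assuming a hypothetical
helper `unquoted_path_len` (not part of the patch) and using plain
strchr() in place of strchrnul() so that it builds without GNU
extensions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: return the length of an
 * unquoted path field starting at p. The last field on a line runs to
 * the end of the string, spaces included; any other field stops at the
 * first SP (or at the end of the string if there is none).
 */
static size_t unquoted_path_len(const char *p, int is_last_field)
{
	const char *end;

	assert(p != NULL);
	if (is_last_field)
		return strlen(p);
	end = strchr(p, ' ');
	return end ? (size_t)(end - p) : strlen(p);
}
```

Given `hello world.c`, this yields 13 when the path is the last field
and 5 when it is not. In the second case the real parse_path_space()
additionally requires that the byte at the stopping point is SP, which
is why an unquoted source path with no following space is rejected.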

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c  | 108 ++++++++++-------
 t/t9300-fast-import.sh | 258 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 322 insertions(+), 44 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 782bda007c..8eba89689b 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2258,10 +2258,60 @@ static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. The path can either be quoted with
+ * escape sequences or unquoted without escape sequences. Unquoted strings may
+ * contain spaces only if `is_last_field` is nonzero; otherwise, it stops
+ * parsing at the first space.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp,
+		int is_last_field, const char *field)
+{
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		/*
+		 * Unless we are parsing the last field of a line,
+		 * SP is the end of this field.
+		 */
+		*endp = is_last_field
+			? p + strlen(p)
+			: strchrnul(p, ' ');
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. Unquoted strings may contain spaces.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * Unquoted strings may not contain spaces. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p,
+		const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2299,11 +2349,8 @@ static void file_change_m(const char *p, struct branch *b)
 	}
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2367,48 +2414,29 @@ static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
 	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
 	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
 	s = s_uq.buf;
 
-	endp++;
-	if (!*endp)
+	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
 	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	parse_path_eol(&d_uq, p, "dest");
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3152,6 +3180,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
+	static struct strbuf uq = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3168,16 +3197,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	strbuf_reset(&uq);
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 60e30fed3c..de2f1304e8 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2142,6 +2142,7 @@ test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2790,7 +2791,7 @@ test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3060,6 +3061,261 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <dataref> has a special
+# case. Test every occurrence of <path> in the grammar against every error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c (for setup)
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		test_when_finished "git branch -D S-path-eol" &&
+
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/S-path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/S-path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 $path
+
+		commit refs/heads/S-path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D $path
+
+		commit refs/heads/S-path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c $path
+
+		commit refs/heads/S-path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c $path
+
+		ls :305 $path
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob2\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		(
+			printf "100644 blob $blob1\t$unquoted_path\n" &&
+			printf "100644 blob $blob1\thello.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t$unquoted_path\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path (for setup)
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	local test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		test_when_finished "git branch -D S-path-space" &&
+
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 $path
+
+		commit refs/heads/S-path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C $path hello2.c
+
+		commit refs/heads/S-path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R $path hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		(
+			printf "100644 blob $blob\t$unquoted_path\n" &&
+			printf "100644 blob $blob\thello2.c\n"
+		) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		$prefix$path$suffix
+		EOF
+
+		test_grep "$err_grep" err
+	'
+}
+
+test_path_base_fail () {
+	local change="$1" prefix="$2" field="$3" suffix="$4"
+	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
+	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ''
+	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"' 'x' "Garbage after $field"
+	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c"' ' ' "Garbage after $field"
+}
+test_path_eol_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_eol_quoted_fail "$change" "$prefix" "$field"
+}
+test_path_space_fail () {
+	local change="$1" prefix="$2" field="$3"
+	test_path_base_fail "$change" "$prefix" "$field" ' world.c'
+	test_path_fail "$change" "missing space after quoted $field"   "$prefix" '"hello.c"' 'x world.c' "Missing space after $field"
+	test_path_fail "$change" "missing space after unquoted $field" "$prefix" 'hello.c'   ''          "Missing space after $field"
+}
+
+test_path_eol_fail   filemodify       'M 100644 :1 ' path
+test_path_eol_fail   filedelete       'D '           path
+test_path_space_fail filecopy         'C '           source
+test_path_eol_fail   filecopy         'C hello.c '   dest
+test_path_space_fail filerename       'R '           source
+test_path_eol_fail   filerename       'R hello.c '   dest
+test_path_eol_fail   'ls (in commit)' 'ls :2 '       path
+
+# When 'ls' has no <dataref>, the <path> must be quoted.
+test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
+
 ###
 ### series T (ls)
 ###
-- 
2.44.0




* [PATCH v5 2/8] fast-import: directly use strbufs for paths
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 1/8] fast-import: tighten path unquoting Thalia Archibald
@ 2024-04-14  1:11         ` Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
                           ` (6 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:11 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.

Directly use the strbufs now instead of an alias.

Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.

Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 64 ++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 8eba89689b..765429a66c 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2311,7 +2311,7 @@ static void parse_path_space(struct strbuf *sb, const char *p,
 
 static void file_change_m(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2348,13 +2348,12 @@ static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
 
 	/* Git does not track empty, non-toplevel directories. */
-	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
-		tree_content_remove(&b->branch_tree, p, NULL, 0);
+	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *path.buf) {
+		tree_content_remove(&b->branch_tree, path.buf, NULL, 0);
 		return;
 	}
 
@@ -2375,10 +2374,6 @@ static void file_change_m(const char *p, struct branch *b)
 		if (S_ISDIR(mode))
 			die("Directories cannot be specified 'inline': %s",
 				command_buf.buf);
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		while (read_next_command() != EOF) {
 			const char *v;
 			if (skip_prefix(command_buf.buf, "cat-blob ", &v))
@@ -2404,55 +2399,51 @@ static void file_change_m(const char *p, struct branch *b)
 				command_buf.buf);
 	}
 
-	if (!*p) {
+	if (!*path.buf) {
 		tree_content_replace(&b->branch_tree, &oid, mode, NULL);
 		return;
 	}
-	tree_content_set(&b->branch_tree, p, &oid, mode, NULL);
+	tree_content_set(&b->branch_tree, path.buf, &oid, mode, NULL);
 }
 
 static void file_change_d(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_remove(&b->branch_tree, p, NULL, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
 }
 
 static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *s, *d;
-	static struct strbuf s_uq = STRBUF_INIT;
-	static struct strbuf d_uq = STRBUF_INIT;
+	static struct strbuf source = STRBUF_INIT;
+	static struct strbuf dest = STRBUF_INIT;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	parse_path_space(&s_uq, p, &p, "source");
-	s = s_uq.buf;
+	strbuf_reset(&source);
+	parse_path_space(&source, p, &p, "source");
 
 	if (!*p)
 		die("Missing dest: %s", command_buf.buf);
-	strbuf_reset(&d_uq);
-	parse_path_eol(&d_uq, p, "dest");
-	d = d_uq.buf;
+	strbuf_reset(&dest);
+	parse_path_eol(&dest, p, "dest");
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
-		tree_content_remove(&b->branch_tree, s, &leaf, 1);
+		tree_content_remove(&b->branch_tree, source.buf, &leaf, 1);
 	else
-		tree_content_get(&b->branch_tree, s, &leaf, 1);
+		tree_content_get(&b->branch_tree, source.buf, &leaf, 1);
 	if (!leaf.versions[1].mode)
-		die("Path %s not in branch", s);
-	if (!*d) {	/* C "path/to/subdir" "" */
+		die("Path %s not in branch", source.buf);
+	if (!*dest.buf) {	/* C "path/to/subdir" "" */
 		tree_content_replace(&b->branch_tree,
 			&leaf.versions[1].oid,
 			leaf.versions[1].mode,
 			leaf.tree);
 		return;
 	}
-	tree_content_set(&b->branch_tree, d,
+	tree_content_set(&b->branch_tree, dest.buf,
 		&leaf.versions[1].oid,
 		leaf.versions[1].mode,
 		leaf.tree);
@@ -3180,7 +3171,7 @@ static void print_ls(int mode, const unsigned char *hash, const char *path)
 
 static void parse_ls(const char *p, struct branch *b)
 {
-	static struct strbuf uq = STRBUF_INIT;
+	static struct strbuf path = STRBUF_INIT;
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
 
@@ -3197,10 +3188,9 @@ static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	strbuf_reset(&uq);
-	parse_path_eol(&uq, p, "path");
-	p = uq.buf;
-	tree_content_get(root, p, &leaf, 1);
+	strbuf_reset(&path);
+	parse_path_eol(&path, p, "path");
+	tree_content_get(root, path.buf, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
 	 * until it is saved.  Save, for simplicity.
@@ -3208,7 +3198,7 @@ static void parse_ls(const char *p, struct branch *b)
 	if (S_ISDIR(leaf.versions[1].mode))
 		store_tree(&leaf);
 
-	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, p);
+	print_ls(leaf.versions[1].mode, leaf.versions[1].oid.hash, path.buf);
 	if (leaf.tree)
 		release_tree_content_recursive(leaf.tree);
 	if (!b || root != &b->branch_tree)
-- 
2.44.0




* [PATCH v5 3/8] fast-import: allow unquoted empty path for root
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 1/8] fast-import: tighten path unquoting Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 2/8] fast-import: directly use strbufs for paths Thalia Archibald
@ 2024-04-14  1:11         ` Thalia Archibald
  2024-04-14  1:11         ` [PATCH v5 4/8] fast-import: remove dead strbuf Thalia Archibald
                           ` (5 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:11 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Ever since filerename was added in f39a946a1f (Support wholesale
directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
(Teach fast-import to recursively copy files/directories, 2007-07-15),
both have produced an error when the destination path is empty. Later,
when support for targeting the root directory with an empty string was
added in 2794ad5244 (fast-import: Allow filemodify to set the root,
2010-10-10), this had the effect of allowing the quoted empty string
(`""`), but forbidding its unquoted variant (``). This seems to have
been intended as simple data validation for parsing two paths, rather
than a syntax restriction, because it was not extended to the other
operations.

All other occurrences of paths (in filemodify, filedelete, the source of
filecopy and filerename, and ls) allow both.

For most of this feature's lifetime, the documentation has not
prescribed the use of quoted empty strings. In e5959106d6
(Documentation/fast-import: put explanation of M 040000 <dataref> "" in
context, 2011-01-15), its documentation was changed from “`<path>` may
also be an empty string (`""`) to specify the root of the tree” to “The
root of the tree can be represented by an empty string as `<path>`”.

Thus, we should assume that some front-ends have depended on this
behavior.

Remove this restriction for the destination paths of filecopy and
filerename and change tests targeting the root to test `""` and ``.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
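For reviewers, the effect on the destination of filecopy and filerename
can be sketched roughly as below. This is only an illustration for the
unquoted case: `sketch_split_source` and `sketch_is_root` are
hypothetical names that do not exist in fast-import.c, and quoted paths
are deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split the arguments of "C " or "R " when the
 * source is unquoted. The source ends at the first SP. Returns 0 on
 * success, or -1 if there is no SP after the source.
 */
static int sketch_split_source(const char *args, size_t *source_len,
			       const char **dest)
{
	const char *sp;

	assert(args && source_len && dest);
	sp = strchr(args, ' ');
	if (!sp)
		return -1;	/* "Missing space after source" */
	*source_len = (size_t)(sp - args);
	*dest = sp + 1;		/* may be empty after this patch */
	return 0;
}

/* An empty path, quoted or not, names the root of the tree. */
static int sketch_is_root(const char *path)
{
	return !*path;
}
```

With the arguments `foo ` (note the trailing SP), the split succeeds and
the destination is the empty string, which now targets the root instead
of dying with "Missing dest". With `foo` and no SP, the split still
fails.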
 builtin/fast-import.c  |   3 -
 t/t9300-fast-import.sh | 363 +++++++++++++++++++++--------------------
 2 files changed, 190 insertions(+), 176 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 765429a66c..c8a1e3ef3d 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2423,9 +2423,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 	strbuf_reset(&source);
 	parse_path_space(&source, p, &p, "source");
-
-	if (!*p)
-		die("Missing dest: %s", command_buf.buf);
 	strbuf_reset(&dest);
 	parse_path_eol(&dest, p, "dest");
 
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index de2f1304e8..13f98e6688 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1059,30 +1059,33 @@ test_expect_success 'M: rename subdirectory to new subdirectory' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'M: rename root to subdirectory' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/M4
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	rename root
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "M: rename root ($root) to subdirectory" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/M4
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		rename root
+		COMMIT
 
-	from refs/heads/M2^0
-	R "" sub
+		from refs/heads/M2^0
+		R $root sub
 
-	INPUT_END
+		INPUT_END
 
-	cat >expect <<-EOF &&
-	:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
-	:100755 100755 $f4id $f4id R100	file4	sub/file4
-	:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
-	:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
-	:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
-	EOF
-	git fast-import <input &&
-	git diff-tree -M -r M4^ M4 >actual &&
-	compare_diff_raw expect actual
-'
+		cat >expect <<-EOF &&
+		:100644 100644 $oldf $oldf R100	file2/oldf	sub/file2/oldf
+		:100755 100755 $f4id $f4id R100	file4	sub/file4
+		:100755 100755 $newf $newf R100	i/am/new/to/you	sub/i/am/new/to/you
+		:100755 100755 $f6id $f6id R100	newdir/exec.sh	sub/newdir/exec.sh
+		:100644 100644 $f5id $f5id R100	newdir/interesting	sub/newdir/interesting
+		EOF
+		git fast-import <input &&
+		git diff-tree -M -r M4^ M4 >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series N
@@ -1259,49 +1262,52 @@ test_expect_success PIPE 'N: empty directory reads as missing' '
 	test_cmp expect actual
 '
 
-test_expect_success 'N: copy root directory by tree hash' '
-	cat >expect <<-EOF &&
-	:100755 000000 $newf $zero D	file3/newf
-	:100644 000000 $oldf $zero D	file3/oldf
-	EOF
-	root=$(git rev-parse refs/heads/branch^0^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N6
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by tree hash
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy root ($root) by tree hash" '
+		cat >expect <<-EOF &&
+		:100755 000000 $newf $zero D	file3/newf
+		:100644 000000 $oldf $zero D	file3/oldf
+		EOF
+		root_tree=$(git rev-parse refs/heads/branch^0^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N6
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by tree hash
+		COMMIT
 
-	from refs/heads/branch^0
-	M 040000 $root ""
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		M 040000 $root_tree $root
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r N4 N6 >actual &&
+		compare_diff_raw expect actual
+	'
 
-test_expect_success 'N: copy root by path' '
-	cat >expect <<-EOF &&
-	:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
-	:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
-	:100755 100755 $f4id $f4id C100	file4	oldroot/file4
-	:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
-	:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
-	EOF
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N-copy-root-path
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy root directory by (empty) path
-	COMMIT
+	test_expect_success "N: copy root ($root) by path" '
+		cat >expect <<-EOF &&
+		:100755 100755 $newf $newf C100	file2/newf	oldroot/file2/newf
+		:100644 100644 $oldf $oldf C100	file2/oldf	oldroot/file2/oldf
+		:100755 100755 $f4id $f4id C100	file4	oldroot/file4
+		:100755 100755 $f6id $f6id C100	newdir/exec.sh	oldroot/newdir/exec.sh
+		:100644 100644 $f5id $f5id C100	newdir/interesting	oldroot/newdir/interesting
+		EOF
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N-copy-root-path
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy root directory by (empty) path
+		COMMIT
 
-	from refs/heads/branch^0
-	C "" oldroot
-	INPUT_END
-	git fast-import <input &&
-	git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
-	compare_diff_raw expect actual
-'
+		from refs/heads/branch^0
+		C $root oldroot
+		INPUT_END
+		git fast-import <input &&
+		git diff-tree -C --find-copies-harder -r branch N-copy-root-path >actual &&
+		compare_diff_raw expect actual
+	'
+done
 
 test_expect_success 'N: delete directory by copying' '
 	cat >expect <<-\EOF &&
@@ -1431,98 +1437,102 @@ test_expect_success 'N: reject foo/ syntax in ls argument' '
 	INPUT_END
 '
 
-test_expect_success 'N: copy to root by id and modify' '
-	echo "hello, world" >expect.foo &&
-	echo hello >expect.bar &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N7
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+for root in '""' ''
+do
+	test_expect_success "N: copy to root ($root) by id and modify" '
+		echo "hello, world" >expect.foo &&
+		echo hello >expect.bar &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N7
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N7:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N8
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N7:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N8
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 644 inline foo/foo
-	data <<EOF
-	hello, world
-	EOF
-	INPUT_END
-	git show N8:foo/foo >actual.foo &&
-	git show N8:foo/bar >actual.bar &&
-	test_cmp expect.foo actual.foo &&
-	test_cmp expect.bar actual.bar
-'
+		M 040000 $tree $root
+		M 644 inline foo/foo
+		data <<EOF
+		hello, world
+		EOF
+		INPUT_END
+		git show N8:foo/foo >actual.foo &&
+		git show N8:foo/bar >actual.bar &&
+		test_cmp expect.foo actual.foo &&
+		test_cmp expect.bar actual.bar
+	'
 
-test_expect_success 'N: extract subtree' '
-	branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
-	cat >input <<-INPUT_END &&
-	commit refs/heads/N9
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	extract subtree branch:newdir
-	COMMIT
+	test_expect_success "N: extract subtree to the root ($root)" '
+		branch=$(git rev-parse --verify refs/heads/branch^{tree}) &&
+		cat >input <<-INPUT_END &&
+		commit refs/heads/N9
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		extract subtree branch:newdir
+		COMMIT
 
-	M 040000 $branch ""
-	C "newdir" ""
-	INPUT_END
-	git fast-import <input &&
-	git diff --exit-code branch:newdir N9
-'
+		M 040000 $branch $root
+		C "newdir" $root
+		INPUT_END
+		git fast-import <input &&
+		git diff --exit-code branch:newdir N9
+	'
 
-test_expect_success 'N: modify subtree, extract it, and modify again' '
-	echo hello >expect.baz &&
-	echo hello, world >expect.qux &&
-	git fast-import <<-SETUP_END &&
-	commit refs/heads/N10
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	hello, tree
-	COMMIT
+	test_expect_success "N: modify subtree, extract it to the root ($root), and modify again" '
+		echo hello >expect.baz &&
+		echo hello, world >expect.qux &&
+		git fast-import <<-SETUP_END &&
+		commit refs/heads/N10
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		hello, tree
+		COMMIT
 
-	deleteall
-	M 644 inline foo/bar/baz
-	data <<EOF
-	hello
-	EOF
-	SETUP_END
+		deleteall
+		M 644 inline foo/bar/baz
+		data <<EOF
+		hello
+		EOF
+		SETUP_END
 
-	tree=$(git rev-parse --verify N10:) &&
-	git fast-import <<-INPUT_END &&
-	commit refs/heads/N11
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	copy to root by id and modify
-	COMMIT
+		tree=$(git rev-parse --verify N10:) &&
+		git fast-import <<-INPUT_END &&
+		commit refs/heads/N11
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		copy to root by id and modify
+		COMMIT
 
-	M 040000 $tree ""
-	M 100644 inline foo/bar/qux
-	data <<EOF
-	hello, world
-	EOF
-	R "foo" ""
-	C "bar/qux" "bar/quux"
-	INPUT_END
-	git show N11:bar/baz >actual.baz &&
-	git show N11:bar/qux >actual.qux &&
-	git show N11:bar/quux >actual.quux &&
-	test_cmp expect.baz actual.baz &&
-	test_cmp expect.qux actual.qux &&
-	test_cmp expect.qux actual.quux'
+		M 040000 $tree $root
+		M 100644 inline foo/bar/qux
+		data <<EOF
+		hello, world
+		EOF
+		R "foo" $root
+		C "bar/qux" "bar/quux"
+		INPUT_END
+		git show N11:bar/baz >actual.baz &&
+		git show N11:bar/qux >actual.qux &&
+		git show N11:bar/quux >actual.quux &&
+		test_cmp expect.baz actual.baz &&
+		test_cmp expect.qux actual.qux &&
+		test_cmp expect.qux actual.quux
+	'
+done
 
 ###
 ### series O
@@ -3067,6 +3077,7 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
 # There are two sorts of ways a path can be parsed, depending on whether it is
 # the last field on the line. Additionally, ls without a <dataref> has a special
 # case. Test every occurrence of <path> in the grammar against every error case.
+# Paths for the root (empty strings) are tested elsewhere.
 #
 
 #
@@ -3321,16 +3332,19 @@ test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path
 ###
 # Setup is carried over from series S.
 
-test_expect_success 'T: ls root tree' '
-	sed -e "s/Z\$//" >expect <<-EOF &&
-	040000 tree $(git rev-parse S^{tree})	Z
-	EOF
-	sha1=$(git rev-parse --verify S) &&
-	git fast-import --import-marks=marks <<-EOF >actual &&
-	ls $sha1 ""
-	EOF
-	test_cmp expect actual
-'
+for root in '""' ''
+do
+	test_expect_success "T: ls root ($root) tree" '
+		sed -e "s/Z\$//" >expect <<-EOF &&
+		040000 tree $(git rev-parse S^{tree})	Z
+		EOF
+		sha1=$(git rev-parse --verify S) &&
+		git fast-import --import-marks=marks <<-EOF >actual &&
+		ls $sha1 $root
+		EOF
+		test_cmp expect actual
+	'
+done
 
 test_expect_success 'T: delete branch' '
 	git branch to-delete &&
@@ -3432,30 +3446,33 @@ test_expect_success 'U: validate directory delete result' '
 	compare_diff_raw expect actual
 '
 
-test_expect_success 'U: filedelete root succeeds' '
-	cat >input <<-INPUT_END &&
-	commit refs/heads/U
-	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
-	data <<COMMIT
-	must succeed
-	COMMIT
-	from refs/heads/U^0
-	D ""
+for root in '""' ''
+do
+	test_expect_success "U: filedelete root ($root) succeeds" '
+		cat >input <<-INPUT_END &&
+		commit refs/heads/U-delete-root
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		must succeed
+		COMMIT
+		from refs/heads/U^0
+		D $root
 
-	INPUT_END
+		INPUT_END
 
-	git fast-import <input
-'
+		git fast-import <input
+	'
 
-test_expect_success 'U: validate root delete result' '
-	cat >expect <<-EOF &&
-	:100644 000000 $f7id $ZERO_OID D	hello.c
-	EOF
+	test_expect_success "U: validate root ($root) delete result" '
+		cat >expect <<-EOF &&
+		:100644 000000 $f7id $ZERO_OID D	hello.c
+		EOF
 
-	git diff-tree -M -r U^1 U >actual &&
+		git diff-tree -M -r U U-delete-root >actual &&
 
-	compare_diff_raw expect actual
-'
+		compare_diff_raw expect actual
+	'
+done
 
 ###
 ### series V (checkpoint)
-- 
2.44.0




* [PATCH v5 4/8] fast-import: remove dead strbuf
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (2 preceding siblings ...)
  2024-04-14  1:11         ` [PATCH v5 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
@ 2024-04-14  1:11         ` Thalia Archibald
  2024-04-14  1:12         ` [PATCH v5 5/8] fast-import: improve documentation for path quoting Thalia Archibald
                           ` (4 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:11 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The strbuf in `note_change_n` exists to copy the remainder of `p` before
it is potentially invalidated by reading the next line. However, `p` is
not used after that point. It has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 builtin/fast-import.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index c8a1e3ef3d..832d0055f9 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2448,7 +2448,6 @@ static void file_change_cr(const char *p, struct branch *b, int rename)
 
 static void note_change_n(const char *p, struct branch *b, unsigned char *old_fanout)
 {
-	static struct strbuf uq = STRBUF_INIT;
 	struct object_entry *oe;
 	struct branch *s;
 	struct object_id oid, commit_oid;
@@ -2513,10 +2512,6 @@ static void note_change_n(const char *p, struct branch *b, unsigned char *old_fa
 		die("Invalid ref name or SHA1 expression: %s", p);
 
 	if (inline_data) {
-		if (p != uq.buf) {
-			strbuf_addstr(&uq, p);
-			p = uq.buf;
-		}
 		read_next_command();
 		parse_and_store_blob(&last_blob, &oid, 0);
 	} else if (oe) {
-- 
2.44.0




* [PATCH v5 5/8] fast-import: improve documentation for path quoting
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (3 preceding siblings ...)
  2024-04-14  1:11         ` [PATCH v5 4/8] fast-import: remove dead strbuf Thalia Archibald
@ 2024-04-14  1:12         ` Thalia Archibald
  2024-04-14  1:12         ` [PATCH v5 6/8] fast-import: document C-style escapes for paths Thalia Archibald
                           ` (3 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The documentation describes which characters cannot appear in an
unquoted path, but not how such a path is interpreted. Reframe it as a
definition of unquoted paths. From the perspective of the parser,
whether a path starts with `"` is what determines whether it is parsed
as quoted or unquoted.

The restrictions on characters in unquoted paths (a leading `"`, LF, and
spaces) are explained in the paragraph on quoted paths. Move them to the
paragraph on unquoted paths and reword.

The restriction that the source paths of filecopy and filerename cannot
contain SP is only stated in their respective sections. Restate it in
the <path> section.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
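As a reading aid, the rule for when a front-end has to quote a filename
could be sketched as follows. `sketch_needs_quoting` is a hypothetical
name made up for this note, not a function in Git; it only restates the
reworded documentation.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if a filename cannot be written as an
 * unquoted <path>. is_cr_source is nonzero for the source <path> of
 * filecopy or filerename, which is terminated by SP rather than LF.
 */
static int sketch_needs_quoting(const char *name, int is_cr_source)
{
	assert(name);
	if (name[0] == '"')
		return 1;	/* would be parsed as a quoted string */
	if (strchr(name, '\n'))
		return 1;	/* LF ends the command line */
	if (is_cr_source && strchr(name, ' '))
		return 1;	/* SP ends the source field */
	return 0;
}
```

So `hello world.c` may stay unquoted as the last field of a line, but
has to be quoted when it is the source of a copy or rename.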
 Documentation/git-fast-import.txt | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..1882758b8a 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -630,18 +630,24 @@ in octal.  Git only supports the following modes:
 In both formats `<path>` is the complete path of the file to be added
 (if not already existing) or modified (if already existing).
 
-A `<path>` string must use UNIX-style directory separators (forward
-slash `/`), may contain any byte other than `LF`, and must not
-start with double quote (`"`).
+A `<path>` can be written as unquoted bytes or a C-style quoted string.
 
-A path can use C-style string quoting; this is accepted in all cases
-and mandatory if the filename starts with double quote or contains
-`LF`. In C-style quoting, the complete name should be surrounded with
-double quotes, and any `LF`, backslash, or double quote characters
-must be escaped by preceding them with a backslash (e.g.,
-`"path/with\n, \\ and \" in it"`).
+When a `<path>` does not start with a double quote (`"`), it is an
+unquoted string and is parsed as literal bytes without any escape
+sequences. However, if the filename contains `LF` or starts with double
+quote, it cannot be represented as an unquoted string and must be
+quoted. Additionally, the source `<path>` in `filecopy` or `filerename`
+must be quoted if it contains SP.
 
-The value of `<path>` must be in canonical form. That is it must not:
+When a `<path>` starts with a double quote (`"`), it is a C-style quoted
+string, where the complete filename is enclosed in a pair of double
+quotes and escape sequences are used. Certain characters must be escaped
+by preceding them with a backslash: `LF` is written as `\n`, backslash
+as `\\`, and double quote as `\"`. All filenames can be represented as
+quoted strings.
+
+A `<path>` must use UNIX-style directory separators (forward slash `/`)
+and its value must be in canonical form. That is it must not:
 
 * contain an empty directory component (e.g. `foo//bar` is invalid),
 * end with a directory separator (e.g. `foo/` is invalid),
-- 
2.44.0




* [PATCH v5 6/8] fast-import: document C-style escapes for paths
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (4 preceding siblings ...)
  2024-04-14  1:12         ` [PATCH v5 5/8] fast-import: improve documentation for path quoting Thalia Archibald
@ 2024-04-14  1:12         ` Thalia Archibald
  2024-04-14  1:12         ` [PATCH v5 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
                           ` (2 subsequent siblings)
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

Simply saying “C-style” string quoting is imprecise, as only a subset of
C escapes is supported. Document the exact escapes.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
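To make the documented escape set concrete, here is a minimal decoder
sketch. It is not Git's `unquote_c_style()`; `sketch_unquote` is a
hypothetical name, and the sketch assumes the caller supplies an output
buffer at least as long as the input.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: decode a quoted <path> starting at its opening
 * double quote. Returns the number of bytes written to out, or -1 if
 * the input is malformed. On success, *end points past the closing
 * quote.
 */
static long sketch_unquote(const char *in, char *out, const char **end)
{
	size_t n = 0;

	assert(in && out && end);
	if (*in++ != '"')
		return -1;
	for (;;) {
		char c = *in++;
		int v, i;

		if (c == '\0')
			return -1;	/* unclosed quote */
		if (c == '"')
			break;
		if (c != '\\') {
			out[n++] = c;
			continue;
		}
		c = *in++;
		switch (c) {
		case 'a': out[n++] = '\a'; break;
		case 'b': out[n++] = '\b'; break;
		case 'f': out[n++] = '\f'; break;
		case 'n': out[n++] = '\n'; break;
		case 'r': out[n++] = '\r'; break;
		case 't': out[n++] = '\t'; break;
		case 'v': out[n++] = '\v'; break;
		case '\\': case '"': out[n++] = c; break;
		case '0': case '1': case '2': case '3':
			/* exactly three octal digits, \000 to \377 */
			v = c - '0';
			for (i = 0; i < 2; i++) {
				c = *in++;
				if (c < '0' || c > '7')
					return -1;
				v = v * 8 + (c - '0');
			}
			out[n++] = (char)v;
			break;
		default:
			return -1;	/* e.g. hex escapes like \xff */
		}
	}
	*end = in;
	return (long)n;
}
```

Fed the path from the new tests, `"\150\151\056\143"`, this yields the
four bytes `hi.c`, while `"hello\xff"` and an unclosed `"hello.c` are
both rejected.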
 Documentation/git-fast-import.txt |  6 +++++-
 t/t9300-fast-import.sh            | 10 ++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 1882758b8a..c6082c3b49 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -643,7 +643,11 @@ When a `<path>` starts with a double quote (`"`), it is a C-style quoted
 string, where the complete filename is enclosed in a pair of double
 quotes and escape sequences are used. Certain characters must be escaped
 by preceding them with a backslash: `LF` is written as `\n`, backslash
-as `\\`, and double quote as `\"`. All filenames can be represented as
+as `\\`, and double quote as `\"`. Some characters may optionally be
+written with escape sequences: `\a` for bell, `\b` for backspace, `\f`
+for form feed, `\n` for line feed, `\r` for carriage return, `\t` for
+horizontal tab, and `\v` for vertical tab. Any byte can be written with
+3-digit octal codes (e.g., `\033`). All filenames can be represented as
 quoted strings.
 
 A `<path>` must use UNIX-style directory separators (forward slash `/`)
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 13f98e6688..5cde8f8d01 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3189,8 +3189,9 @@ test_path_eol_success () {
 	'
 }
 
-test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
-test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+test_path_eol_success 'quoted spaces'   '" hello world.c "'  ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '    ' hello world.c '
+test_path_eol_success 'octal escapes'   '"\150\151\056\143"' 'hi.c'
 
 #
 # Valid paths before a space: filecopy (source) and filerename (source).
@@ -3256,8 +3257,9 @@ test_path_space_success () {
 	'
 }
 
-test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
-test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+test_path_space_success 'quoted spaces'      '" hello world.c "'  ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'      'hello_world.c'
+test_path_space_success 'octal escapes'      '"\150\151\056\143"' 'hi.c'
 
 #
 # Test a single commit change with an invalid path. Run it with all occurrences
-- 
2.44.0




* [PATCH v5 7/8] fast-import: forbid escaped NUL in paths
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (5 preceding siblings ...)
  2024-04-14  1:12         ` [PATCH v5 6/8] fast-import: document C-style escapes for paths Thalia Archibald
@ 2024-04-14  1:12         ` Thalia Archibald
  2024-04-14  1:12         ` [PATCH v5 8/8] fast-import: make comments more precise Thalia Archibald
  2024-04-15  7:06         ` [PATCH v5 0/8] fast-import: tighten parsing of paths Patrick Steinhardt
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.

When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.

fast-import still has other issues with NUL, but those will be addressed
later.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
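The check added here relies on a length-tracked buffer being able to
hold a NUL that `strlen()` stops at. A minimal sketch of that idea
follows; `struct sketch_buf` and `sketch_has_embedded_nul` are
hypothetical stand-ins for illustration, not the real strbuf API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the invariant the check depends on:
 * buf holds len bytes of data followed by a terminating NUL.
 */
struct sketch_buf {
	const char *buf;
	size_t len;
};

/* Returns 1 if the len data bytes contain a NUL, 0 otherwise. */
static int sketch_has_embedded_nul(const struct sketch_buf *sb)
{
	assert(sb && sb->buf[sb->len] == '\0');
	/* strlen() stops at the first NUL, so a mismatch means one is embedded */
	return strlen(sb->buf) != sb->len;
}
```

For the six data bytes `hello` followed by NUL, as would result from
unquoting `"hello\000"`, `strlen()` reports 5 while the tracked length
is 6, so the path is rejected rather than silently truncated.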
 Documentation/git-fast-import.txt | 1 +
 builtin/fast-import.c             | 2 ++
 t/t9300-fast-import.sh            | 1 +
 3 files changed, 4 insertions(+)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index c6082c3b49..8b6dde45f1 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -661,6 +661,7 @@ and its value must be in canonical form. That is it must not:
 
 The root of the tree can be represented by an empty string as `<path>`.
 
+`<path>` cannot contain NUL, either literally or escaped as `\000`.
 It is recommended that `<path>` always be encoded using UTF-8.
 
 `filedelete`
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 832d0055f9..419ffdcdb5 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2270,6 +2270,8 @@ static void parse_path(struct strbuf *sb, const char *p, const char **endp,
 	if (*p == '"') {
 		if (unquote_c_style(sb, p, endp))
 			die("Invalid %s: %s", field, command_buf.buf);
+		if (strlen(sb->buf) != sb->len)
+			die("NUL in %s: %s", field, command_buf.buf);
 	} else {
 		/*
 		 * Unless we are parsing the last field of a line,
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 5cde8f8d01..1e68426852 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3300,6 +3300,7 @@ test_path_base_fail () {
 	local change="$1" prefix="$2" field="$3" suffix="$4"
 	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
 	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
+	test_path_fail "$change" "escaped NUL in quoted $field"    "$prefix" '"hello\000"' "$suffix" "NUL in $field"
 }
 test_path_eol_quoted_fail () {
 	local change="$1" prefix="$2" field="$3"
-- 
2.44.0




* [PATCH v5 8/8] fast-import: make comments more precise
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (6 preceding siblings ...)
  2024-04-14  1:12         ` [PATCH v5 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
@ 2024-04-14  1:12         ` Thalia Archibald
  2024-04-15  7:06         ` [PATCH v5 0/8] fast-import: tighten parsing of paths Patrick Steinhardt
  8 siblings, 0 replies; 84+ messages in thread
From: Thalia Archibald @ 2024-04-14  1:12 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Patrick Steinhardt, Chris Torek, Elijah Newren,
	Thalia Archibald

The comment on the `idnum` mark reference parser is somewhat imprecise.
The comment on parse_mark_ref_space() became out of sync with the
behavior in e814c39c2f (fast-import: refactor parsing of spaces,
2014-06-18).

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
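Note for readers: the reworded comments describe an out-parameter
convention, where the parser returns its result and advances the
caller's pointer past what it consumed. A minimal sketch of that
convention follows. `sketch_parse_mark_space()` and
`sketch_mark_space_consumed()` are hypothetical names for illustration;
unlike the real code, the sketch reports failure with a return value
instead of dying.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not Git's parse_mark_ref_space): parse
 * ":<digits> " starting at *p. On success, store the mark in *mark,
 * update *p to point to the first character after the space, and
 * return 0. On failure, leave *p untouched and return -1.
 */
int sketch_parse_mark_space(const char **p, uintmax_t *mark)
{
	const char *s;
	char *end;
	uintmax_t value;

	assert(p && *p && mark);
	s = *p;
	/* Require ':' followed by a digit; strtoumax alone would accept signs. */
	if (*s != ':' || s[1] < '0' || s[1] > '9')
		return -1;
	errno = 0;
	value = strtoumax(s + 1, &end, 10);
	if (errno || *end != ' ')
		return -1;
	*mark = value;
	*p = end + 1;	/* first character after the space */
	return 0;
}

/*
 * Convenience wrapper: returns how many characters of line were
 * consumed by the mark and its trailing space, or -1 on failure.
 */
long sketch_mark_space_consumed(const char *line)
{
	const char *p = line;
	uintmax_t mark;

	if (sketch_parse_mark_space(&p, &mark))
		return -1;
	return (long)(p - line);
}
```

For the input ":12 file.c" the pointer ends up on the "f", four
characters in, which is what "the first character after the space"
means in the updated comment.
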
 builtin/fast-import.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 419ffdcdb5..dc5a9d32dd 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2210,7 +2210,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2243,8 +2243,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
-- 
2.44.0




* Re: [PATCH v5 0/8] fast-import: tighten parsing of paths
  2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
                           ` (7 preceding siblings ...)
  2024-04-14  1:12         ` [PATCH v5 8/8] fast-import: make comments more precise Thalia Archibald
@ 2024-04-15  7:06         ` Patrick Steinhardt
  2024-04-15 17:07           ` Junio C Hamano
  8 siblings, 1 reply; 84+ messages in thread
From: Patrick Steinhardt @ 2024-04-15  7:06 UTC (permalink / raw)
  To: Thalia Archibald; +Cc: git, Junio C Hamano, Chris Torek, Elijah Newren

On Sun, Apr 14, 2024 at 01:11:32AM +0000, Thalia Archibald wrote:
> > fast-import has subtle differences in how it parses file paths between each
> > occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> > which could lead to silent data corruption. A particularly bad case is when a
> > front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> > supported), it would be treated as literal bytes instead of a quoted string.
> >
> > Bring path parsing into line with the documented behavior and improve
> > documentation to fill in missing details.
> 
> Changes since v4:
> * Refine C comments and parameter name.
> 
> Thalia

I had another cursory read of this patch series, relying on the range
diffs for the most part. In any case, this version looks good to me.
Thanks!

Patrick



* Re: [PATCH v5 0/8] fast-import: tighten parsing of paths
  2024-04-15  7:06         ` [PATCH v5 0/8] fast-import: tighten parsing of paths Patrick Steinhardt
@ 2024-04-15 17:07           ` Junio C Hamano
  0 siblings, 0 replies; 84+ messages in thread
From: Junio C Hamano @ 2024-04-15 17:07 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Thalia Archibald, git, Chris Torek, Elijah Newren

Patrick Steinhardt <ps@pks.im> writes:

> I had another cursory read of this patch series, relying on the range
> diffs for the most part. In any case, this version looks good to me.
> Thanks!

Likewise.  Let's mark it for 'next'.



Thread overview: 84+ messages
2024-03-22  0:03 [PATCH 0/6] fast-import: tighten parsing of paths Thalia Archibald
2024-03-22  0:03 ` [PATCH 1/6] " Thalia Archibald
2024-03-22  0:11   ` Thalia Archibald
2024-03-28  8:21   ` Patrick Steinhardt
     [not found]     ` <E01C617F-3720-42C0-83EE-04BB01643C86@archibald.dev>
2024-04-01  9:05       ` Thalia Archibald
2024-03-22  0:03 ` [PATCH 2/6] fast-import: directly use strbufs for paths Thalia Archibald
2024-03-28  8:21   ` Patrick Steinhardt
2024-03-22  0:03 ` [PATCH 3/6] fast-import: release unfreed strbufs Thalia Archibald
2024-03-28  8:21   ` Patrick Steinhardt
2024-04-01  9:06     ` Thalia Archibald
2024-03-22  0:03 ` [PATCH 4/6] fast-import: remove dead strbuf Thalia Archibald
2024-03-28  8:21   ` Patrick Steinhardt
2024-03-22  0:03 ` [PATCH 5/6] fast-import: document C-style escapes for paths Thalia Archibald
2024-03-28  8:21   ` Patrick Steinhardt
2024-04-01  9:06     ` Thalia Archibald
2024-03-22  0:03 ` [PATCH 6/6] fast-import: forbid escaped NUL in paths Thalia Archibald
2024-04-01  9:02 ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
2024-04-01  9:02   ` [PATCH v2 1/8] fast-import: tighten path unquoting Thalia Archibald
2024-04-10  6:27     ` Patrick Steinhardt
2024-04-10  8:18       ` Chris Torek
2024-04-10  8:44         ` Thalia Archibald
2024-04-10  8:51           ` Chris Torek
2024-04-10  9:14             ` Thalia Archibald
2024-04-10  9:42               ` Patrick Steinhardt
2024-04-10  9:16             ` Thalia Archibald
2024-04-10  9:12       ` Thalia Archibald
2024-04-01  9:03   ` [PATCH v2 2/8] fast-import: directly use strbufs for paths Thalia Archibald
2024-04-10  6:27     ` Patrick Steinhardt
2024-04-10 10:07       ` Thalia Archibald
2024-04-10 10:18         ` Patrick Steinhardt
2024-04-01  9:03   ` [PATCH v2 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
2024-04-10  6:27     ` Patrick Steinhardt
2024-04-01  9:03   ` [PATCH v2 4/8] fast-import: remove dead strbuf Thalia Archibald
2024-04-01  9:03   ` [PATCH v2 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
2024-04-01  9:03   ` [PATCH v2 6/8] fast-import: document C-style escapes for paths Thalia Archibald
2024-04-01  9:03   ` [PATCH v2 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
2024-04-01  9:03   ` [PATCH v2 8/8] fast-import: make comments more precise Thalia Archibald
2024-04-07 21:19   ` [PATCH v2 0/8] fast-import: tighten parsing of paths Thalia Archibald
2024-04-07 23:46     ` Eric Sunshine
2024-04-08  6:25       ` Patrick Steinhardt
2024-04-08  7:15         ` Thalia Archibald
2024-04-08  9:07           ` Patrick Steinhardt
2024-04-08 14:52         ` Junio C Hamano
2024-04-10  9:54   ` [PATCH v3 " Thalia Archibald
2024-04-10  9:55     ` [PATCH v3 1/8] fast-import: tighten path unquoting Thalia Archibald
2024-04-10  9:55     ` [PATCH v3 2/8] fast-import: directly use strbufs for paths Thalia Archibald
2024-04-10  9:55     ` [PATCH v3 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
2024-04-11 19:59       ` Junio C Hamano
2024-04-10  9:55     ` [PATCH v3 4/8] fast-import: remove dead strbuf Thalia Archibald
2024-04-11 19:53       ` Junio C Hamano
2024-04-10  9:55     ` [PATCH v3 5/8] fast-import: improve documentation for unquoted paths Thalia Archibald
2024-04-11 19:51       ` Junio C Hamano
2024-04-10  9:56     ` [PATCH v3 6/8] fast-import: document C-style escapes for paths Thalia Archibald
2024-04-10 18:28       ` Junio C Hamano
2024-04-10 22:50         ` Thalia Archibald
2024-04-11  5:32           ` Junio C Hamano
2024-04-11  9:14             ` Patrick Steinhardt
2024-04-10  9:56     ` [PATCH v3 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
2024-04-10 18:51       ` Junio C Hamano
2024-04-10  9:56     ` [PATCH v3 8/8] fast-import: make comments more precise Thalia Archibald
2024-04-10 19:21       ` Junio C Hamano
2024-04-12  8:01     ` [PATCH v4 0/8] fast-import: tighten parsing of paths Thalia Archibald
2024-04-12  8:02       ` [PATCH v4 1/8] fast-import: tighten path unquoting Thalia Archibald
2024-04-12 16:34         ` Junio C Hamano
2024-04-13  0:07           ` Thalia Archibald
2024-04-13 18:33             ` Junio C Hamano
2024-04-12  8:03       ` [PATCH v4 2/8] fast-import: directly use strbufs for paths Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 4/8] fast-import: remove dead strbuf Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 5/8] fast-import: improve documentation for path quoting Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 6/8] fast-import: document C-style escapes for paths Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
2024-04-12  8:03       ` [PATCH v4 8/8] fast-import: make comments more precise Thalia Archibald
2024-04-14  1:11       ` [PATCH v5 0/8] fast-import: tighten parsing of paths Thalia Archibald
2024-04-14  1:11         ` [PATCH v5 1/8] fast-import: tighten path unquoting Thalia Archibald
2024-04-14  1:11         ` [PATCH v5 2/8] fast-import: directly use strbufs for paths Thalia Archibald
2024-04-14  1:11         ` [PATCH v5 3/8] fast-import: allow unquoted empty path for root Thalia Archibald
2024-04-14  1:11         ` [PATCH v5 4/8] fast-import: remove dead strbuf Thalia Archibald
2024-04-14  1:12         ` [PATCH v5 5/8] fast-import: improve documentation for path quoting Thalia Archibald
2024-04-14  1:12         ` [PATCH v5 6/8] fast-import: document C-style escapes for paths Thalia Archibald
2024-04-14  1:12         ` [PATCH v5 7/8] fast-import: forbid escaped NUL in paths Thalia Archibald
2024-04-14  1:12         ` [PATCH v5 8/8] fast-import: make comments more precise Thalia Archibald
2024-04-15  7:06         ` [PATCH v5 0/8] fast-import: tighten parsing of paths Patrick Steinhardt
2024-04-15 17:07           ` Junio C Hamano
