All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Parallel submodule status
@ 2024-03-05  1:21 Atneya Nair
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-05  1:21 UTC (permalink / raw)
  To: git; +Cc: gitster, jeffhost, me, nasamuffin, Atneya Nair

This set tries to speed up git status (and other commands which examine
the tree state) for repositories with many submodules.

Currently, submodules, unlike regular files, are examined serially
despite the expense of this operation. Fsmonitor also doesn't support
submodules, so there isn't a great alternative here.

Atneya Nair (3):
  Make read_gitfile and resolve_gitfile thread safe
  Make ce_compare_gitlink thread-safe
  Preload submodule state in refresh_index

 builtin/init-db.c   |  7 ++++---
 builtin/rev-parse.c |  4 +++-
 config.c            |  3 ++-
 config.h            |  2 +-
 preload-index.c     | 25 ++++++++++++++++++++++---
 read-cache-ll.h     |  1 +
 read-cache.c        |  3 +++
 refs.c              |  9 +++++++++
 repository.c        |  9 +++++----
 setup.c             | 36 +++++++++++++++++++++++++-----------
 setup.h             |  7 +++----
 submodule.c         | 32 +++++++++++++++++++++++---------
 worktree.c          | 27 +++++++++++++--------------
 13 files changed, 114 insertions(+), 51 deletions(-)


base-commit: f41f85c9ec8d4d46de0fd5fded88db94d3ec8c11
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-05  1:21 [RFC PATCH 0/3] Parallel submodule status Atneya Nair
@ 2024-03-05  1:21 ` Atneya Nair
  2024-03-05  2:22   ` Junio C Hamano
                     ` (2 more replies)
  2024-03-05  1:21 ` [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe Atneya Nair
  2024-03-05  1:21 ` [RFC PATCH 3/3] Preload submodule state in refresh_index Atneya Nair
  2 siblings, 3 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-05  1:21 UTC (permalink / raw)
  To: git; +Cc: gitster, jeffhost, me, nasamuffin, Atneya Nair

Continue the work in commit: 3d7747e318532a36a263c61cdf92f2decb6424ff
to remove unsafe shared buffer usage in read_gitfile_gently.

Migrate read_gitfile_gently and resolve_gitfile_gently to take strbuf
out params to allocate their return values into, rather than returning
a view into a shared buffer.

Leave the shared buffer in case a caller passes null for this param (for
cases we haven't cleaned up yet).

Migrate callers of resolve_gitfile to resolve_gitfile_gently.

Signed-off-by: Atneya Nair <atneya@google.com>
---

Notes:
    Open questions:
     - Is checking the return value of read_gitfile necessary if on error,
       we are supposed to die, or set the error field to a non-zero value?
     - Should we clean up the other call-sites of read_gitfile?

 builtin/init-db.c   |  7 ++++---
 builtin/rev-parse.c |  4 +++-
 repository.c        |  9 +++++----
 setup.c             | 36 +++++++++++++++++++++++++-----------
 setup.h             |  7 +++----
 submodule.c         | 32 +++++++++++++++++++++++---------
 worktree.c          | 27 +++++++++++++--------------
 7 files changed, 76 insertions(+), 46 deletions(-)

diff --git a/builtin/init-db.c b/builtin/init-db.c
index 0170469b84..9135d07a0d 100644
--- a/builtin/init-db.c
+++ b/builtin/init-db.c
@@ -198,11 +198,11 @@ int cmd_init_db(int argc, const char **argv, const char *prefix)
 	 */
 	if (real_git_dir) {
 		int err;
-		const char *p;
 		struct strbuf sb = STRBUF_INIT;
+		struct strbuf gitfile = STRBUF_INIT;
 
-		p = read_gitfile_gently(git_dir, &err);
-		if (p && get_common_dir(&sb, p)) {
+		read_gitfile_gently(git_dir, &err, &gitfile);
+		if (!err && get_common_dir(&sb, gitfile.buf)) {
 			struct strbuf mainwt = STRBUF_INIT;
 
 			strbuf_addbuf(&mainwt, &sb);
@@ -213,6 +213,7 @@ int cmd_init_db(int argc, const char **argv, const char *prefix)
 			git_dir = strbuf_detach(&sb, NULL);
 		}
 		strbuf_release(&sb);
+		strbuf_release(&gitfile);
 	}
 
 	if (is_bare_repository_cfg < 0)
diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
index d08987646a..e1db6b3231 100644
--- a/builtin/rev-parse.c
+++ b/builtin/rev-parse.c
@@ -728,12 +728,14 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix)
 			}
 			if (!strcmp(arg, "--resolve-git-dir")) {
 				const char *gitdir = argv[++i];
+				struct strbuf resolved_gitdir = STRBUF_INIT;
 				if (!gitdir)
 					die(_("--resolve-git-dir requires an argument"));
-				gitdir = resolve_gitdir(gitdir);
+				gitdir = resolve_gitdir_gently(gitdir, NULL, &resolved_gitdir);
 				if (!gitdir)
 					die(_("not a gitdir '%s'"), argv[i]);
 				puts(gitdir);
+				strbuf_release(&resolved_gitdir);
 				continue;
 			}
 		}
diff --git a/repository.c b/repository.c
index 7aacb51b65..3ca6dbcf16 100644
--- a/repository.c
+++ b/repository.c
@@ -118,7 +118,7 @@ static int repo_init_gitdir(struct repository *repo, const char *gitdir)
 	int ret = 0;
 	int error = 0;
 	char *abspath = NULL;
-	const char *resolved_gitdir;
+	struct strbuf resolved_gitdir = STRBUF_INIT;
 	struct set_gitdir_args args = { NULL };
 
 	abspath = real_pathdup(gitdir, 0);
@@ -128,15 +128,16 @@ static int repo_init_gitdir(struct repository *repo, const char *gitdir)
 	}
 
 	/* 'gitdir' must reference the gitdir directly */
-	resolved_gitdir = resolve_gitdir_gently(abspath, &error);
-	if (!resolved_gitdir) {
+	resolve_gitdir_gently(abspath, &error, &resolved_gitdir);
+	if (error) {
 		ret = -1;
 		goto out;
 	}
 
-	repo_set_gitdir(repo, resolved_gitdir, &args);
+	repo_set_gitdir(repo, resolved_gitdir.buf, &args);
 
 out:
+	strbuf_release(&resolved_gitdir);
 	free(abspath);
 	return ret;
 }
diff --git a/setup.c b/setup.c
index b69b1cbc2a..2e118cf216 100644
--- a/setup.c
+++ b/setup.c
@@ -397,14 +397,17 @@ int is_nonbare_repository_dir(struct strbuf *path)
 	int ret = 0;
 	int gitfile_error;
 	size_t orig_path_len = path->len;
+	struct strbuf gitfile = STRBUF_INIT;
+
 	assert(orig_path_len != 0);
 	strbuf_complete(path, '/');
 	strbuf_addstr(path, ".git");
-	if (read_gitfile_gently(path->buf, &gitfile_error) || is_git_directory(path->buf))
+	if (read_gitfile_gently(path->buf, &gitfile_error, &gitfile) || is_git_directory(path->buf))
 		ret = 1;
 	if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
 	    gitfile_error == READ_GITFILE_ERR_READ_FAILED)
 		ret = 1;
+	strbuf_release(&gitfile);
 	strbuf_setlen(path, orig_path_len);
 	return ret;
 }
@@ -830,7 +833,8 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
 
 /*
  * Try to read the location of the git directory from the .git file,
- * return path to git directory if found. The return value comes from
+ * return path to git directory if found. If passed a valid strbuf, the return
+ * value is is a ptr to within the buffer. If strbuf is null, the return value comes from
  * a shared buffer.
  *
  * On failure, if return_error_code is not NULL, return_error_code
@@ -838,7 +842,7 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
  * return_error_code is NULL the function will die instead (for most
  * cases).
  */
-const char *read_gitfile_gently(const char *path, int *return_error_code)
+const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf* result_buf)
 {
 	const int max_file_size = 1 << 20;  /* 1MB */
 	int error_code = 0;
@@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 	struct stat st;
 	int fd;
 	ssize_t len;
-	static struct strbuf realpath = STRBUF_INIT;
+	static struct strbuf shared = STRBUF_INIT;
+	if (!result_buf) {
+		result_buf = &shared;
+	}
 
 	if (stat(path, &st)) {
 		/* NEEDSWORK: discern between ENOENT vs other errors */
@@ -900,8 +907,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 		goto cleanup_return;
 	}
 
-	strbuf_realpath(&realpath, dir, 1);
-	path = realpath.buf;
+	strbuf_realpath(result_buf, dir, 1);
+	path = result_buf->buf;
+	// TODO is this valid?
+	if (!path) die(_("Unexpected null from realpath '%s'"), dir);
 
 cleanup_return:
 	if (return_error_code)
@@ -1316,12 +1325,13 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
 		int offset = dir->len, error_code = 0;
 		char *gitdir_path = NULL;
 		char *gitfile = NULL;
+		struct strbuf gitdirenvbuf = STRBUF_INIT;
 
 		if (offset > min_offset)
 			strbuf_addch(dir, '/');
 		strbuf_addstr(dir, DEFAULT_GIT_DIR_ENVIRONMENT);
 		gitdirenv = read_gitfile_gently(dir->buf, die_on_error ?
-						NULL : &error_code);
+						NULL : &error_code, &gitdirenvbuf);
 		if (!gitdirenv) {
 			if (die_on_error ||
 			    error_code == READ_GITFILE_ERR_NOT_A_FILE) {
@@ -1330,8 +1340,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
 					gitdirenv = DEFAULT_GIT_DIR_ENVIRONMENT;
 					gitdir_path = xstrdup(dir->buf);
 				}
-			} else if (error_code != READ_GITFILE_ERR_STAT_FAILED)
+			} else if (error_code != READ_GITFILE_ERR_STAT_FAILED) {
+				strbuf_release(&gitdirenvbuf);
 				return GIT_DIR_INVALID_GITFILE;
+			}
 		} else
 			gitfile = xstrdup(dir->buf);
 		/*
@@ -1365,9 +1377,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
 			 */
 			free(gitdir_path);
 			free(gitfile);
-
+			strbuf_release(&gitdirenvbuf);
 			return ret;
 		}
+		strbuf_release(&gitdirenvbuf);
 
 		if (is_git_directory(dir->buf)) {
 			trace2_data_string("setup", NULL, "implicit-bare-repository", dir->buf);
@@ -1692,11 +1705,12 @@ const char *setup_git_directory(void)
 	return setup_git_directory_gently(NULL);
 }
 
-const char *resolve_gitdir_gently(const char *suspect, int *return_error_code)
+const char *resolve_gitdir_gently(const char *suspect, int *return_error_code,
+		struct strbuf* return_buf)
 {
 	if (is_git_directory(suspect))
 		return suspect;
-	return read_gitfile_gently(suspect, return_error_code);
+	return read_gitfile_gently(suspect, return_error_code, return_buf);
 }
 
 /* if any standard file descriptor is missing open it to /dev/null */
diff --git a/setup.h b/setup.h
index 3599aec93c..cf5a63762b 100644
--- a/setup.h
+++ b/setup.h
@@ -36,10 +36,9 @@ int is_nonbare_repository_dir(struct strbuf *path);
 #define READ_GITFILE_ERR_NOT_A_REPO 7
 #define READ_GITFILE_ERR_TOO_LARGE 8
 void read_gitfile_error_die(int error_code, const char *path, const char *dir);
-const char *read_gitfile_gently(const char *path, int *return_error_code);
-#define read_gitfile(path) read_gitfile_gently((path), NULL)
-const char *resolve_gitdir_gently(const char *suspect, int *return_error_code);
-#define resolve_gitdir(path) resolve_gitdir_gently((path), NULL)
+const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf *buf);
+#define read_gitfile(path) read_gitfile_gently((path), NULL, NULL)
+const char *resolve_gitdir_gently(const char *suspect, int *return_error_code, struct strbuf *buf);
 
 void setup_work_tree(void);
 
diff --git a/submodule.c b/submodule.c
index 213da79f66..455444321b 100644
--- a/submodule.c
+++ b/submodule.c
@@ -316,9 +316,10 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
 	int ret = 0;
 	char *gitdir = xstrfmt("%s/.git", path);
 
-	if (resolve_gitdir_gently(gitdir, return_error_code))
+	struct strbuf resolved_gitdir_buf = STRBUF_INIT;
+	if (resolve_gitdir_gently(gitdir, return_error_code, &resolved_gitdir_buf))
 		ret = 1;
-
+	strbuf_release(&resolved_gitdir_buf);
 	free(gitdir);
 	return ret;
 }
@@ -1879,22 +1880,25 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
 	struct strbuf buf = STRBUF_INIT;
+	struct strbuf gitdirbuf = STRBUF_INIT;
 	FILE *fp;
 	unsigned dirty_submodule = 0;
 	const char *git_dir;
 	int ignore_cp_exit_code = 0;
 
 	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
+	git_dir = read_gitfile_gently(buf.buf, NULL, &gitdirbuf);
 	if (!git_dir)
 		git_dir = buf.buf;
 	if (!is_git_directory(git_dir)) {
 		if (is_directory(git_dir))
 			die(_("'%s' not recognized as a git repository"), git_dir);
 		strbuf_release(&buf);
+		strbuf_release(&gitdirbuf);
 		/* The submodule is not checked out, so it is not modified */
 		return 0;
 	}
+		strbuf_release(&gitdirbuf);
 	strbuf_reset(&buf);
 
 	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
@@ -1958,15 +1962,16 @@ int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
 	struct strbuf buf = STRBUF_INIT;
-	const char *git_dir;
+	struct strbuf gitfilebuf = STRBUF_INIT;
 
 	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir) {
+	read_gitfile_gently(buf.buf, NULL, &gitfilebuf);
+	if (!gitfilebuf.buf) {
 		strbuf_release(&buf);
 		return 0;
 	}
 	strbuf_release(&buf);
+	strbuf_release(&gitfilebuf);
 
 	/* Now test that all nested submodules use a gitfile too */
 	strvec_pushl(&cp.args,
@@ -2276,6 +2281,7 @@ static void relocate_single_git_dir_into_superproject(const char *path,
 {
 	char *old_git_dir = NULL, *real_old_git_dir = NULL, *real_new_git_dir = NULL;
 	struct strbuf new_gitdir = STRBUF_INIT;
+	struct strbuf gitfilebuf = STRBUF_INIT;
 	const struct submodule *sub;
 
 	if (submodule_uses_worktrees(path))
@@ -2283,9 +2289,12 @@ static void relocate_single_git_dir_into_superproject(const char *path,
 		      "more than one worktree not supported"), path);
 
 	old_git_dir = xstrfmt("%s/.git", path);
-	if (read_gitfile(old_git_dir))
+	if (read_gitfile_gently(old_git_dir, NULL, &gitfilebuf)) {
 		/* If it is an actual gitfile, it doesn't need migration. */
+		strbuf_release(&gitfilebuf);
 		return;
+	}
+	strbuf_release(&gitfilebuf);
 
 	real_old_git_dir = real_pathdup(old_git_dir, 1);
 
@@ -2343,8 +2352,9 @@ void absorb_git_dir_into_superproject(const char *path,
 	int err_code;
 	const char *sub_git_dir;
 	struct strbuf gitdir = STRBUF_INIT;
+	struct strbuf resolved_gitdir_buf = STRBUF_INIT;
 	strbuf_addf(&gitdir, "%s/.git", path);
-	sub_git_dir = resolve_gitdir_gently(gitdir.buf, &err_code);
+	sub_git_dir = resolve_gitdir_gently(gitdir.buf, &err_code, &resolved_gitdir_buf);
 
 	/* Not populated? */
 	if (!sub_git_dir) {
@@ -2385,6 +2395,8 @@ void absorb_git_dir_into_superproject(const char *path,
 		free(real_sub_git_dir);
 		free(real_common_git_dir);
 	}
+
+	strbuf_release(&resolved_gitdir_buf);
 	strbuf_release(&gitdir);
 
 	absorb_git_dir_into_superproject_recurse(path, super_prefix);
@@ -2484,17 +2496,19 @@ int submodule_to_gitdir(struct strbuf *buf, const char *submodule)
 	const struct submodule *sub;
 	const char *git_dir;
 	int ret = 0;
+	struct strbuf gitfilebuf = STRBUF_INIT;
 
 	strbuf_reset(buf);
 	strbuf_addstr(buf, submodule);
 	strbuf_complete(buf, '/');
 	strbuf_addstr(buf, ".git");
 
-	git_dir = read_gitfile(buf->buf);
+	git_dir = read_gitfile_gently(buf->buf, NULL, &gitfilebuf);
 	if (git_dir) {
 		strbuf_reset(buf);
 		strbuf_addstr(buf, git_dir);
 	}
+	strbuf_release(&gitfilebuf);
 	if (!is_git_directory(buf->buf)) {
 		sub = submodule_from_path(the_repository, null_oid(),
 					  submodule);
diff --git a/worktree.c b/worktree.c
index b02a05a74a..a6f125c8da 100644
--- a/worktree.c
+++ b/worktree.c
@@ -309,7 +309,7 @@ int validate_worktree(const struct worktree *wt, struct strbuf *errmsg,
 {
 	struct strbuf wt_path = STRBUF_INIT;
 	struct strbuf realpath = STRBUF_INIT;
-	char *path = NULL;
+	struct strbuf gitfile = STRBUF_INIT;
 	int err, ret = -1;
 
 	strbuf_addf(&wt_path, "%s/.git", wt->path);
@@ -353,21 +353,20 @@ int validate_worktree(const struct worktree *wt, struct strbuf *errmsg,
 		goto done;
 	}
 
-	path = xstrdup_or_null(read_gitfile_gently(wt_path.buf, &err));
-	if (!path) {
+	if (!read_gitfile_gently(wt_path.buf, &err, &gitfile)) {
 		strbuf_addf_gently(errmsg, _("'%s' is not a .git file, error code %d"),
 				   wt_path.buf, err);
 		goto done;
 	}
 
 	strbuf_realpath(&realpath, git_common_path("worktrees/%s", wt->id), 1);
-	ret = fspathcmp(path, realpath.buf);
+	ret = fspathcmp(gitfile.buf, realpath.buf);
 
 	if (ret)
 		strbuf_addf_gently(errmsg, _("'%s' does not point back to '%s'"),
 				   wt->path, git_common_path("worktrees/%s", wt->id));
 done:
-	free(path);
+	strbuf_release(&gitfile);
 	strbuf_release(&wt_path);
 	strbuf_release(&realpath);
 	return ret;
@@ -567,7 +566,7 @@ static void repair_gitfile(struct worktree *wt,
 {
 	struct strbuf dotgit = STRBUF_INIT;
 	struct strbuf repo = STRBUF_INIT;
-	char *backlink;
+	struct strbuf backlink = STRBUF_INIT;
 	const char *repair = NULL;
 	int err;
 
@@ -582,13 +581,13 @@ static void repair_gitfile(struct worktree *wt,
 
 	strbuf_realpath(&repo, git_common_path("worktrees/%s", wt->id), 1);
 	strbuf_addf(&dotgit, "%s/.git", wt->path);
-	backlink = xstrdup_or_null(read_gitfile_gently(dotgit.buf, &err));
+	read_gitfile_gently(dotgit.buf, &err, &backlink);
 
 	if (err == READ_GITFILE_ERR_NOT_A_FILE)
 		fn(1, wt->path, _(".git is not a file"), cb_data);
 	else if (err)
 		repair = _(".git file broken");
-	else if (fspathcmp(backlink, repo.buf))
+	else if (fspathcmp(backlink.buf, repo.buf))
 		repair = _(".git file incorrect");
 
 	if (repair) {
@@ -596,7 +595,7 @@ static void repair_gitfile(struct worktree *wt,
 		write_file(dotgit.buf, "gitdir: %s", repo.buf);
 	}
 
-	free(backlink);
+	strbuf_release(&backlink);
 	strbuf_release(&repo);
 	strbuf_release(&dotgit);
 }
@@ -685,7 +684,7 @@ void repair_worktree_at_path(const char *path,
 	struct strbuf realdotgit = STRBUF_INIT;
 	struct strbuf gitdir = STRBUF_INIT;
 	struct strbuf olddotgit = STRBUF_INIT;
-	char *backlink = NULL;
+	struct strbuf backlink = STRBUF_INIT;
 	const char *repair = NULL;
 	int err;
 
@@ -701,12 +700,12 @@ void repair_worktree_at_path(const char *path,
 		goto done;
 	}
 
-	backlink = xstrdup_or_null(read_gitfile_gently(realdotgit.buf, &err));
+	read_gitfile_gently(realdotgit.buf, &err, &backlink);
 	if (err == READ_GITFILE_ERR_NOT_A_FILE) {
 		fn(1, realdotgit.buf, _("unable to locate repository; .git is not a file"), cb_data);
 		goto done;
 	} else if (err == READ_GITFILE_ERR_NOT_A_REPO) {
-		if (!(backlink = infer_backlink(realdotgit.buf))) {
+		if (!(backlink.buf = infer_backlink(realdotgit.buf))) {
 			fn(1, realdotgit.buf, _("unable to locate repository; .git file does not reference a repository"), cb_data);
 			goto done;
 		}
@@ -715,7 +714,7 @@ void repair_worktree_at_path(const char *path,
 		goto done;
 	}
 
-	strbuf_addf(&gitdir, "%s/gitdir", backlink);
+	strbuf_addf(&gitdir, "%s/gitdir", backlink.buf);
 	if (strbuf_read_file(&olddotgit, gitdir.buf, 0) < 0)
 		repair = _("gitdir unreadable");
 	else {
@@ -729,7 +728,7 @@ void repair_worktree_at_path(const char *path,
 		write_file(gitdir.buf, "%s", realdotgit.buf);
 	}
 done:
-	free(backlink);
+	strbuf_release(&backlink);
 	strbuf_release(&olddotgit);
 	strbuf_release(&gitdir);
 	strbuf_release(&realdotgit);
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-05  1:21 [RFC PATCH 0/3] Parallel submodule status Atneya Nair
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
@ 2024-03-05  1:21 ` Atneya Nair
  2024-03-05 17:08   ` Junio C Hamano
  2024-03-05  1:21 ` [RFC PATCH 3/3] Preload submodule state in refresh_index Atneya Nair
  2 siblings, 1 reply; 17+ messages in thread
From: Atneya Nair @ 2024-03-05  1:21 UTC (permalink / raw)
  To: git; +Cc: gitster, jeffhost, me, nasamuffin, Atneya Nair

To enable parallel update of the read cache for submodules,
ce_compare_gitlink must be thread safe (for different objects).

Remove string interning in do_config_from (called from
repo_submodule_init) and add locking around accessing the ref_store_map.

Signed-off-by: Atneya Nair <atneya@google.com>
---

Notes:
    Chasing down thread unsafe code was done using tsan.
    
    Open questions:
     - Is there any additional thread-unsafety that was missed?
     - Is it safe to strdup in do_config_from (do the filenames need static
       lifetime)? What is the performance implication (it seems small)?
     - Is the locking around ref_store_map appropriate and/or can it be made
       more granular?

 config.c | 3 ++-
 config.h | 2 +-
 refs.c   | 9 +++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index 3cfeb3d8bd..d7f73d8745 100644
--- a/config.c
+++ b/config.c
@@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
 			    enum config_scope scope,
 			    struct key_value_info *out)
 {
-	out->filename = strintern(cs->name);
+	out->filename = strdup(cs->name);
 	out->origin_type = cs->origin_type;
 	out->linenr = cs->linenr;
 	out->scope = scope;
@@ -1857,6 +1857,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
 
 	strbuf_release(&top->value);
 	strbuf_release(&top->var);
+	free(kvi.filename);
 
 	return ret;
 }
diff --git a/config.h b/config.h
index 5dba984f77..b78f1b6667 100644
--- a/config.h
+++ b/config.h
@@ -118,7 +118,7 @@ struct config_options {
 
 /* Config source metadata for a given config key-value pair */
 struct key_value_info {
-	const char *filename;
+	char *filename;
 	int linenr;
 	enum config_origin_type origin_type;
 	enum config_scope scope;
diff --git a/refs.c b/refs.c
index c633abf284..cce8a31b22 100644
--- a/refs.c
+++ b/refs.c
@@ -2126,6 +2126,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
 	size_t len;
 	struct repository *subrepo;
 
+	// TODO is this locking tolerable, and/or can we get any finer
+	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
 	if (!submodule)
 		return NULL;
 
@@ -2139,7 +2142,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
 		/* We need to strip off one or more trailing slashes */
 		submodule = to_free = xmemdupz(submodule, len);
 
+	pthread_mutex_lock(&lock);
 	refs = lookup_ref_store_map(&submodule_ref_stores, submodule);
+	pthread_mutex_unlock(&lock);
 	if (refs)
 		goto done;
 
@@ -2162,10 +2167,14 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
 		free(subrepo);
 		goto done;
 	}
+
+	pthread_mutex_lock(&lock);
+	// TODO maybe lock this separately
 	refs = ref_store_init(subrepo, submodule_sb.buf,
 			      REF_STORE_READ | REF_STORE_ODB);
 	register_ref_store_map(&submodule_ref_stores, "submodule",
 			       refs, submodule);
+	pthread_mutex_unlock(&lock);
 
 done:
 	strbuf_release(&submodule_sb);
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 3/3] Preload submodule state in refresh_index
  2024-03-05  1:21 [RFC PATCH 0/3] Parallel submodule status Atneya Nair
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
  2024-03-05  1:21 ` [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe Atneya Nair
@ 2024-03-05  1:21 ` Atneya Nair
  2 siblings, 0 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-05  1:21 UTC (permalink / raw)
  To: git; +Cc: gitster, jeffhost, me, nasamuffin, Atneya Nair

refresh_index currently parallelizes updating cache_entries for regular
files based on lstat. Expand preload to parallelize exploring the
checked out state of submodules, which is substantially more expensive.

Cache the state of the submodule in memory, to avoid unnecessary
re-computation (similar to regular files).

This speeds up git status, and other operations which examine the read
index, especially in repositories with many submodules.

Signed-off-by: Atneya Nair <atneya@google.com>
---

Notes:
    For now, I added a new field to store the submodule state.
    
    Open questions:
    - Where can we efficiently store the submodule state? I assume we can
    re-use some of the ce_flags which aren't used for submodules?
    
    - Why can threads only go up to 64? Can we make this user configurable?

 preload-index.c | 25 ++++++++++++++++++++++---
 read-cache-ll.h |  1 +
 read-cache.c    |  3 +++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/preload-index.c b/preload-index.c
index 63fd35d64b..091b22fa4c 100644
--- a/preload-index.c
+++ b/preload-index.c
@@ -22,7 +22,7 @@
  * to have at least 500 lstat's per thread for it to
  * be worth starting a thread.
  */
-#define MAX_PARALLEL (20)
+#define MAX_PARALLEL (60)
 #define THREAD_COST (500)
 
 struct progress_data {
@@ -59,8 +59,21 @@ static void *preload_thread(void *_data)
 
 		if (ce_stage(ce))
 			continue;
-		if (S_ISGITLINK(ce->ce_mode))
+		if (S_ISGITLINK(ce->ce_mode)) {
+			// This call evaluates the submodule HEAD for GITLINK, which really does determine
+			// if there is a change (for index purposes). We can't use the traditional path of
+			// marking as VALID, because valid can't be used for submodules due to other code
+			// paths in which valid may skip investigation of the worktree in the submodule.
+			// Gitlinks also aren't statable, or fsmonitorable, so caching doesn't have the same
+			// semantics.
+			// Use a special entry to mark the ref change state and its validity. Future calls
+			// to ce_compare_gitlink will leverage this.
+			if (lstat(ce->name, &st))
+				continue;
+			ce->sub_ref_state = (!!(ie_match_stat(index, ce, &st,
+					CE_MATCH_RACY_IS_DIRTY|CE_MATCH_IGNORE_FSMONITOR) & DATA_CHANGED) << 1) | 0x1;
 			continue;
+		}
 		if (ce_uptodate(ce))
 			continue;
 		if (ce_skip_worktree(ce))
@@ -107,11 +120,17 @@ void preload_index(struct index_state *index,
 	struct thread_data data[MAX_PARALLEL];
 	struct progress_data pd;
 	int t2_sum_lstat = 0;
+	int link_count = 0;
 
 	if (!HAVE_THREADS || !core_preload_index)
 		return;
 
-	threads = index->cache_nr / THREAD_COST;
+	for (i = 0; i < index->cache_nr; i++) {
+		link_count += (S_ISGITLINK(index->cache[i]->ce_mode));
+	}
+	// Exploring gitlinks are much more expensive than lstat, so modify the cost
+	threads = (index->cache_nr / THREAD_COST) + (link_count / 25);
+
 	if ((index->cache_nr > 1) && (threads < 2) && git_env_bool("GIT_TEST_PRELOAD_INDEX", 0))
 		threads = 2;
 	if (threads < 2)
diff --git a/read-cache-ll.h b/read-cache-ll.h
index 2a50a784f0..5555bb0ae9 100644
--- a/read-cache-ll.h
+++ b/read-cache-ll.h
@@ -27,6 +27,7 @@ struct cache_entry {
 	unsigned int mem_pool_allocated;
 	unsigned int ce_namelen;
 	unsigned int index;	/* for link extension */
+	unsigned int sub_ref_state; /* TODO pack somewhere. Lowest bit valid, second lowest dirty. */
 	struct object_id oid;
 	char name[FLEX_ARRAY]; /* more */
 };
diff --git a/read-cache.c b/read-cache.c
index f546cf7875..541d40ca30 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -271,6 +271,9 @@ static int ce_compare_gitlink(const struct cache_entry *ce)
 	 *
 	 * If so, we consider it always to match.
 	 */
+	if (ce->sub_ref_state & 0x1)
+		/* Check the cached value */
+		return ce->sub_ref_state >> 1;
 	if (resolve_gitlink_ref(ce->name, "HEAD", &oid) < 0)
 		return 0;
 	return !oideq(&oid, &ce->oid);
-- 
2.44.0.rc1.240.g4c46232300-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
@ 2024-03-05  2:22   ` Junio C Hamano
  2024-03-05  4:29   ` Eric Sunshine
  2024-03-06  8:13   ` Jean-Noël Avila
  2 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2024-03-05  2:22 UTC (permalink / raw)
  To: Atneya Nair; +Cc: git, jeffhost, me, nasamuffin

Atneya Nair <atneya@google.com> writes:

> Continue the work in commit: 3d7747e318532a36a263c61cdf92f2decb6424ff
> to remove unsafe shared buffer usage in read_gitfile_gently.

The usual way to compose a log message of this project is to

 - Give an observation on how the current system work in the present
   tense (so no need to say "Currently X is Y", just "X is Y"), and
   discuss what you perceive as a problem in it.

 - Propose a solution (optional---often, problem description
   trivially leads to an obvious solution in reader's minds).

 - Give commands to the codebase to "become like so".

in this order.  The introductory part may become like so.  Notice
the format in which we usually refer to an existing commit:

    3d7747e3 (real_path: remove unsafe API, 2020-03-10) started the
    process of making more API functions thread-safe.  Many callers
    of read_gitfile() still depend on the shared buffer used to hold
    the return value, and convenience of not having to allocate/free
    their own storage, but this hinders multi-threading.

And the remainder of the proposed log message should propose a
solution and tell the codebase to become like so.

See Documentation/SubmittingPatches for other general guidelines.
Documentation/CodingGuidelines would also be helpful.

> Migrate read_gitfile_gently and resolve_gitfile_gently to take strbuf
> out params to allocate their return values into, rather than returning
> a view into a shared buffer.
>
> Leave the shared buffer in case a caller passes null for this param (for
> cases we haven't cleaned up yet).
>
> Migrate callers of resolve_gitfile to resolve_gitfile_gently.
>
> Signed-off-by: Atneya Nair <atneya@google.com>
> ---
>
> Notes:
>     Open questions:
>      - Is checking the return value of read_gitfile necessary if on error,
>        we are supposed to die, or set the error field to a non-zero value?

The conversion done by this step seems to be a more or less faithful
rewrite.  If the original checked for an error from the original
read_gitfile_gently() by seeing if it returned a NULL, the updated
code should instead check for what is returned in &err.  And that
seems to be what you did, which is good.

>      - Should we clean up the other call-sites of read_gitfile?

It is OK to do the ones you care about (read: needed to be converted
before you can do the multi-threading) first, and then update the
rest.

The conversion itself in the patch looked obvious and trivially
correct from a cursory read, but I have to admit that my
concentration level was not sufficiently high while I reviewed the
patch, as my reading was constantly hiccupping whenever I saw a
coding style glitches.  The authors can help by sticking to what the
Documentation/CodingGuidelines document says, paying special
attention to the comment styles, decl-after-statement, rules
regarding braces around a single-statement block.

Thanks.

>  builtin/init-db.c   |  7 ++++---
>  builtin/rev-parse.c |  4 +++-
>  repository.c        |  9 +++++----
>  setup.c             | 36 +++++++++++++++++++++++++-----------
>  setup.h             |  7 +++----
>  submodule.c         | 32 +++++++++++++++++++++++---------
>  worktree.c          | 27 +++++++++++++--------------
>  7 files changed, 76 insertions(+), 46 deletions(-)
>
> diff --git a/builtin/init-db.c b/builtin/init-db.c
> index 0170469b84..9135d07a0d 100644
> --- a/builtin/init-db.c
> +++ b/builtin/init-db.c
> @@ -198,11 +198,11 @@ int cmd_init_db(int argc, const char **argv, const char *prefix)
>  	 */
>  	if (real_git_dir) {
>  		int err;
> -		const char *p;
>  		struct strbuf sb = STRBUF_INIT;
> +		struct strbuf gitfile = STRBUF_INIT;
>  
> -		p = read_gitfile_gently(git_dir, &err);
> -		if (p && get_common_dir(&sb, p)) {
> +		read_gitfile_gently(git_dir, &err, &gitfile);
> +		if (!err && get_common_dir(&sb, gitfile.buf)) {
>  			struct strbuf mainwt = STRBUF_INIT;
>  
>  			strbuf_addbuf(&mainwt, &sb);
> @@ -213,6 +213,7 @@ int cmd_init_db(int argc, const char **argv, const char *prefix)
>  			git_dir = strbuf_detach(&sb, NULL);
>  		}
>  		strbuf_release(&sb);
> +		strbuf_release(&gitfile);
>  	}
>  
>  	if (is_bare_repository_cfg < 0)
> diff --git a/builtin/rev-parse.c b/builtin/rev-parse.c
> index d08987646a..e1db6b3231 100644
> --- a/builtin/rev-parse.c
> +++ b/builtin/rev-parse.c
> @@ -728,12 +728,14 @@ int cmd_rev_parse(int argc, const char **argv, const char *prefix)
>  			}
>  			if (!strcmp(arg, "--resolve-git-dir")) {
>  				const char *gitdir = argv[++i];
> +				struct strbuf resolved_gitdir = STRBUF_INIT;
>  				if (!gitdir)
>  					die(_("--resolve-git-dir requires an argument"));
> -				gitdir = resolve_gitdir(gitdir);
> +				gitdir = resolve_gitdir_gently(gitdir, NULL, &resolved_gitdir);
>  				if (!gitdir)
>  					die(_("not a gitdir '%s'"), argv[i]);
>  				puts(gitdir);
> +				strbuf_release(&resolved_gitdir);
>  				continue;
>  			}
>  		}
> diff --git a/repository.c b/repository.c
> index 7aacb51b65..3ca6dbcf16 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -118,7 +118,7 @@ static int repo_init_gitdir(struct repository *repo, const char *gitdir)
>  	int ret = 0;
>  	int error = 0;
>  	char *abspath = NULL;
> -	const char *resolved_gitdir;
> +	struct strbuf resolved_gitdir = STRBUF_INIT;
>  	struct set_gitdir_args args = { NULL };
>  
>  	abspath = real_pathdup(gitdir, 0);
> @@ -128,15 +128,16 @@ static int repo_init_gitdir(struct repository *repo, const char *gitdir)
>  	}
>  
>  	/* 'gitdir' must reference the gitdir directly */
> -	resolved_gitdir = resolve_gitdir_gently(abspath, &error);
> -	if (!resolved_gitdir) {
> +	resolve_gitdir_gently(abspath, &error, &resolved_gitdir);
> +	if (error) {
>  		ret = -1;
>  		goto out;
>  	}
>  
> -	repo_set_gitdir(repo, resolved_gitdir, &args);
> +	repo_set_gitdir(repo, resolved_gitdir.buf, &args);
>  
>  out:
> +	strbuf_release(&resolved_gitdir);
>  	free(abspath);
>  	return ret;
>  }
> diff --git a/setup.c b/setup.c
> index b69b1cbc2a..2e118cf216 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -397,14 +397,17 @@ int is_nonbare_repository_dir(struct strbuf *path)
>  	int ret = 0;
>  	int gitfile_error;
>  	size_t orig_path_len = path->len;
> +	struct strbuf gitfile = STRBUF_INIT;
> +
>  	assert(orig_path_len != 0);
>  	strbuf_complete(path, '/');
>  	strbuf_addstr(path, ".git");
> -	if (read_gitfile_gently(path->buf, &gitfile_error) || is_git_directory(path->buf))
> +	if (read_gitfile_gently(path->buf, &gitfile_error, &gitfile) || is_git_directory(path->buf))
>  		ret = 1;
>  	if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
>  	    gitfile_error == READ_GITFILE_ERR_READ_FAILED)
>  		ret = 1;
> +	strbuf_release(&gitfile);
>  	strbuf_setlen(path, orig_path_len);
>  	return ret;
>  }
> @@ -830,7 +833,8 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
>  
>  /*
>   * Try to read the location of the git directory from the .git file,
> - * return path to git directory if found. The return value comes from
> + * return path to git directory if found. If passed a valid strbuf, the return
> + * value is is a ptr to within the buffer. If strbuf is null, the return value comes from
>   * a shared buffer.
>   *
>   * On failure, if return_error_code is not NULL, return_error_code
> @@ -838,7 +842,7 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
>   * return_error_code is NULL the function will die instead (for most
>   * cases).
>   */
> -const char *read_gitfile_gently(const char *path, int *return_error_code)
> +const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf* result_buf)
>  {
>  	const int max_file_size = 1 << 20;  /* 1MB */
>  	int error_code = 0;
> @@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>  	struct stat st;
>  	int fd;
>  	ssize_t len;
> -	static struct strbuf realpath = STRBUF_INIT;
> +	static struct strbuf shared = STRBUF_INIT;
> +	if (!result_buf) {
> +		result_buf = &shared;
> +	}
>  
>  	if (stat(path, &st)) {
>  		/* NEEDSWORK: discern between ENOENT vs other errors */
> @@ -900,8 +907,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>  		goto cleanup_return;
>  	}
>  
> -	strbuf_realpath(&realpath, dir, 1);
> -	path = realpath.buf;
> +	strbuf_realpath(result_buf, dir, 1);
> +	path = result_buf->buf;
> +	// TODO is this valid?
> +	if (!path) die(_("Unexpected null from realpath '%s'"), dir);
>  
>  cleanup_return:
>  	if (return_error_code)
> @@ -1316,12 +1325,13 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
>  		int offset = dir->len, error_code = 0;
>  		char *gitdir_path = NULL;
>  		char *gitfile = NULL;
> +		struct strbuf gitdirenvbuf = STRBUF_INIT;
>  
>  		if (offset > min_offset)
>  			strbuf_addch(dir, '/');
>  		strbuf_addstr(dir, DEFAULT_GIT_DIR_ENVIRONMENT);
>  		gitdirenv = read_gitfile_gently(dir->buf, die_on_error ?
> -						NULL : &error_code);
> +						NULL : &error_code, &gitdirenvbuf);
>  		if (!gitdirenv) {
>  			if (die_on_error ||
>  			    error_code == READ_GITFILE_ERR_NOT_A_FILE) {
> @@ -1330,8 +1340,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
>  					gitdirenv = DEFAULT_GIT_DIR_ENVIRONMENT;
>  					gitdir_path = xstrdup(dir->buf);
>  				}
> -			} else if (error_code != READ_GITFILE_ERR_STAT_FAILED)
> +			} else if (error_code != READ_GITFILE_ERR_STAT_FAILED) {
> +				strbuf_release(&gitdirenvbuf);
>  				return GIT_DIR_INVALID_GITFILE;
> +			}
>  		} else
>  			gitfile = xstrdup(dir->buf);
>  		/*
> @@ -1365,9 +1377,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
>  			 */
>  			free(gitdir_path);
>  			free(gitfile);
> -
> +			strbuf_release(&gitdirenvbuf);
>  			return ret;
>  		}
> +		strbuf_release(&gitdirenvbuf);
>  
>  		if (is_git_directory(dir->buf)) {
>  			trace2_data_string("setup", NULL, "implicit-bare-repository", dir->buf);
> @@ -1692,11 +1705,12 @@ const char *setup_git_directory(void)
>  	return setup_git_directory_gently(NULL);
>  }
>  
> -const char *resolve_gitdir_gently(const char *suspect, int *return_error_code)
> +const char *resolve_gitdir_gently(const char *suspect, int *return_error_code,
> +		struct strbuf* return_buf)
>  {
>  	if (is_git_directory(suspect))
>  		return suspect;
> -	return read_gitfile_gently(suspect, return_error_code);
> +	return read_gitfile_gently(suspect, return_error_code, return_buf);
>  }
>  
>  /* if any standard file descriptor is missing open it to /dev/null */
> diff --git a/setup.h b/setup.h
> index 3599aec93c..cf5a63762b 100644
> --- a/setup.h
> +++ b/setup.h
> @@ -36,10 +36,9 @@ int is_nonbare_repository_dir(struct strbuf *path);
>  #define READ_GITFILE_ERR_NOT_A_REPO 7
>  #define READ_GITFILE_ERR_TOO_LARGE 8
>  void read_gitfile_error_die(int error_code, const char *path, const char *dir);
> -const char *read_gitfile_gently(const char *path, int *return_error_code);
> -#define read_gitfile(path) read_gitfile_gently((path), NULL)
> -const char *resolve_gitdir_gently(const char *suspect, int *return_error_code);
> -#define resolve_gitdir(path) resolve_gitdir_gently((path), NULL)
> +const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf *buf);
> +#define read_gitfile(path) read_gitfile_gently((path), NULL, NULL)
> +const char *resolve_gitdir_gently(const char *suspect, int *return_error_code, struct strbuf *buf);
>  
>  void setup_work_tree(void);
>  
> diff --git a/submodule.c b/submodule.c
> index 213da79f66..455444321b 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -316,9 +316,10 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
>  	int ret = 0;
>  	char *gitdir = xstrfmt("%s/.git", path);
>  
> -	if (resolve_gitdir_gently(gitdir, return_error_code))
> +	struct strbuf resolved_gitdir_buf = STRBUF_INIT;
> +	if (resolve_gitdir_gently(gitdir, return_error_code, &resolved_gitdir_buf))
>  		ret = 1;
> -
> +	strbuf_release(&resolved_gitdir_buf);
>  	free(gitdir);
>  	return ret;
>  }
> @@ -1879,22 +1880,25 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>  {
>  	struct child_process cp = CHILD_PROCESS_INIT;
>  	struct strbuf buf = STRBUF_INIT;
> +	struct strbuf gitdirbuf = STRBUF_INIT;
>  	FILE *fp;
>  	unsigned dirty_submodule = 0;
>  	const char *git_dir;
>  	int ignore_cp_exit_code = 0;
>  
>  	strbuf_addf(&buf, "%s/.git", path);
> -	git_dir = read_gitfile(buf.buf);
> +	git_dir = read_gitfile_gently(buf.buf, NULL, &gitdirbuf);
>  	if (!git_dir)
>  		git_dir = buf.buf;
>  	if (!is_git_directory(git_dir)) {
>  		if (is_directory(git_dir))
>  			die(_("'%s' not recognized as a git repository"), git_dir);
>  		strbuf_release(&buf);
> +		strbuf_release(&gitdirbuf);
>  		/* The submodule is not checked out, so it is not modified */
>  		return 0;
>  	}
> +		strbuf_release(&gitdirbuf);
>  	strbuf_reset(&buf);
>  
>  	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
> @@ -1958,15 +1962,16 @@ int submodule_uses_gitfile(const char *path)
>  {
>  	struct child_process cp = CHILD_PROCESS_INIT;
>  	struct strbuf buf = STRBUF_INIT;
> -	const char *git_dir;
> +	struct strbuf gitfilebuf = STRBUF_INIT;
>  
>  	strbuf_addf(&buf, "%s/.git", path);
> -	git_dir = read_gitfile(buf.buf);
> -	if (!git_dir) {
> +	read_gitfile_gently(buf.buf, NULL, &gitfilebuf);
> +	if (!gitfilebuf.buf) {
>  		strbuf_release(&buf);
>  		return 0;
>  	}
>  	strbuf_release(&buf);
> +	strbuf_release(&gitfilebuf);
>  
>  	/* Now test that all nested submodules use a gitfile too */
>  	strvec_pushl(&cp.args,
> @@ -2276,6 +2281,7 @@ static void relocate_single_git_dir_into_superproject(const char *path,
>  {
>  	char *old_git_dir = NULL, *real_old_git_dir = NULL, *real_new_git_dir = NULL;
>  	struct strbuf new_gitdir = STRBUF_INIT;
> +	struct strbuf gitfilebuf = STRBUF_INIT;
>  	const struct submodule *sub;
>  
>  	if (submodule_uses_worktrees(path))
> @@ -2283,9 +2289,12 @@ static void relocate_single_git_dir_into_superproject(const char *path,
>  		      "more than one worktree not supported"), path);
>  
>  	old_git_dir = xstrfmt("%s/.git", path);
> -	if (read_gitfile(old_git_dir))
> +	if (read_gitfile_gently(old_git_dir, NULL, &gitfilebuf)) {
>  		/* If it is an actual gitfile, it doesn't need migration. */
> +		strbuf_release(&gitfilebuf);
>  		return;
> +	}
> +	strbuf_release(&gitfilebuf);
>  
>  	real_old_git_dir = real_pathdup(old_git_dir, 1);
>  
> @@ -2343,8 +2352,9 @@ void absorb_git_dir_into_superproject(const char *path,
>  	int err_code;
>  	const char *sub_git_dir;
>  	struct strbuf gitdir = STRBUF_INIT;
> +	struct strbuf resolved_gitdir_buf = STRBUF_INIT;
>  	strbuf_addf(&gitdir, "%s/.git", path);
> -	sub_git_dir = resolve_gitdir_gently(gitdir.buf, &err_code);
> +	sub_git_dir = resolve_gitdir_gently(gitdir.buf, &err_code, &resolved_gitdir_buf);
>  
>  	/* Not populated? */
>  	if (!sub_git_dir) {
> @@ -2385,6 +2395,8 @@ void absorb_git_dir_into_superproject(const char *path,
>  		free(real_sub_git_dir);
>  		free(real_common_git_dir);
>  	}
> +
> +	strbuf_release(&resolved_gitdir_buf);
>  	strbuf_release(&gitdir);
>  
>  	absorb_git_dir_into_superproject_recurse(path, super_prefix);
> @@ -2484,17 +2496,19 @@ int submodule_to_gitdir(struct strbuf *buf, const char *submodule)
>  	const struct submodule *sub;
>  	const char *git_dir;
>  	int ret = 0;
> +	struct strbuf gitfilebuf = STRBUF_INIT;
>  
>  	strbuf_reset(buf);
>  	strbuf_addstr(buf, submodule);
>  	strbuf_complete(buf, '/');
>  	strbuf_addstr(buf, ".git");
>  
> -	git_dir = read_gitfile(buf->buf);
> +	git_dir = read_gitfile_gently(buf->buf, NULL, &gitfilebuf);
>  	if (git_dir) {
>  		strbuf_reset(buf);
>  		strbuf_addstr(buf, git_dir);
>  	}
> +	strbuf_release(&gitfilebuf);
>  	if (!is_git_directory(buf->buf)) {
>  		sub = submodule_from_path(the_repository, null_oid(),
>  					  submodule);
> diff --git a/worktree.c b/worktree.c
> index b02a05a74a..a6f125c8da 100644
> --- a/worktree.c
> +++ b/worktree.c
> @@ -309,7 +309,7 @@ int validate_worktree(const struct worktree *wt, struct strbuf *errmsg,
>  {
>  	struct strbuf wt_path = STRBUF_INIT;
>  	struct strbuf realpath = STRBUF_INIT;
> -	char *path = NULL;
> +	struct strbuf gitfile = STRBUF_INIT;
>  	int err, ret = -1;
>  
>  	strbuf_addf(&wt_path, "%s/.git", wt->path);
> @@ -353,21 +353,20 @@ int validate_worktree(const struct worktree *wt, struct strbuf *errmsg,
>  		goto done;
>  	}
>  
> -	path = xstrdup_or_null(read_gitfile_gently(wt_path.buf, &err));
> -	if (!path) {
> +	if (!read_gitfile_gently(wt_path.buf, &err, &gitfile)) {
>  		strbuf_addf_gently(errmsg, _("'%s' is not a .git file, error code %d"),
>  				   wt_path.buf, err);
>  		goto done;
>  	}
>  
>  	strbuf_realpath(&realpath, git_common_path("worktrees/%s", wt->id), 1);
> -	ret = fspathcmp(path, realpath.buf);
> +	ret = fspathcmp(gitfile.buf, realpath.buf);
>  
>  	if (ret)
>  		strbuf_addf_gently(errmsg, _("'%s' does not point back to '%s'"),
>  				   wt->path, git_common_path("worktrees/%s", wt->id));
>  done:
> -	free(path);
> +	strbuf_release(&gitfile);
>  	strbuf_release(&wt_path);
>  	strbuf_release(&realpath);
>  	return ret;
> @@ -567,7 +566,7 @@ static void repair_gitfile(struct worktree *wt,
>  {
>  	struct strbuf dotgit = STRBUF_INIT;
>  	struct strbuf repo = STRBUF_INIT;
> -	char *backlink;
> +	struct strbuf backlink = STRBUF_INIT;
>  	const char *repair = NULL;
>  	int err;
>  
> @@ -582,13 +581,13 @@ static void repair_gitfile(struct worktree *wt,
>  
>  	strbuf_realpath(&repo, git_common_path("worktrees/%s", wt->id), 1);
>  	strbuf_addf(&dotgit, "%s/.git", wt->path);
> -	backlink = xstrdup_or_null(read_gitfile_gently(dotgit.buf, &err));
> +	read_gitfile_gently(dotgit.buf, &err, &backlink);
>  
>  	if (err == READ_GITFILE_ERR_NOT_A_FILE)
>  		fn(1, wt->path, _(".git is not a file"), cb_data);
>  	else if (err)
>  		repair = _(".git file broken");
> -	else if (fspathcmp(backlink, repo.buf))
> +	else if (fspathcmp(backlink.buf, repo.buf))
>  		repair = _(".git file incorrect");
>  
>  	if (repair) {
> @@ -596,7 +595,7 @@ static void repair_gitfile(struct worktree *wt,
>  		write_file(dotgit.buf, "gitdir: %s", repo.buf);
>  	}
>  
> -	free(backlink);
> +	strbuf_release(&backlink);
>  	strbuf_release(&repo);
>  	strbuf_release(&dotgit);
>  }
> @@ -685,7 +684,7 @@ void repair_worktree_at_path(const char *path,
>  	struct strbuf realdotgit = STRBUF_INIT;
>  	struct strbuf gitdir = STRBUF_INIT;
>  	struct strbuf olddotgit = STRBUF_INIT;
> -	char *backlink = NULL;
> +	struct strbuf backlink = STRBUF_INIT;
>  	const char *repair = NULL;
>  	int err;
>  
> @@ -701,12 +700,12 @@ void repair_worktree_at_path(const char *path,
>  		goto done;
>  	}
>  
> -	backlink = xstrdup_or_null(read_gitfile_gently(realdotgit.buf, &err));
> +	read_gitfile_gently(realdotgit.buf, &err, &backlink);
>  	if (err == READ_GITFILE_ERR_NOT_A_FILE) {
>  		fn(1, realdotgit.buf, _("unable to locate repository; .git is not a file"), cb_data);
>  		goto done;
>  	} else if (err == READ_GITFILE_ERR_NOT_A_REPO) {
> -		if (!(backlink = infer_backlink(realdotgit.buf))) {
> +		if (!(backlink.buf = infer_backlink(realdotgit.buf))) {
>  			fn(1, realdotgit.buf, _("unable to locate repository; .git file does not reference a repository"), cb_data);
>  			goto done;
>  		}
> @@ -715,7 +714,7 @@ void repair_worktree_at_path(const char *path,
>  		goto done;
>  	}
>  
> -	strbuf_addf(&gitdir, "%s/gitdir", backlink);
> +	strbuf_addf(&gitdir, "%s/gitdir", backlink.buf);
>  	if (strbuf_read_file(&olddotgit, gitdir.buf, 0) < 0)
>  		repair = _("gitdir unreadable");
>  	else {
> @@ -729,7 +728,7 @@ void repair_worktree_at_path(const char *path,
>  		write_file(gitdir.buf, "%s", realdotgit.buf);
>  	}
>  done:
> -	free(backlink);
> +	strbuf_release(&backlink);
>  	strbuf_release(&olddotgit);
>  	strbuf_release(&gitdir);
>  	strbuf_release(&realdotgit);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
  2024-03-05  2:22   ` Junio C Hamano
@ 2024-03-05  4:29   ` Eric Sunshine
  2024-03-12 20:38     ` Atneya Nair
  2024-03-06  8:13   ` Jean-Noël Avila
  2 siblings, 1 reply; 17+ messages in thread
From: Eric Sunshine @ 2024-03-05  4:29 UTC (permalink / raw)
  To: Atneya Nair; +Cc: git, gitster, jeffhost, me, nasamuffin

On Mon, Mar 4, 2024 at 8:22 PM Atneya Nair <atneya@google.com> wrote:
> Continue the work in commit: 3d7747e318532a36a263c61cdf92f2decb6424ff
> to remove unsafe shared buffer usage in read_gitfile_gently.
>
> Migrate read_gitfile_gently and resolve_gitfile_gently to take strbuf
> out params to allocate their return values into, rather than returning
> a view into a shared buffer.
>
> Leave the shared buffer in case a caller passes null for this param (for
> cases we haven't cleaned up yet).
>
> Migrate callers of resolve_gitfile to resolve_gitfile_gently.
>
> Signed-off-by: Atneya Nair <atneya@google.com>
> ---
> diff --git a/builtin/init-db.c b/builtin/init-db.c
> @@ -198,11 +198,11 @@ int cmd_init_db(int argc, const char **argv, const char *prefix)
> -               const char *p;
> +               struct strbuf gitfile = STRBUF_INIT;
>
> -               p = read_gitfile_gently(git_dir, &err);
> -               if (p && get_common_dir(&sb, p)) {
> +               read_gitfile_gently(git_dir, &err, &gitfile);
> +               if (!err && get_common_dir(&sb, gitfile.buf)) {
>                         struct strbuf mainwt = STRBUF_INIT;

If you're going to adopt this idiom of checking `err` rather than the
return code of read_gitfile_gently(), then you should document that
`err` will be set to zero in the success case. Presently, the
documentation for read_gitfile_gently() only talks about the failure
case and doesn't mention that zero indicates success.

> diff --git a/setup.c b/setup.c
> @@ -830,7 +833,8 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
>  /*
>   * Try to read the location of the git directory from the .git file,
> - * return path to git directory if found. The return value comes from
> + * return path to git directory if found. If passed a valid strbuf, the return
> + * value is is a ptr to within the buffer. If strbuf is null, the return value comes from
>   * a shared buffer.

What is "a valid strbuf"? Perhaps say instead "if `result_buf` is not
NULL, ...". The "is not NULL" wording is consistent with the existing
wording used below...

Also...
  s/is is/is/
  s/ptr/pointer/

>   * On failure, if return_error_code is not NULL, return_error_code

... "is not NULL" wording is already used here.

> @@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
> -       static struct strbuf realpath = STRBUF_INIT;
> +       static struct strbuf shared = STRBUF_INIT;
> +       if (!result_buf) {
> +               result_buf = &shared;
> +       }

Junio mentioned style violations in his response. Omit braces around
one line `if` bodies.

> @@ -900,8 +907,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
> -       strbuf_realpath(&realpath, dir, 1);
> -       path = realpath.buf;
> +       strbuf_realpath(result_buf, dir, 1);
> +       path = result_buf->buf;

It's a minor thing, but if you name the function argument `realpath`,
then the diff becomes less noisy since changes such as these do not
need to be made. On the other hand, if `realpath` isn't a good output
variable name, then by all means choose a better name.

> @@ -1316,12 +1325,13 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
> +               struct strbuf gitdirenvbuf = STRBUF_INIT;
>                 gitdirenv = read_gitfile_gently(dir->buf, die_on_error ?
> -                                               NULL : &error_code);
> +                                               NULL : &error_code, &gitdirenvbuf);
>                 if (!gitdirenv) {
> @@ -1330,8 +1340,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
> -                       } else if (error_code != READ_GITFILE_ERR_STAT_FAILED)
> +                       } else if (error_code != READ_GITFILE_ERR_STAT_FAILED) {
> +                               strbuf_release(&gitdirenvbuf);
>                                 return GIT_DIR_INVALID_GITFILE;
> +                       }

Releasing the strbuf before `return`. Good.

> @@ -1365,9 +1377,10 @@ static enum discovery_result setup_git_directory_gently_1(struct strbuf *dir,
>                         free(gitdir_path);
>                         free(gitfile);
> +                       strbuf_release(&gitdirenvbuf);
>                         return ret;

Likewise. Good.

>                 }
> +               strbuf_release(&gitdirenvbuf);
>
>                 if (is_git_directory(dir->buf)) {
>                         trace2_data_string("setup", NULL, "implicit-bare-repository", dir->buf);

There are additional `return` statements (not shown in the context)
following this code, but you make this final strbuf_release() call
before any of those other `return` statements can be taken. Good.

> diff --git a/submodule.c b/submodule.c
> @@ -316,9 +316,10 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
>         int ret = 0;
>         char *gitdir = xstrfmt("%s/.git", path);
>
> -       if (resolve_gitdir_gently(gitdir, return_error_code))
> +       struct strbuf resolved_gitdir_buf = STRBUF_INIT;
> +       if (resolve_gitdir_gently(gitdir, return_error_code, &resolved_gitdir_buf))
>                 ret = 1;

Style: Declare `resolved_gitdir_buf` along with `ret` and `gitdir`,
then have a blank line before the actual code.

> @@ -1879,22 +1880,25 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
> +       struct strbuf gitdirbuf = STRBUF_INIT;
> +               strbuf_release(&gitdirbuf);
>                 /* The submodule is not checked out, so it is not modified */
>                 return 0;
>         }
> +               strbuf_release(&gitdirbuf);
>         strbuf_reset(&buf);

Style: Strange indentation?

> @@ -1958,15 +1962,16 @@ int submodule_uses_gitfile(const char *path)
> -       const char *git_dir;
> +       struct strbuf gitfilebuf = STRBUF_INIT;
>
>         strbuf_addf(&buf, "%s/.git", path);
> -       git_dir = read_gitfile(buf.buf);
> -       if (!git_dir) {
> +       read_gitfile_gently(buf.buf, NULL, &gitfilebuf);
> +       if (!gitfilebuf.buf) {
>                 strbuf_release(&buf);
>                 return 0;
>         }

Not sure what you're trying to do here. strbuf guarantees that its
`buf` member will never be NULL, so the new `if (!gitfilebuf.buf)`
conditional seems to be dead code. If you really want to check whether
an error occurred, pass non-NULL for the second argument and check the
return value of read_gitfile_gently() or check the error code.

> diff --git a/worktree.c b/worktree.c
> @@ -685,7 +684,7 @@ void repair_worktree_at_path(const char *path,
> -       char *backlink = NULL;
> +       struct strbuf backlink = STRBUF_INIT;
> @@ -701,12 +700,12 @@ void repair_worktree_at_path(const char *path,
> -       backlink = xstrdup_or_null(read_gitfile_gently(realdotgit.buf, &err));
> +       read_gitfile_gently(realdotgit.buf, &err, &backlink);
>         if (err == READ_GITFILE_ERR_NOT_A_FILE) {
>                 fn(1, realdotgit.buf, _("unable to locate repository; .git is not a file"), cb_data);
>                 goto done;
>         } else if (err == READ_GITFILE_ERR_NOT_A_REPO) {
> -               if (!(backlink = infer_backlink(realdotgit.buf))) {
> +               if (!(backlink.buf = infer_backlink(realdotgit.buf))) {

Don't do this. Never modify the internal state of strbuf directly;
consider the state read-only. Modifications should only be made via
the API. You'll need to rewrite this code a bit to make it work
correctly with the changes proposed by this patch.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-05  1:21 ` [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe Atneya Nair
@ 2024-03-05 17:08   ` Junio C Hamano
  2024-03-05 18:53     ` Junio C Hamano
  2024-03-12 22:15     ` Atneya Nair
  0 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2024-03-05 17:08 UTC (permalink / raw)
  To: Atneya Nair; +Cc: git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

Atneya Nair <atneya@google.com> writes:

> To enable parallel update of the read cache for submodules,
> ce_compare_gitlink must be thread safe (for different objects).
>
> Remove string interning in do_config_from (called from
> repo_submodule_init) and add locking around accessing the ref_store_map.

This step does two independent things, even though they may have
dependencies, i.e., for one to be a solution for the problem it is
tackling, the other may have to be there already.  E.g., even after
calls to ce_compare_gitlink() get serialized via a mutex, it may for
some reason not work without giving each kvi.filename its own copy
[*], and if that is the case, you may need to have the "stop
interning" step in a single patch with its own justification, and
then "have mutex around ref_store calls" patch has to come after it.

    Side note: I do not know if that is the case myself.  I didn't
    write this commit, you did.  The above is just a sample to
    illustrate the expected level of depth to explain your thinking
    in the log message.

Or if these two things must happen at the same time, please explain
in the proposed log message why they have to happen in the same
commit.  The two paragraphs you wrote there don't explain that, so I
am assuming that it is not the case.

The use of strintern() comes originally from 3df8fd62 (add line
number and file name info to `config_set`, 2014-08-07) by Tanay
Abhra <tanayabh@gmail.com>, and survived a handful of changes

    809d8680 (config: pass ctx with config files, 2023-06-28)
    a798a56c (config.c: plumb the_reader through callbacks, 2023-03-28)
    c97f3ed2 (config.c: plumb config_source through static fns, 2023-03-28)

all of which were done by Glen Choo <glencbz@gmail.com>, so they may
know how safe the change on the config side would be (I still do
not understand why you'd want to do this in the first place, though,
especially if you are protecting the callsites with mutex).

I also think Emily's (who you already have on the "CC:" line) group
wants to libify the config machinery and suspect they may still be
making changes to the code, so you may want to coordinate with them
to avoid duplicated work and overlapping changes.

> Signed-off-by: Atneya Nair <atneya@google.com>
> ---
>
> Notes:
>     Chasing down thread unsafe code was done using tsan.

Very nice to know.

>  config.c | 3 ++-
>  config.h | 2 +-
>  refs.c   | 9 +++++++++
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/config.c b/config.c
> index 3cfeb3d8bd..d7f73d8745 100644
> --- a/config.c
> +++ b/config.c
> @@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
>  			    enum config_scope scope,
>  			    struct key_value_info *out)
>  {
> -	out->filename = strintern(cs->name);
> +	out->filename = strdup(cs->name);
>  	out->origin_type = cs->origin_type;
>  	out->linenr = cs->linenr;
>  	out->scope = scope;
> @@ -1857,6 +1857,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
>  
>  	strbuf_release(&top->value);
>  	strbuf_release(&top->var);
> +	free(kvi.filename);
>  
>  	return ret;
>  }
> diff --git a/config.h b/config.h
> index 5dba984f77..b78f1b6667 100644
> --- a/config.h
> +++ b/config.h
> @@ -118,7 +118,7 @@ struct config_options {
>  
>  /* Config source metadata for a given config key-value pair */
>  struct key_value_info {
> -	const char *filename;
> +	char *filename;
>  	int linenr;
>  	enum config_origin_type origin_type;
>  	enum config_scope scope;
> diff --git a/refs.c b/refs.c
> index c633abf284..cce8a31b22 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -2126,6 +2126,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  	size_t len;
>  	struct repository *subrepo;
>  
> +	// TODO is this locking tolerable, and/or can we get any finer
> +	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> +
>  	if (!submodule)
>  		return NULL;
>  
> @@ -2139,7 +2142,9 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  		/* We need to strip off one or more trailing slashes */
>  		submodule = to_free = xmemdupz(submodule, len);
>  
> +	pthread_mutex_lock(&lock);
>  	refs = lookup_ref_store_map(&submodule_ref_stores, submodule);
> +	pthread_mutex_unlock(&lock);
>  	if (refs)
>  		goto done;
>  
> @@ -2162,10 +2167,14 @@ struct ref_store *get_submodule_ref_store(const char *submodule)
>  		free(subrepo);
>  		goto done;
>  	}
> +
> +	pthread_mutex_lock(&lock);
> +	// TODO maybe lock this separately
>  	refs = ref_store_init(subrepo, submodule_sb.buf,
>  			      REF_STORE_READ | REF_STORE_ODB);
>  	register_ref_store_map(&submodule_ref_stores, "submodule",
>  			       refs, submodule);
> +	pthread_mutex_unlock(&lock);
>  
>  done:
>  	strbuf_release(&submodule_sb);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-05 17:08   ` Junio C Hamano
@ 2024-03-05 18:53     ` Junio C Hamano
  2024-03-06  1:23       ` Jeff King
  2024-03-12 22:15     ` Atneya Nair
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2024-03-05 18:53 UTC (permalink / raw)
  To: Atneya Nair; +Cc: git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

Junio C Hamano <gitster@pobox.com> writes:

> The use of strintern() comes originally from 3df8fd62 ...
> ..., so they may
> know how safe the change on the config side would be (I still do
> not understand why you'd want to do this in the first place, though,
> especially if you are protecting the callsites with mutex).

The risks of turning code that uses strintern() to use strdup() are

 * you will leak the allocated string unless you explicitly free the
   string you now own.

 * you may consume too much memory if you are creating too many
   copies of the same string (e.g. if you need filename for each
   line in a file in an application, the memory consumption can
   become 1000-fold).

 * the code may be taking advantage of the fact that two such
   strings can be compared for (in)equality simply by comparing
   their addresses, which you would need to adjust to use !strcmp()
   and the like.

I just checked to make sure that the last one is not the case for
our codebase, and you said you didn't see the second one is the case,
so the change may be a safe one to make.

One more thing.  Do not use strdup() without checking its return
value for failure.  It would be an easy fix to use xstrdup() instead.

Thanks.

>> diff --git a/config.c b/config.c
>> index 3cfeb3d8bd..d7f73d8745 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
>>  			    enum config_scope scope,
>>  			    struct key_value_info *out)
>>  {
>> -	out->filename = strintern(cs->name);
>> +	out->filename = strdup(cs->name);
>>  	out->origin_type = cs->origin_type;
>>  	out->linenr = cs->linenr;
>>  	out->scope = scope;
>> @@ -1857,6 +1857,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
>>  
>>  	strbuf_release(&top->value);
>>  	strbuf_release(&top->var);
>> +	free(kvi.filename);
>>  
>>  	return ret;
>>  }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-05 18:53     ` Junio C Hamano
@ 2024-03-06  1:23       ` Jeff King
  2024-03-06  1:58         ` Junio C Hamano
  2024-03-12 22:13         ` Atneya Nair
  0 siblings, 2 replies; 17+ messages in thread
From: Jeff King @ 2024-03-06  1:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Atneya Nair, git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

On Tue, Mar 05, 2024 at 10:53:02AM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > The use of strintern() comes originally from 3df8fd62 ...
> > ..., so they may
> > know how safe the change on the config side would be (I still do
> > not understand why you'd want to do this in the first place, though,
> > especially if you are protecting the callsites with mutex).
> 
> The risks of turning code that uses strintern() to use strdup() are
> 
>  * you will leak the allocated string unless you explicitly free the
>    string you now own.
> 
>  * you may consume too much memory if you are creating too many
>    copies of the same string (e.g. if you need filename for each
>    line in a file in an application, the memory consumption can
>    become 1000-fold).
> 
>  * the code may be taking advantage of the fact that two such
>    strings can be compared for (in)equality simply by comparing
>    their addresses, which you would need to adjust to use !strcmp()
>    and the like.

There is one more, I think: if you _do_ free the allocated string to
avoid the leak you mention, then some other code which was relying on
the lifetime of that string to be effectively infinite will now have a
user-after-free.

And I think that may be the case here. The "kvi" struct itself is local
to do_config_from(). But when we load the caching configset, we do so in
configset_add_value() which makes a shallow copy of the kvi struct. And
then that shallow copy may live on and be accessed with further calls to
git_config().

So doing just this:

diff --git a/config.c b/config.c
index 3cfeb3d8bd..2f6c83ffe7 100644
--- a/config.c
+++ b/config.c
@@ -1017,7 +1017,7 @@ static void kvi_from_source(struct config_source *cs,
 			    enum config_scope scope,
 			    struct key_value_info *out)
 {
-	out->filename = strintern(cs->name);
+	out->filename = xstrdup(cs->name);
 	out->origin_type = cs->origin_type;
 	out->linenr = cs->linenr;
 	out->scope = scope;
@@ -1855,6 +1855,7 @@ static int do_config_from(struct config_source *top, config_fn_t fn,
 
 	ret = git_parse_source(top, fn, &kvi, data, opts);
 
+	free((char *)kvi.filename);
 	strbuf_release(&top->value);
 	strbuf_release(&top->var);
 

will cause t4013.199 (among others) to fail when built with
SANITIZE=address, as it detects the user-after-free. I think you'd need
this on top:

diff --git a/config.c b/config.c
index 2f6c83ffe7..9854ca002d 100644
--- a/config.c
+++ b/config.c
@@ -2262,6 +2262,7 @@ static int configset_add_value(const struct key_value_info *kvi_p,
 	l_item->value_index = e->value_list.nr - 1;
 
 	*kv_info = *kvi_p;
+	kv_info->filename = xstrdup_or_null(kvi_p->filename); /* deep copy! */
 	si->util = kv_info;
 
 	return 0;

though probably an actual kvi_copy() function would be less horrible.

A few other things to note, looking at this code:

  - isn't kvi->path in the same boat? We do not duplicate it at all, so
    it seems like the shallow copy made in the configset could cause a
    user-after-free.

  - the "fix" I showed above hits your point 2: now we are making a lot
    more copies of that string. I will note that we're already making a
    lot of copies of the kvi struct in the first place, so unless you
    have really long pathnames, it probably isn't a big difference.

    But it possibly could make sense to have the configset own a single
    duplicate string, and then let the kvi structs it holds point to
    that string. But IMHO all of this should be details of the configset
    code, and the main config-iteration code should not have to worry
    about this at all. I.e., I think kvi_from_source() should not be
    duplicating anything in the first place.

-Peff

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-06  1:23       ` Jeff King
@ 2024-03-06  1:58         ` Junio C Hamano
  2024-03-12 22:13         ` Atneya Nair
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2024-03-06  1:58 UTC (permalink / raw)
  To: Jeff King
  Cc: Atneya Nair, git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

Jeff King <peff@peff.net> writes:

> There is one more, I think: if you _do_ free the allocated string to
> avoid the leak you mention, then some other code which was relying on
> the lifetime of that string to be effectively infinite will now have a
> user-after-free.

Ah, yes, you're right.  I completely forgot about that shallow copy.

> A few other things to note, looking at this code:
>
>   - isn't kvi->path in the same boat? We do not duplicate it at all, so
>     it seems like the shallow copy made in the configset could cause a
>     user-after-free.
>
>   - the "fix" I showed above hits your point 2: now we are making a lot
>     more copies of that string. I will note that we're already making a
>     lot of copies of the kvi struct in the first place, so unless you
>     have really long pathnames, it probably isn't a big difference.
>
>     But it possibly could make sense to have the configset own a single
>     duplicate string, and then let the kvi structs it holds point to
>     that string. But IMHO all of this should be details of the configset
>     code, and the main config-iteration code should not have to worry
>     about this at all. I.e., I think kvi_from_source() should not be
>     duplicating anything in the first place.

Thanks for a detailed write-up.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
  2024-03-05  2:22   ` Junio C Hamano
  2024-03-05  4:29   ` Eric Sunshine
@ 2024-03-06  8:13   ` Jean-Noël Avila
  2024-03-06 16:57     ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Jean-Noël Avila @ 2024-03-06  8:13 UTC (permalink / raw)
  To: Atneya Nair, git; +Cc: gitster, jeffhost, me, nasamuffin

Le 05/03/2024 à 02:21, Atneya Nair a écrit :

<snip>

> @@ -830,7 +833,8 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
>  
>  /*
>   * Try to read the location of the git directory from the .git file,
> - * return path to git directory if found. The return value comes from
> + * return path to git directory if found. If passed a valid strbuf, the return
> + * value is is a ptr to within the buffer. If strbuf is null, the return value comes from
>   * a shared buffer.
>   *
>   * On failure, if return_error_code is not NULL, return_error_code
> @@ -838,7 +842,7 @@ void read_gitfile_error_die(int error_code, const char *path, const char *dir)
>   * return_error_code is NULL the function will die instead (for most
>   * cases).
>   */
> -const char *read_gitfile_gently(const char *path, int *return_error_code)
> +const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf* result_buf)
>  {
>  	const int max_file_size = 1 << 20;  /* 1MB */
>  	int error_code = 0;
> @@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>  	struct stat st;
>  	int fd;
>  	ssize_t len;
> -	static struct strbuf realpath = STRBUF_INIT;
> +	static struct strbuf shared = STRBUF_INIT;
> +	if (!result_buf) {
> +		result_buf = &shared;
> +	}
>  

Question about general style: is it accepted practice to override a
parameter of a function?

I would have written:


@@ -838,7 +842,7 @@ void read_gitfile_error_die(int error_code, const
char *path, const char *dir)
   * return_error_code is NULL the function will die instead (for most
   * cases).
   */
-const char *read_gitfile_gently(const char *path, int *return_error_code)
+const char *read_gitfile_gently(const char *path, int
*return_error_code, struct strbuf* input_buf)
 {
 	const int max_file_size = 1 << 20;  /* 1MB */
 	int error_code = 0;
@@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path,
int *return_error_code)
 	struct stat st;
 	int fd;
 	ssize_t len;
-	static struct strbuf realpath = STRBUF_INIT;
+	static struct strbuf shared = STRBUF_INIT;
+       struct strbuf* result_buf;
+	if (!input_buf) {
+		result_buf = &shared;
+	} else  {
+		result_buf = input_buf;
+	}


>  	if (stat(path, &st)) {
>  		/* NEEDSWORK: discern between ENOENT vs other errors */
> @@ -900,8 +907,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>  		goto cleanup_return;
>  	}
>  
> -	strbuf_realpath(&realpath, dir, 1);
> -	path = realpath.buf;
> +	strbuf_realpath(result_buf, dir, 1);
> +	path = result_buf->buf;
> +	// TODO is this valid?
> +	if (!path) die(_("Unexpected null from realpath '%s'"), dir);

In fact, this is not a null path, but an empty path (null is not part of
the string).
By the way, shouldn't this be an internal bug instead of a message to
the user?


Thanks



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-06  8:13   ` Jean-Noël Avila
@ 2024-03-06 16:57     ` Junio C Hamano
  2024-03-12 20:44       ` Atneya Nair
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2024-03-06 16:57 UTC (permalink / raw)
  To: Jean-Noël Avila; +Cc: Atneya Nair, git, jeffhost, me, nasamuffin

Jean-Noël Avila <avila.jn@gmail.com> writes:

>> -const char *read_gitfile_gently(const char *path, int *return_error_code)
>> +const char *read_gitfile_gently(const char *path, int *return_error_code, struct strbuf* result_buf)
>>  {
>>  	const int max_file_size = 1 << 20;  /* 1MB */
>>  	int error_code = 0;
>> @@ -848,7 +852,10 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>>  	struct stat st;
>>  	int fd;
>>  	ssize_t len;
>> -	static struct strbuf realpath = STRBUF_INIT;
>> +	static struct strbuf shared = STRBUF_INIT;
>> +	if (!result_buf) {
>> +		result_buf = &shared;
>> +	}
>>  
>
> Question about general style: is it accepted practice to override a
> parameter of a function?

We do not forbid it.  We have a rule against enclosing a single
statement block inside {braces}, though ;-).

> I would have written:

If it matters to know what the caller-supplied value of the
parameter was, then we would probably write that way.  If it does
not, then the above is perfectly fine.  Even with the above, if a
later code really wanted to, it can compare the pointers to find out
if the caller was uninterested in the result (i.e., passed NULL),
but at that point, we may be better off to (re)write it your way.

>> -	strbuf_realpath(&realpath, dir, 1);
>> -	path = realpath.buf;
>> +	strbuf_realpath(result_buf, dir, 1);
>> +	path = result_buf->buf;
>> +	// TODO is this valid?
>> +	if (!path) die(_("Unexpected null from realpath '%s'"), dir);
>
> In fact, this is not a null path, but an empty path (null is not part of
> the string).
> By the way, shouldn't this be an internal bug instead of a message to
> the user?

Unless the strbuf instance the result_buf pointer points at is
corrupt, its .buf member should *NEVER* be NULL.  Testing for NULL
is meaningless, unless you are manually futzing with the members of
strbuf (you shouldn't).

Thanks for carefully reading.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-05  4:29   ` Eric Sunshine
@ 2024-03-12 20:38     ` Atneya Nair
  0 siblings, 0 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-12 20:38 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: git, gitster, jeffhost, me, nasamuffin

On Mon, Mar 4, 2024 at 8:29 PM Eric Sunshine <sunshine@sunshineco.com> wrote:

> Style: Declare `resolved_gitdir_buf` along with `ret` and `gitdir`,
> then have a blank line before the actual code.

Thanks for the detailed style feedback.

>
> >         } else if (err == READ_GITFILE_ERR_NOT_A_REPO) {
> > -               if (!(backlink = infer_backlink(realdotgit.buf))) {
> > +               if (!(backlink.buf = infer_backlink(realdotgit.buf))) {
>
> Don't do this. Never modify the internal state of strbuf directly;
> consider the state read-only. Modifications should only be made via
> the API. You'll need to rewrite this code a bit to make it work
> correctly with the changes proposed by this patch.

Good catch, I should've paid more attention in the refactoring.

Fixed all of the discussed notes in v2.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe
  2024-03-06 16:57     ` Junio C Hamano
@ 2024-03-12 20:44       ` Atneya Nair
  0 siblings, 0 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-12 20:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jean-Noël Avila, git, jeffhost, me, nasamuffin

On Wed, Mar 6, 2024 at 8:57 AM Junio C Hamano <gitster@pobox.com> wrote:

> >> +    if (!path) die(_("Unexpected null from realpath '%s'"), dir);
> >
> > In fact, this is not a null path, but an empty path (null is not part of
> > the string).
> > By the way, shouldn't this be an internal bug instead of a message to
> > the user?
>
> Unless the strbuf instance the result_buf pointer points at is
> corrupt, its .buf member should *NEVER* be NULL.  Testing for NULL
> is meaningless, unless you are manually futzing with the members of
> strbuf (you shouldn't).
>
> Thanks for carefully reading.
>

Thanks for pointing this out, I fixed this issue in v2.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-06  1:23       ` Jeff King
  2024-03-06  1:58         ` Junio C Hamano
@ 2024-03-12 22:13         ` Atneya Nair
  1 sibling, 0 replies; 17+ messages in thread
From: Atneya Nair @ 2024-03-12 22:13 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

On Tue, Mar 5, 2024 at 5:23 PM Jeff King <peff@peff.net> wrote:

> And I think that may be the case here. The "kvi" struct itself is local
> to do_config_from(). But when we load the caching configset, we do so in
> configset_add_value() which makes a shallow copy of the kvi struct. And
> then that shallow copy may live on and be accessed with further calls to
> git_config().

I missed the shallow copy of kvi in do_config_from. Unfortunately, we also
pass this back into a callback function, and although it is often not used,
changing the code to be more precise about string ownership here seems
non-trivial.

>
> though probably an actual kvi_copy() function would be less horrible.
>
> A few other things to note, looking at this code:
>
>   - isn't kvi->path in the same boat? We do not duplicate it at all, so
>     it seems like the shallow copy made in the configset could cause a
>     user-after-free.

The situation with path seems to indicate that indeed ownership is correctly
handled up the call-stack (i.e. the config_source owns the file/path strings),
and these strings are valid for the duration of the shallow copies through the
call-graph. But, the fact that filename in particular is interned may indicate
that the behavior of leaking the string to static lifetime is used by
a caller at
some point.

>     But it possibly could make sense to have the configset own a single
>     duplicate string, and then let the kvi structs it holds point to
>     that string. But IMHO all of this should be details of the configset
>     code, and the main config-iteration code should not have to worry
>     about this at all. I.e., I think kvi_from_source() should not be
>     duplicating anything in the first place.
>
> -Peff

Based on this discussion, I think the cleanest solution is a lock on
the strintern hashmap. I originally wanted to avoid adding locking, since most
call-paths are single threaded, and I thought we were only using the
string locally.
However, copying the string is more expensive and potentially intrusive to many
call-sites, and it is much more difficult to reason about the correctness of the
change, so I think adding a lock is preferable. I'll switch to that in
v2 along with the
more descriptive commit message.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-05 17:08   ` Junio C Hamano
  2024-03-05 18:53     ` Junio C Hamano
@ 2024-03-12 22:15     ` Atneya Nair
  2024-03-13 17:51       ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Atneya Nair @ 2024-03-12 22:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

On Tue, Mar 5, 2024 at 9:08 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Atneya Nair <atneya@google.com> writes:
>
> > To enable parallel update of the read cache for submodules,
> > ce_compare_gitlink must be thread safe (for different objects).
> >
> > Remove string interning in do_config_from (called from
> > repo_submodule_init) and add locking around accessing the ref_store_map.
>
> This step does two independent things, even though they may have
> dependencies, i.e., for one to be a solution for the problem it is
> tackling, the other may have to be there already.  E.g., even after
> calls to ce_compare_gitlink() get serialized via a mutex, it may for
> some reason not work without giving each kvi.filename its own copy
> [*], and if that is the case, you may need to have the "stop
> interning" step in a single patch with its own justification, and
> then "have mutex around ref_store calls" patch has to come after it.
>

What is the appropriate way to split an intermediate patch into two? Simply
adding the additional commit as 3 of 4 would cause a mismatch between patch-set
versions? Would that cause any issues?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe
  2024-03-12 22:15     ` Atneya Nair
@ 2024-03-13 17:51       ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2024-03-13 17:51 UTC (permalink / raw)
  To: Atneya Nair; +Cc: git, jeffhost, me, nasamuffin, Tanay Abhra, Glen Choo

Atneya Nair <atneya@google.com> writes:

> What is the appropriate way to split an intermediate patch into two? Simply
> adding the additional commit as 3 of 4 would cause a mismatch between patch-set
> versions? Would that cause any issues?

If you have three patch series originally and the second one is too
big, then you'd end up with four patch series whose second and third
corresponds to the second patch of the original series, and the last
patch of the updated series that is numbered 4 will correspond to
the third one from the original.  That is perfectly normal and you
can see the correspondence with help from tools like range-diff.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-03-13 17:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-05  1:21 [RFC PATCH 0/3] Parallel submodule status Atneya Nair
2024-03-05  1:21 ` [RFC PATCH 1/3] Make read_gitfile and resolve_gitfile thread safe Atneya Nair
2024-03-05  2:22   ` Junio C Hamano
2024-03-05  4:29   ` Eric Sunshine
2024-03-12 20:38     ` Atneya Nair
2024-03-06  8:13   ` Jean-Noël Avila
2024-03-06 16:57     ` Junio C Hamano
2024-03-12 20:44       ` Atneya Nair
2024-03-05  1:21 ` [RFC PATCH 2/3] Make ce_compare_gitlink thread-safe Atneya Nair
2024-03-05 17:08   ` Junio C Hamano
2024-03-05 18:53     ` Junio C Hamano
2024-03-06  1:23       ` Jeff King
2024-03-06  1:58         ` Junio C Hamano
2024-03-12 22:13         ` Atneya Nair
2024-03-12 22:15     ` Atneya Nair
2024-03-13 17:51       ` Junio C Hamano
2024-03-05  1:21 ` [RFC PATCH 3/3] Preload submodule state in refresh_index Atneya Nair

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.