All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] const-correctness in grep.c
@ 2021-09-21  3:45 Jeff King
  2021-09-21  3:46 ` [PATCH 1/5] grep: stop modifying buffer in strip_timestamp Jeff King
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:45 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

While discussing [1], I noticed that the grep code mostly takes
non-const buffers, even though it is conceptually a read-only operation
to search in them. The culprit is a handful of spots that temporarily
tie off NUL-terminated strings by overwriting a byte of the buffer and
then restoring it. But I think we no longer need to do so these days,
now that we have a regexec_buf() that can take a ptr/size pair.

The first three patches are a bit repetitive, but I broke them up
individually because they're the high-risk part. I.e., if my assumptions
about needing the NUL are wrong, it could introduce a bug. But based on
my reading of the code, plus running the test suite with ASan/UBSan, I
feel reasonably confident.

The last two are the bigger cleanups, but should obviously avoid any
behavior changes.

  [1/5]: grep: stop modifying buffer in strip_timestamp
  [2/5]: grep: stop modifying buffer in show_line()
  [3/5]: grep: stop modifying buffer in grep_source_1()
  [4/5]: grep: mark "haystack" buffers as const
  [5/5]: grep: store grep_source buffer as const

 grep.c | 87 +++++++++++++++++++++++++++++-----------------------------
 grep.h |  4 +--
 2 files changed, 45 insertions(+), 46 deletions(-)

-Peff

[1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
@ 2021-09-21  3:46 ` Jeff King
  2021-09-21  5:18   ` Carlo Arenas
  2021-09-21  3:48 ` [PATCH 2/5] grep: stop modifying buffer in show_line() Jeff King
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:46 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

When grepping for headers in commit objects, we receive individual
lines (e.g., "author Name <email> 1234 -0000"), and then strip off the
timestamp to do our match. We do so by writing a NUL byte over the
whitespace separator, and then remembering to restore it later.

We had to do it this way when this was added back in a4d7d2c6db (log
--author/--committer: really match only with name part, 2008-09-04),
because we fed the result directly to regexec(), which expects a
NUL-terminated string. But since b7d36ffca0 (regex: use regexec_buf(),
2016-09-21), we have a function which can match part of a buffer.

So instead of modifying the string, we can instead just move the "eol"
pointer, and the rest of the code will do the right thing. This will let
further patches mark more buffers as "const".

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/grep.c b/grep.c
index 79598f245f..5b1f2da4d3 100644
--- a/grep.c
+++ b/grep.c
@@ -922,20 +922,16 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 	return hit;
 }
 
-static int strip_timestamp(char *bol, char **eol_p)
+static void strip_timestamp(char *bol, char **eol_p)
 {
 	char *eol = *eol_p;
-	int ch;
 
 	while (bol < --eol) {
 		if (*eol != '>')
 			continue;
 		*eol_p = ++eol;
-		ch = *eol;
-		*eol = '\0';
-		return ch;
+		break;
 	}
-	return 0;
 }
 
 static struct {
@@ -952,7 +948,6 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 			     regmatch_t *pmatch, int eflags)
 {
 	int hit = 0;
-	int saved_ch = 0;
 	const char *start = bol;
 
 	if ((p->token != GREP_PATTERN) &&
@@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 		switch (p->field) {
 		case GREP_HEADER_AUTHOR:
 		case GREP_HEADER_COMMITTER:
-			saved_ch = strip_timestamp(bol, &eol);
+			strip_timestamp(bol, &eol);
 			break;
 		default:
 			break;
@@ -1021,8 +1016,6 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 				goto again;
 		}
 	}
-	if (p->token == GREP_PATTERN_HEAD && saved_ch)
-		*eol = saved_ch;
 	if (hit) {
 		pmatch[0].rm_so += bol - start;
 		pmatch[0].rm_eo += bol - start;
-- 
2.33.0.1023.gc687d0d3c8


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/5] grep: stop modifying buffer in show_line()
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
  2021-09-21  3:46 ` [PATCH 1/5] grep: stop modifying buffer in strip_timestamp Jeff King
@ 2021-09-21  3:48 ` Jeff King
  2021-09-21  4:22   ` Taylor Blau
  2021-09-21  3:48 ` [PATCH 3/5] grep: stop modifying buffer in grep_source_1() Jeff King
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:48 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

When showing lines via grep (or looking for funcnames), we call
show_line() on a multi-line buffer. It finds the end of line and marks
it with a NUL. However, we don't need to do so, as the resulting line is
only used along with its "eol" marker:

 - we pass both to next_match(), which takes care to look at only the
   bytes we specified

 - we pass the line to output_color() without its matching eol marker.
   However, we do use the "match" struct we got from next_match() to
   tell it how many bytes to look at (which can never exceed the string
   we passed it).

So we can stop setting and restoring this NUL marker. That makes the
code simpler, and will allow us to take a const buffer in a future
patch.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/grep.c b/grep.c
index 5b1f2da4d3..70af01d1c1 100644
--- a/grep.c
+++ b/grep.c
@@ -1239,7 +1239,6 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
 	if (opt->color || opt->only_matching) {
 		regmatch_t match;
 		enum grep_context ctx = GREP_CONTEXT_BODY;
-		int ch = *eol;
 		int eflags = 0;
 
 		if (opt->color) {
@@ -1254,7 +1253,6 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
 			else if (sign == '=')
 				line_color = opt->colors[GREP_COLOR_FUNCTION];
 		}
-		*eol = '\0';
 		while (next_match(opt, bol, eol, ctx, &match, eflags)) {
 			if (match.rm_so == match.rm_eo)
 				break;
@@ -1272,7 +1270,6 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
 			rest -= match.rm_eo;
 			eflags = REG_NOTBOL;
 		}
-		*eol = ch;
 	}
 	if (!opt->only_matching) {
 		output_color(opt, bol, rest, line_color);
-- 
2.33.0.1023.gc687d0d3c8


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/5] grep: stop modifying buffer in grep_source_1()
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
  2021-09-21  3:46 ` [PATCH 1/5] grep: stop modifying buffer in strip_timestamp Jeff King
  2021-09-21  3:48 ` [PATCH 2/5] grep: stop modifying buffer in show_line() Jeff King
@ 2021-09-21  3:48 ` Jeff King
  2021-09-21  3:49 ` [PATCH 4/5] grep: mark "haystack" buffers as const Jeff King
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:48 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

We find the end of each matching line of a buffer, and then temporarily
write a NUL to turn it into a regular C string. But we don't need to do
so, because the only thing we do in the interim is pass the line and its
length (via an "eol" pointer) to match_line(). And that function should
only look at the bytes we passed it, whether it has a terminating NUL or
not.

We can drop this temporary write in order to simplify the code and make
it easier to use const buffers in more of grep.c.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 70af01d1c1..32c4641443 100644
--- a/grep.c
+++ b/grep.c
@@ -1616,7 +1616,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 	bol = gs->buf;
 	left = gs->size;
 	while (left) {
-		char *eol, ch;
+		char *eol;
 		int hit;
 		ssize_t cno;
 		ssize_t col = -1, icol = -1;
@@ -1637,14 +1637,11 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 		    && look_ahead(opt, &left, &lno, &bol))
 			break;
 		eol = end_of_line(bol, &left);
-		ch = *eol;
-		*eol = 0;
 
 		if ((ctx == GREP_CONTEXT_HEAD) && (eol == bol))
 			ctx = GREP_CONTEXT_BODY;
 
 		hit = match_line(opt, bol, eol, &col, &icol, ctx, collect_hits);
-		*eol = ch;
 
 		if (collect_hits)
 			goto next_line;
-- 
2.33.0.1023.gc687d0d3c8


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/5] grep: mark "haystack" buffers as const
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (2 preceding siblings ...)
  2021-09-21  3:48 ` [PATCH 3/5] grep: stop modifying buffer in grep_source_1() Jeff King
@ 2021-09-21  3:49 ` Jeff King
  2021-09-21 12:04   ` Ævar Arnfjörð Bjarmason
  2021-09-21  3:51 ` [PATCH 5/5] grep: store grep_source buffer " Jeff King
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:49 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

When we're grepping in a buffer, we don't need to modify it. So we can
take "const char *" buffers, rather than "char *". This can avoid some
awkward casts in our callers, and make our expectations more clear (we
will not take ownership of the memory, nor will we ever write to it).

These spots don't all necessarily have to be converted at the same time,
but some of them are dependent on others, because we pass
pointers-to-pointers in a few cases. And none of this should change any
behavior, since we're just adding "const" qualifiers (and likewise, the
compiler will let us know if we missed any spots). So it's relatively
low-risk to just do this all at once.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 61 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/grep.c b/grep.c
index 32c4641443..9603ac119d 100644
--- a/grep.c
+++ b/grep.c
@@ -867,7 +867,7 @@ void free_grep_patterns(struct grep_opt *opt)
 	free_pattern_expr(opt->pattern_expression);
 }
 
-static char *end_of_line(char *cp, unsigned long *left)
+static const char *end_of_line(const char *cp, unsigned long *left)
 {
 	unsigned long l = *left;
 	while (l && *cp != '\n') {
@@ -908,7 +908,8 @@ static void show_name(struct grep_opt *opt, const char *name)
 	opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int patmatch(struct grep_pat *p, char *line, char *eol,
+static int patmatch(struct grep_pat *p,
+		    const char *line, const char *eol,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
@@ -922,9 +923,9 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 	return hit;
 }
 
-static void strip_timestamp(char *bol, char **eol_p)
+static void strip_timestamp(const char *bol, const char **eol_p)
 {
-	char *eol = *eol_p;
+	const char *eol = *eol_p;
 
 	while (bol < --eol) {
 		if (*eol != '>')
@@ -943,7 +944,8 @@ static struct {
 	{ "reflog ", 7 },
 };
 
-static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
+static int match_one_pattern(struct grep_pat *p,
+			     const char *bol, const char *eol,
 			     enum grep_context ctx,
 			     regmatch_t *pmatch, int eflags)
 {
@@ -1023,8 +1025,9 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 	return hit;
 }
 
-static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x, char *bol,
-			   char *eol, enum grep_context ctx, ssize_t *col,
+static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x,
+			   const char *bol, const char *eol,
+			   enum grep_context ctx, ssize_t *col,
 			   ssize_t *icol, int collect_hits)
 {
 	int h = 0;
@@ -1091,15 +1094,17 @@ static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x, char *bol,
 	return h;
 }
 
-static int match_expr(struct grep_opt *opt, char *bol, char *eol,
+static int match_expr(struct grep_opt *opt,
+		      const char *bol, const char *eol,
 		      enum grep_context ctx, ssize_t *col,
 		      ssize_t *icol, int collect_hits)
 {
 	struct grep_expr *x = opt->pattern_expression;
 	return match_expr_eval(opt, x, bol, eol, ctx, col, icol, collect_hits);
 }
 
-static int match_line(struct grep_opt *opt, char *bol, char *eol,
+static int match_line(struct grep_opt *opt,
+		      const char *bol, const char *eol,
 		      ssize_t *col, ssize_t *icol,
 		      enum grep_context ctx, int collect_hits)
 {
@@ -1131,7 +1136,8 @@ static int match_line(struct grep_opt *opt, char *bol, char *eol,
 	return hit;
 }
 
-static int match_next_pattern(struct grep_pat *p, char *bol, char *eol,
+static int match_next_pattern(struct grep_pat *p,
+			      const char *bol, const char *eol,
 			      enum grep_context ctx,
 			      regmatch_t *pmatch, int eflags)
 {
@@ -1152,7 +1158,8 @@ static int match_next_pattern(struct grep_pat *p, char *bol, char *eol,
 	return 1;
 }
 
-static int next_match(struct grep_opt *opt, char *bol, char *eol,
+static int next_match(struct grep_opt *opt,
+		      const char *bol, const char *eol,
 		      enum grep_context ctx, regmatch_t *pmatch, int eflags)
 {
 	struct grep_pat *p;
@@ -1208,7 +1215,8 @@ static void show_line_header(struct grep_opt *opt, const char *name,
 	}
 }
 
-static void show_line(struct grep_opt *opt, char *bol, char *eol,
+static void show_line(struct grep_opt *opt,
+		      const char *bol, const char *eol,
 		      const char *name, unsigned lno, ssize_t cno, char sign)
 {
 	int rest = eol - bol;
@@ -1297,7 +1305,8 @@ static inline void grep_attr_unlock(void)
 		pthread_mutex_unlock(&grep_attr_mutex);
 }
 
-static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bol, char *eol)
+static int match_funcname(struct grep_opt *opt, struct grep_source *gs,
+			  const char *bol, const char *eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
 	if (xecfg && !xecfg->find_func) {
@@ -1324,10 +1333,10 @@ static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bo
 }
 
 static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
-			       char *bol, unsigned lno)
+			       const char *bol, unsigned lno)
 {
 	while (bol > gs->buf) {
-		char *eol = --bol;
+		const char *eol = --bol;
 
 		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
@@ -1346,7 +1355,7 @@ static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
 static int is_empty_line(const char *bol, const char *eol);
 
 static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
-			     char *bol, char *end, unsigned lno)
+			     const char *bol, const char *end, unsigned lno)
 {
 	unsigned cur = lno, from = 1, funcname_lno = 0, orig_from;
 	int funcname_needed = !!opt->funcname, comment_needed = 0;
@@ -1366,8 +1375,8 @@ static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
 
 	/* Rewind. */
 	while (bol > gs->buf && cur > from) {
-		char *next_bol = bol;
-		char *eol = --bol;
+		const char *next_bol = bol;
+		const char *eol = --bol;
 
 		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
@@ -1398,7 +1407,7 @@ static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
 
 	/* Back forward. */
 	while (cur < lno) {
-		char *eol = bol, sign = (cur == funcname_lno) ? '=' : '-';
+		const char *eol = bol, sign = (cur == funcname_lno) ? '=' : '-';
 
 		while (*eol != '\n')
 			eol++;
@@ -1426,12 +1435,12 @@ static int should_lookahead(struct grep_opt *opt)
 static int look_ahead(struct grep_opt *opt,
 		      unsigned long *left_p,
 		      unsigned *lno_p,
-		      char **bol_p)
+		      const char **bol_p)
 {
 	unsigned lno = *lno_p;
-	char *bol = *bol_p;
+	const char *bol = *bol_p;
 	struct grep_pat *p;
-	char *sp, *last_bol;
+	const char *sp, *last_bol;
 	regoff_t earliest = -1;
 
 	for (p = opt->pattern_list; p; p = p->next) {
@@ -1533,8 +1542,8 @@ static int is_empty_line(const char *bol, const char *eol)
 
 static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits)
 {
-	char *bol;
-	char *peek_bol = NULL;
+	const char *bol;
+	const char *peek_bol = NULL;
 	unsigned long left;
 	unsigned lno = 1;
 	unsigned last_hit = 0;
@@ -1616,7 +1625,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 	bol = gs->buf;
 	left = gs->size;
 	while (left) {
-		char *eol;
+		const char *eol;
 		int hit;
 		ssize_t cno;
 		ssize_t col = -1, icol = -1;
@@ -1700,7 +1709,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 		}
 		if (show_function && (!peek_bol || peek_bol < bol)) {
 			unsigned long peek_left = left;
-			char *peek_eol = eol;
+			const char *peek_eol = eol;
 
 			/*
 			 * Trailing empty lines are not interesting.
-- 
2.33.0.1023.gc687d0d3c8


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/5] grep: store grep_source buffer as const
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (3 preceding siblings ...)
  2021-09-21  3:49 ` [PATCH 4/5] grep: mark "haystack" buffers as const Jeff King
@ 2021-09-21  3:51 ` Jeff King
  2021-09-21  4:30 ` [PATCH 0/5] const-correctness in grep.c Taylor Blau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21  3:51 UTC (permalink / raw)
  To: git; +Cc: Hamza Mahfooz

Our grep_buffer() function takes a non-const buffer, which is confusing:
we don't take ownership of nor write to the buffer.

This mostly comes from the fact that the underlying grep_source struct
in which we store the buffer uses non-const pointer. The memory pointed
to by the struct is sometimes owned by us (for FILE or OID sources), and
sometimes not (for BUF sources).

Let's store it as const, which lets us err on the side of caution (i.e.,
the compiler will warn us if any of our code writes to or tries to free
it).

As a result, we must annotate the one place where we do free it by
casting away the constness. But that's a small price to pay for the
extra safety and clarity elsewhere (and indeed, it already had a comment
explaining why GREP_SOURCE_BUF _didn't_ free it).

And then we can mark grep_buffer() as taking a const buffer.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 9 ++++++---
 grep.h | 4 ++--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/grep.c b/grep.c
index 9603ac119d..14fe8a0fd2 100644
--- a/grep.c
+++ b/grep.c
@@ -1820,9 +1820,10 @@ int grep_source(struct grep_opt *opt, struct grep_source *gs)
 
 	return grep_source_1(opt, gs, 0);
 }
 
-static void grep_source_init_buf(struct grep_source *gs, char *buf,
+static void grep_source_init_buf(struct grep_source *gs,
+				 const char *buf,
 				 unsigned long size)
 {
 	gs->type = GREP_SOURCE_BUF;
 	gs->name = NULL;
@@ -1832,9 +1833,9 @@ static void grep_source_init_buf(struct grep_source *gs, char *buf,
 	gs->driver = NULL;
 	gs->identifier = NULL;
 }
 
-int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
+int grep_buffer(struct grep_opt *opt, const char *buf, unsigned long size)
 {
 	struct grep_source gs;
 	int r;
 
@@ -1884,9 +1885,11 @@ void grep_source_clear_data(struct grep_source *gs)
 {
 	switch (gs->type) {
 	case GREP_SOURCE_FILE:
 	case GREP_SOURCE_OID:
-		FREE_AND_NULL(gs->buf);
+		/* these types own the buffer */
+		free((char *)gs->buf);
+		gs->buf = NULL;
 		gs->size = 0;
 		break;
 	case GREP_SOURCE_BUF:
 		/* leave user-provided buf intact */
diff --git a/grep.h b/grep.h
index 128007db65..3cb8a83ae8 100644
--- a/grep.h
+++ b/grep.h
@@ -188,9 +188,9 @@ void append_grep_pat(struct grep_opt *opt, const char *pat, size_t patlen, const
 void append_grep_pattern(struct grep_opt *opt, const char *pat, const char *origin, int no, enum grep_pat_token t);
 void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
 void compile_grep_patterns(struct grep_opt *opt);
 void free_grep_patterns(struct grep_opt *opt);
-int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size);
+int grep_buffer(struct grep_opt *opt, const char *buf, unsigned long size);
 
 struct grep_source {
 	char *name;
 
@@ -201,9 +201,9 @@ struct grep_source {
 	} type;
 	void *identifier;
 	struct repository *repo; /* if GREP_SOURCE_OID */
 
-	char *buf;
+	const char *buf;
 	unsigned long size;
 
 	char *path; /* for attribute lookups */
 	struct userdiff_driver *driver;
-- 
2.33.0.1023.gc687d0d3c8

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] grep: stop modifying buffer in show_line()
  2021-09-21  3:48 ` [PATCH 2/5] grep: stop modifying buffer in show_line() Jeff King
@ 2021-09-21  4:22   ` Taylor Blau
  2021-09-21  4:42     ` Jeff King
  0 siblings, 1 reply; 32+ messages in thread
From: Taylor Blau @ 2021-09-21  4:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz

On Mon, Sep 20, 2021 at 11:48:09PM -0400, Jeff King wrote:
> When showing lines via grep (or looking for funcnames), we call
> show_line() on a multi-line buffer. It finds the end of line and marks
> it with a NUL. However, we don't need to do so, as the resulting line is
> only used along with its "eol" marker:
>
>  - we pass both to next_match(), which takes care to look at only the
>    bytes we specified

Thinking aloud, next_match() calls match_next_pattern() which takes eol
as non-const and passes it to match_one_pattern(). And that calls
strip_timestamp(), which would be non-const, were it not the previous
patch. So I think this conversion is safe.

>  - we pass the line to output_color() without its matching eol marker.
>    However, we do use the "match" struct we got from next_match() to
>    tell it how many bytes to look at (which can never exceed the string
>    we passed it

Yep, makes sense. The patch looks good and matches your description
here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/5] const-correctness in grep.c
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (4 preceding siblings ...)
  2021-09-21  3:51 ` [PATCH 5/5] grep: store grep_source buffer " Jeff King
@ 2021-09-21  4:30 ` Taylor Blau
  2021-09-21 12:07 ` Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Taylor Blau @ 2021-09-21  4:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz

On Mon, Sep 20, 2021 at 11:45:42PM -0400, Jeff King wrote:
> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

This all looks very reasonable to me. I appreciated the way you broke up
each spot that unnecessarily modified a buffer into its own patch with
its own explanation.

I looked through each of the three spots you mentioned with a close eye
and concurred with your reasoning. (To the extent it was possible, I
tried to ignore most of your commentary until I had generated my own
understanding while searching through my copy of grep.c).

Thanks for an enjoyable set of patches to read.

    Reviewed-by: Taylor Blau <me@ttaylorr.com>

> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] grep: stop modifying buffer in show_line()
  2021-09-21  4:22   ` Taylor Blau
@ 2021-09-21  4:42     ` Jeff King
  2021-09-21  4:45       ` Taylor Blau
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21  4:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Hamza Mahfooz

On Tue, Sep 21, 2021 at 12:22:45AM -0400, Taylor Blau wrote:

> On Mon, Sep 20, 2021 at 11:48:09PM -0400, Jeff King wrote:
> > When showing lines via grep (or looking for funcnames), we call
> > show_line() on a multi-line buffer. It finds the end of line and marks
> > it with a NUL. However, we don't need to do so, as the resulting line is
> > only used along with its "eol" marker:
> >
> >  - we pass both to next_match(), which takes care to look at only the
> >    bytes we specified
> 
> Thinking aloud, next_match() calls match_next_pattern() which takes eol
> as non-const and passes it to match_one_pattern(). And that calls
> strip_timestamp(), which would be non-const, were it not the previous
> patch. So I think this conversion is safe.

To be a little nit-picky: this move would be OK even without the change
to strip_timestamp(). The question is whether any of those sub-calls
actually looks past the "eol" pointer we give it.

I didn't dig all the way down through the complete call-stack in an
exhaustive way. But after having looked at the related functions when
changing their const-ness elsewhere in the series, they'd have to be
ignoring the "eol" parameter for it to be a problem. The irony, of
course, is that they did ignore this parameter at one point! Because
they'd all eventually end in regexec(), which would happily read past
our "eol".

So really, all of this is predicated on those sub-functions behaving
sensibly. I think they do. But if we found that they didn't, I'd much
rather know that and fix them than live with this "this NUL may or may
not be important" state forever.

> >  - we pass the line to output_color() without its matching eol marker.
> >    However, we do use the "match" struct we got from next_match() to
> >    tell it how many bytes to look at (which can never exceed the string
> >    we passed it
> 
> Yep, makes sense. The patch looks good and matches your description
> here.

Thanks for looking it over (my nitpick aside). :)

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/5] grep: stop modifying buffer in show_line()
  2021-09-21  4:42     ` Jeff King
@ 2021-09-21  4:45       ` Taylor Blau
  0 siblings, 0 replies; 32+ messages in thread
From: Taylor Blau @ 2021-09-21  4:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, Hamza Mahfooz

On Tue, Sep 21, 2021 at 12:42:05AM -0400, Jeff King wrote:
> On Tue, Sep 21, 2021 at 12:22:45AM -0400, Taylor Blau wrote:
>
> > On Mon, Sep 20, 2021 at 11:48:09PM -0400, Jeff King wrote:
> > > When showing lines via grep (or looking for funcnames), we call
> > > show_line() on a multi-line buffer. It finds the end of line and marks
> > > it with a NUL. However, we don't need to do so, as the resulting line is
> > > only used along with its "eol" marker:
> > >
> > >  - we pass both to next_match(), which takes care to look at only the
> > >    bytes we specified
> >
> > Thinking aloud, next_match() calls match_next_pattern() which takes eol
> > as non-const and passes it to match_one_pattern(). And that calls
> > strip_timestamp(), which would be non-const, were it not the previous
> > patch. So I think this conversion is safe.
>
> To be a little nit-picky: this move would be OK even without the change
> to strip_timestamp(). The question is whether any of those sub-calls
> actually looks past the "eol" pointer we give it.

Right, I wasn't implying that this move was unsafe without the previous
patch. Just that without the previous patch, we couldn't make
show_line() take a const-pointer to eol.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  3:46 ` [PATCH 1/5] grep: stop modifying buffer in strip_timestamp Jeff King
@ 2021-09-21  5:18   ` Carlo Arenas
  2021-09-21  5:24     ` Eric Sunshine
  0 siblings, 1 reply; 32+ messages in thread
From: Carlo Arenas @ 2021-09-21  5:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz

On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>                 switch (p->field) {
>                 case GREP_HEADER_AUTHOR:
>                 case GREP_HEADER_COMMITTER:
> -                       saved_ch = strip_timestamp(bol, &eol);
> +                       strip_timestamp(bol, &eol);

Why not something like (plus added error handling, even if it seems
the original didn't have them)?

  eol = strrchr(bol, '>');

Carlo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  5:18   ` Carlo Arenas
@ 2021-09-21  5:24     ` Eric Sunshine
  2021-09-21  5:40       ` Carlo Arenas
  2021-09-21  5:43       ` Jeff King
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Sunshine @ 2021-09-21  5:24 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Jeff King, Git List, Hamza Mahfooz

On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >                 switch (p->field) {
> >                 case GREP_HEADER_AUTHOR:
> >                 case GREP_HEADER_COMMITTER:
> > -                       saved_ch = strip_timestamp(bol, &eol);
> > +                       strip_timestamp(bol, &eol);
>
> Why not something like (plus added error handling, even if it seems
> the original didn't have them)?
>
>   eol = strrchr(bol, '>');

strrchr() would search backward from the NUL, not from `eol`, thus
would not be a faithful conversion (and might not be safe, though I
didn't dig through all the callers).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  5:24     ` Eric Sunshine
@ 2021-09-21  5:40       ` Carlo Arenas
  2021-09-21  5:43       ` Jeff King
  1 sibling, 0 replies; 32+ messages in thread
From: Carlo Arenas @ 2021-09-21  5:40 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jeff King, Git List, Hamza Mahfooz

On Mon, Sep 20, 2021 at 10:24 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> > On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> > >                 switch (p->field) {
> > >                 case GREP_HEADER_AUTHOR:
> > >                 case GREP_HEADER_COMMITTER:
> > > -                       saved_ch = strip_timestamp(bol, &eol);
> > > +                       strip_timestamp(bol, &eol);
> >
> > Why not something like (plus added error handling, even if it seems
> > the original didn't have them)?
> >
> >   eol = strrchr(bol, '>');
>
> strrchr() would search backward from the NUL, not from `eol`, thus
> would not be a faithful conversion (and might not be safe, though I
> didn't dig through all the callers).

of course; I meant memrchr, which I guess will need to have also a
compat version[1]
this is the only caller for strip_timestamp()

[1] https://www.gnu.org/software/gnulib/manual/html_node/memrchr.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  5:24     ` Eric Sunshine
  2021-09-21  5:40       ` Carlo Arenas
@ 2021-09-21  5:43       ` Jeff King
  2021-09-21  6:42         ` Carlo Marcelo Arenas Belón
  1 sibling, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21  5:43 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Carlo Arenas, Git List, Hamza Mahfooz

On Tue, Sep 21, 2021 at 01:24:41AM -0400, Eric Sunshine wrote:

> On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> > On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> > >                 switch (p->field) {
> > >                 case GREP_HEADER_AUTHOR:
> > >                 case GREP_HEADER_COMMITTER:
> > > -                       saved_ch = strip_timestamp(bol, &eol);
> > > +                       strip_timestamp(bol, &eol);
> >
> > Why not something like (plus added error handling, even if it seems
> > the original didn't have them)?
> >
> >   eol = strrchr(bol, '>');
> 
> strrchr() would search backward from the NUL, not from `eol`, thus
> would not be a faithful conversion (and might not be safe, though I
> didn't dig through all the callers).

Right. The point is that we should be respecting "eol" here in the first
place. Plus the final result is the character _after_ the '>'. Plus when
it returns NULL, you'd want to leave "eol" untouched (stripping
nothing).

So yeah, I'm sure it could be rewritten around memrchr() or something,
but I doubt it would be much shorter, and the chance of introducing an
off-by-one seems non-trivial. :)

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  5:43       ` Jeff King
@ 2021-09-21  6:42         ` Carlo Marcelo Arenas Belón
  2021-09-21  7:37           ` René Scharfe
  0 siblings, 1 reply; 32+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2021-09-21  6:42 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, Git List, Hamza Mahfooz

On Tue, Sep 21, 2021 at 01:43:05AM -0400, Jeff King wrote:
> 
> So yeah, I'm sure it could be rewritten around memrchr() or something,
> but I doubt it would be much shorter, and the chance of introducing an
> off-by-one seems non-trivial. :)

Considering I am writing it, it is most likely warranted ;)

but it doesn't look that bad IMHO

Carlo

PS. I tested it in macOS with the compatibility layer that will be needed
------ 8> -------
Subject: [PATCH] grep: retire strip_timestamp()

After recent changes, the name is no longer valid, as the function
doesn't strip anything.

Having the code in the main function also helps with readability

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/grep.c b/grep.c
index 5b1f2da4d3..56fd86a7d8 100644
--- a/grep.c
+++ b/grep.c
@@ -922,18 +922,6 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 	return hit;
 }
 
-static void strip_timestamp(char *bol, char **eol_p)
-{
-	char *eol = *eol_p;
-
-	while (bol < --eol) {
-		if (*eol != '>')
-			continue;
-		*eol_p = ++eol;
-		break;
-	}
-}
-
 static struct {
 	const char *field;
 	size_t len;
@@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 		bol += len;
 		switch (p->field) {
 		case GREP_HEADER_AUTHOR:
-		case GREP_HEADER_COMMITTER:
-			strip_timestamp(bol, &eol);
+		case GREP_HEADER_COMMITTER: {
+			char *em = memrchr(bol, '>', eol - bol);
+			if (em)
+				eol = em + 1;
 			break;
+		}
 		default:
 			break;
 		}
-- 
2.33.0.911.gbe391d4e11


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  6:42         ` Carlo Marcelo Arenas Belón
@ 2021-09-21  7:37           ` René Scharfe
  2021-09-21 14:24             ` Jeff King
  0 siblings, 1 reply; 32+ messages in thread
From: René Scharfe @ 2021-09-21  7:37 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón, Jeff King
  Cc: Eric Sunshine, Git List, Hamza Mahfooz

Am 21.09.21 um 08:42 schrieb Carlo Marcelo Arenas Belón:
> On Tue, Sep 21, 2021 at 01:43:05AM -0400, Jeff King wrote:
>>
>> So yeah, I'm sure it could be rewritten around memrchr() or something,
>> but I doubt it would be much shorter, and the chance of introducing an
>> off-by-one seems non-trivial. :)
>
> Considering I am writing it, it is most likely warranted ;)
>
> but it doesn't look that bad IMHO
>
> Carlo
>
> PS. I tested it in macOS with the compatibility layer that will be needed

Right; memrchr is a GNU extension.  We'd need a compat/ implementation to
be able to use it.

> ------ 8> -------
> Subject: [PATCH] grep: retire strip_timestamp()
>
> After recent changes, the name is no longer valid, as the function
> doesn't strip anything.

It still does; the input string slice between bol and eol contains a
trailing timestamp and the output slice doesn't.

>
> Having the code in the main function also helps with readability
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  grep.c | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index 5b1f2da4d3..56fd86a7d8 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -922,18 +922,6 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
>  	return hit;
>  }
>
> -static void strip_timestamp(char *bol, char **eol_p)
> -{
> -	char *eol = *eol_p;
> -
> -	while (bol < --eol) {
> -		if (*eol != '>')
> -			continue;
> -		*eol_p = ++eol;
> -		break;
> -	}
> -}
> -
>  static struct {
>  	const char *field;
>  	size_t len;
> @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>  		bol += len;
>  		switch (p->field) {
>  		case GREP_HEADER_AUTHOR:
> -		case GREP_HEADER_COMMITTER:
> -			strip_timestamp(bol, &eol);
> +		case GREP_HEADER_COMMITTER: {
> +			char *em = memrchr(bol, '>', eol - bol);
> +			if (em)
> +				eol = em + 1;

The old code documents the intent via the function name.  The new one
goes into the nitty-gritty without further explanation, which I find
harder to read.

>  			break;
> +		}
>  		default:
>  			break;
>  		}
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] grep: mark "haystack" buffers as const
  2021-09-21  3:49 ` [PATCH 4/5] grep: mark "haystack" buffers as const Jeff King
@ 2021-09-21 12:04   ` Ævar Arnfjörð Bjarmason
  2021-09-21 14:27     ` Jeff King
  0 siblings, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-21 12:04 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz


On Mon, Sep 20 2021, Jeff King wrote:

> -static int patmatch(struct grep_pat *p, char *line, char *eol,
> +static int patmatch(struct grep_pat *p,
> +		    const char *line, const char *eol,
>  		    regmatch_t *match, int eflags)

nit; not worth a re-roll, over-line-wrapping here? You usually stay at
column 79. In this case both line and eol would fit on the patmatch()
lin, which is being changed anyway...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/5] const-correctness in grep.c
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (5 preceding siblings ...)
  2021-09-21  4:30 ` [PATCH 0/5] const-correctness in grep.c Taylor Blau
@ 2021-09-21 12:07 ` Ævar Arnfjörð Bjarmason
  2021-09-21 14:49   ` Jeff King
  2021-09-21 12:45 ` [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const" Ævar Arnfjörð Bjarmason
  2021-09-22 18:57 ` [PATCH 0/5] const-correctness in grep.c Junio C Hamano
  8 siblings, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-21 12:07 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz


On Mon, Sep 20 2021, Jeff King wrote:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.
>
> The first three patches are a bit repetitive, but I broke them up
> individually because they're the high-risk part. I.e., if my assumptions
> about needing the NUL are wrong, it could introduce a bug. But based on
> my reading of the code, plus running the test suite with ASan/UBSan, I
> feel reasonably confident.
>
> The last two are the bigger cleanups, but should obviously avoid any
> behavior changes.
>
>   [1/5]: grep: stop modifying buffer in strip_timestamp
>   [2/5]: grep: stop modifying buffer in show_line()
>   [3/5]: grep: stop modifying buffer in grep_source_1()
>   [4/5]: grep: mark "haystack" buffers as const
>   [5/5]: grep: store grep_source buffer as const
>
>  grep.c | 87 +++++++++++++++++++++++++++++-----------------------------
>  grep.h |  4 +--
>  2 files changed, 45 insertions(+), 46 deletions(-)
>
> -Peff
>
> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

This whole thing looks good to me. I only found a small whitespace nit
in one of the patches. Did you consider following-up by having this code
take const char*/const size_t pairs. E.g. starting with something like
the below.

When this API is called it's called like that, and the regex functions
at the bottom expect that, but we have all the bol/eol twiddling in the
middle, which is often confusing because some functions pass the
pointers along as-is, and some modify them. So not for now, but I think
rolling with what I started here below would make sense for this file
eventually:

diff --git a/grep.c b/grep.c
index 14fe8a0fd23..f55ec5c0e09 100644
--- a/grep.c
+++ b/grep.c
@@ -436,7 +436,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 }
 
-static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
+static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
 		regmatch_t *match, int eflags)
 {
 	int ret, flags = 0;
@@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 
 	if (p->pcre2_jit_on)
 		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
-				      eol - line, 0, flags, p->pcre2_match_data,
+				      len, 0, flags, p->pcre2_match_data,
 				      NULL);
 	else
 		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
-				  eol - line, 0, flags, p->pcre2_match_data,
+				  len, 0, flags, p->pcre2_match_data,
 				  NULL);
 
 	if (ret < 0 && ret != PCRE2_ERROR_NOMATCH) {
@@ -909,15 +909,15 @@ static void show_name(struct grep_opt *opt, const char *name)
 }
 
 static int patmatch(struct grep_pat *p,
-		    const char *line, const char *eol,
+		    const char *line, const size_t len,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
 
 	if (p->pcre2_pattern)
-		hit = !pcre2match(p, line, eol, match, eflags);
+		hit = !pcre2match(p, line, len, match, eflags);
 	else
-		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
+		hit = !regexec_buf(&p->regexp, line, len, 1, match,
 				   eflags);
 
 	return hit;
@@ -976,7 +976,7 @@ static int match_one_pattern(struct grep_pat *p,
 	}
 
  again:
-	hit = patmatch(p, bol, eol, pmatch, eflags);
+	hit = patmatch(p, bol, eol - bol, pmatch, eflags);
 
 	if (hit && p->word_regexp) {
 		if ((pmatch[0].rm_so < 0) ||
@@ -1447,7 +1447,7 @@ static int look_ahead(struct grep_opt *opt,
 		int hit;
 		regmatch_t m;
 
-		hit = patmatch(p, bol, bol + *left_p, &m, 0);
+		hit = patmatch(p, bol, *left_p, &m, 0);
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)
 			continue;
 		if (earliest < 0 || m.rm_so < earliest)



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (6 preceding siblings ...)
  2021-09-21 12:07 ` Ævar Arnfjörð Bjarmason
@ 2021-09-21 12:45 ` Ævar Arnfjörð Bjarmason
  2021-09-21 14:53   ` Jeff King
  2021-09-22 18:57 ` [PATCH 0/5] const-correctness in grep.c Junio C Hamano
  8 siblings, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-21 12:45 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Hamza Mahfooz, Taylor Blau,
	Ævar Arnfjörð Bjarmason

The control flow of how "bol" and "eol" are passed through the various
functions in grep.c can be hard to follow, because in some functions
we want e.g. an "eol pointer to const char", but in others we can do
with a "const pointer to const char", likewise for the "const char **"
case.

I think that it would be good to eventually change this code to mostly
take a "const char *const"/"const size_t" pair, as that's what both
regexec_buf() and the equivalent PCRE function consume[1], now we
convert to length with "eol - bol" in several places.

For any such future conversion, and for reading the code, it'll be
much easier if we're at a starting point of knowing what pointers we
modify where, so let's have the compiler help us with that.

This change was made by converting these to the strictest possible
"const" forms and seeing where we had errors, note that the lone
"const char ** const" can only be that strict, it can't be a "const
char * const * const".

This change doesn't matter for the compiler's optimization, both gcc
and clang generate practically (only address differences) the same
code under both -O0 and -O3.

1. https://lore.kernel.org/git/87czp29l2c.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

A suggested patch either on top of Jeff King's
<YUlVZk1xXulAqdef@coredump.intra.peff.net> or even for squashing into
his series.

I think that generally git's codebase could use going beyond just
"const char *" when a "const char * const" would suffice, for some
reason we seem to mostly use it for the static usage variables.

In this particular case I think it makes the flow of grep.c much
easier to reason about. You can immediately see which functions are
twiddling the bol/eof pointers and which aren't.

 grep.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/grep.c b/grep.c
index 14fe8a0fd23..4e266769931 100644
--- a/grep.c
+++ b/grep.c
@@ -206,7 +206,7 @@ void grep_commit_pattern_type(enum grep_pattern_type pattern_type, struct grep_o
 		grep_set_pattern_type_option(GREP_PATTERN_TYPE_ERE, opt);
 }
 
-static struct grep_pat *create_grep_pat(const char *pat, size_t patlen,
+static struct grep_pat *create_grep_pat(const char *const pat, size_t patlen,
 					const char *origin, int no,
 					enum grep_pat_token t,
 					enum grep_header_field field)
@@ -436,8 +436,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 }
 
-static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
-		regmatch_t *match, int eflags)
+static int pcre2match(struct grep_pat *p, const char * const line,
+		      const char * const eol, regmatch_t *match, int eflags)
 {
 	int ret, flags = 0;
 	PCRE2_SIZE *ovector;
@@ -489,8 +489,9 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	die("cannot use Perl-compatible regexes when not compiled with USE_LIBPCRE");
 }
 
-static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
-		regmatch_t *match, int eflags)
+static int pcre2match(struct grep_pat *p,
+		      const char * const line, const char * const eol,
+		      regmatch_t *match, int eflags)
 {
 	return 1;
 }
@@ -909,7 +910,7 @@ static void show_name(struct grep_opt *opt, const char *name)
 }
 
 static int patmatch(struct grep_pat *p,
-		    const char *line, const char *eol,
+		    const char * const line, const char * const eol,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
@@ -923,7 +924,7 @@ static int patmatch(struct grep_pat *p,
 	return hit;
 }
 
-static void strip_timestamp(const char *bol, const char **eol_p)
+static void strip_timestamp(const char * const bol, const char ** const eol_p)
 {
 	const char *eol = *eol_p;
 
@@ -1026,7 +1027,7 @@ static int match_one_pattern(struct grep_pat *p,
 }
 
 static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x,
-			   const char *bol, const char *eol,
+			   const char *bol, const char * const eol,
 			   enum grep_context ctx, ssize_t *col,
 			   ssize_t *icol, int collect_hits)
 {
@@ -1095,7 +1096,7 @@ static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x,
 }
 
 static int match_expr(struct grep_opt *opt,
-		      const char *bol, const char *eol,
+		      const char *bol, const char * const eol,
 		      enum grep_context ctx, ssize_t *col,
 		      ssize_t *icol, int collect_hits)
 {
@@ -1104,7 +1105,7 @@ static int match_expr(struct grep_opt *opt,
 }
 
 static int match_line(struct grep_opt *opt,
-		      const char *bol, const char *eol,
+		      const char *bol, const char * const eol,
 		      ssize_t *col, ssize_t *icol,
 		      enum grep_context ctx, int collect_hits)
 {
@@ -1137,7 +1138,7 @@ static int match_line(struct grep_opt *opt,
 }
 
 static int match_next_pattern(struct grep_pat *p,
-			      const char *bol, const char *eol,
+			      const char * const bol, const char * const eol,
 			      enum grep_context ctx,
 			      regmatch_t *pmatch, int eflags)
 {
@@ -1159,7 +1160,7 @@ static int match_next_pattern(struct grep_pat *p,
 }
 
 static int next_match(struct grep_opt *opt,
-		      const char *bol, const char *eol,
+		      const char * const bol, const char * const eol,
 		      enum grep_context ctx, regmatch_t *pmatch, int eflags)
 {
 	struct grep_pat *p;
@@ -1216,7 +1217,7 @@ static void show_line_header(struct grep_opt *opt, const char *name,
 }
 
 static void show_line(struct grep_opt *opt,
-		      const char *bol, const char *eol,
+		      const char *bol, const char * const eol,
 		      const char *name, unsigned lno, ssize_t cno, char sign)
 {
 	int rest = eol - bol;
@@ -1306,7 +1307,7 @@ static inline void grep_attr_unlock(void)
 }
 
 static int match_funcname(struct grep_opt *opt, struct grep_source *gs,
-			  const char *bol, const char *eol)
+			  const char * const bol, const char * const eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
 	if (xecfg && !xecfg->find_func) {
@@ -1336,7 +1337,7 @@ static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
 			       const char *bol, unsigned lno)
 {
 	while (bol > gs->buf) {
-		const char *eol = --bol;
+		const char * const eol = --bol;
 
 		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
@@ -1352,7 +1353,7 @@ static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
 	}
 }
 
-static int is_empty_line(const char *bol, const char *eol);
+static int is_empty_line(const char * const bol, const char * const eol);
 
 static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
 			     const char *bol, const char *end, unsigned lno)
@@ -1375,8 +1376,8 @@ static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
 
 	/* Rewind. */
 	while (bol > gs->buf && cur > from) {
-		const char *next_bol = bol;
-		const char *eol = --bol;
+		const char * const next_bol = bol;
+		const char * const eol = --bol;
 
 		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
@@ -1438,7 +1439,7 @@ static int look_ahead(struct grep_opt *opt,
 		      const char **bol_p)
 {
 	unsigned lno = *lno_p;
-	const char *bol = *bol_p;
+	const char * const bol = *bol_p;
 	struct grep_pat *p;
 	const char *sp, *last_bol;
 	regoff_t earliest = -1;
@@ -1533,7 +1534,7 @@ static int fill_textconv_grep(struct repository *r,
 	return 0;
 }
 
-static int is_empty_line(const char *bol, const char *eol)
+static int is_empty_line(const char *bol, const char * const eol)
 {
 	while (bol < eol && isspace(*bol))
 		bol++;
-- 
2.33.0.1098.gf02a64c1a2d


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21  7:37           ` René Scharfe
@ 2021-09-21 14:24             ` Jeff King
  2021-09-21 21:02               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21 14:24 UTC (permalink / raw)
  To: René Scharfe
  Cc: Carlo Marcelo Arenas Belón, Eric Sunshine, Git List, Hamza Mahfooz

On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:

> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >  		bol += len;
> >  		switch (p->field) {
> >  		case GREP_HEADER_AUTHOR:
> > -		case GREP_HEADER_COMMITTER:
> > -			strip_timestamp(bol, &eol);
> > +		case GREP_HEADER_COMMITTER: {
> > +			char *em = memrchr(bol, '>', eol - bol);
> > +			if (em)
> > +				eol = em + 1;
> 
> The old code documents the intent via the function name.  The new one
> goes into the nitty-gritty without further explanation, which I find
> harder to read.

Agreed. I do think the conversion is functionally correct, but it
doesn't strike me as worth the change.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/5] grep: mark "haystack" buffers as const
  2021-09-21 12:04   ` Ævar Arnfjörð Bjarmason
@ 2021-09-21 14:27     ` Jeff King
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21 14:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Hamza Mahfooz

On Tue, Sep 21, 2021 at 02:04:12PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Mon, Sep 20 2021, Jeff King wrote:
> 
> > -static int patmatch(struct grep_pat *p, char *line, char *eol,
> > +static int patmatch(struct grep_pat *p,
> > +		    const char *line, const char *eol,
> >  		    regmatch_t *match, int eflags)
> 
> nit; not worth a re-roll, over-line-wrapping here? You usually stay at
> column 79. In this case both line and eol would fit on the patmatch()
> lin, which is being changed anyway...

True. I had to wrap all the others (and IMHO it is worth keeping "line"
and "eol" together), but I didn't calculate that this one can actually
fit the whole thing.

I dunno that it matters. The wrapping is consistent with the others,
even if it is not maximally filling the line.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/5] const-correctness in grep.c
  2021-09-21 12:07 ` Ævar Arnfjörð Bjarmason
@ 2021-09-21 14:49   ` Jeff King
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21 14:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Hamza Mahfooz

On Tue, Sep 21, 2021 at 02:07:08PM +0200, Ævar Arnfjörð Bjarmason wrote:

> This whole thing looks good to me. I only found a small whitespace nit
> in one of the patches. Did you consider following-up by having this code
> take const char*/const size_t pairs. E.g. starting with something like
> the below.

I do generally find ptr/len pairs to be easier to read, but they also
make it really easy to introduce subtle bugs. E.g., if you consume part
of a buffer, you have to tweak both the ptr and the len. So the current:

	while (word_char(bol[-1]) && bol < eol)
		bol++;

has to become:

	while (word_char(bol[-1] && len > 0) {
		bol++;
		len--;
	}

So I'd be hesitant to churn battle-tested code in such a way for what I
consider to be a pretty minor benefit.

I did notice the ugly use of "unsigned long" here in a few places
(rather than size_t). I do think it is worth fixing, but it seemed a
little too far to try to cram into this series (it's obviously touching
the same lines, but it's quite orthogonal semantically).

The other hesitation I had is that the source of this "unsigned long"
pattern is almost certainly the object code (which is much more
important to convert, as it blocks people from having >4GB objects on
Windows). So we might want to just wait for a larger conversion there.
OTOH, I don't think there is any downside to a partial conversion here
in the meantime (because size_t will always be at least as long as
"unsigned long" in practice).

> -static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
> +static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
>  		regmatch_t *match, int eflags)
>  {
>  	int ret, flags = 0;
> @@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>  
>  	if (p->pcre2_jit_on)
>  		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
> -				      eol - line, 0, flags, p->pcre2_match_data,
> +				      len, 0, flags, p->pcre2_match_data,
>  				      NULL);
>  	else
>  		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
> -				  eol - line, 0, flags, p->pcre2_match_data,
> +				  len, 0, flags, p->pcre2_match_data,
>  				  NULL);

Not related to your point, but these casts are funny now. They are meant
to cast to "unsigned char" pointers to match pcre's signature, but now
they are casting away const-ness, too. That might be worth fixing as
part of this series.

Though should they really be casting to PCRE2_SPTR? The types are opaque
in their API because of the weird multi-width thing, though I find it
hard to imagine us ever using the wider versions of the library.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21 12:45 ` [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const" Ævar Arnfjörð Bjarmason
@ 2021-09-21 14:53   ` Jeff King
  2021-09-21 15:17     ` Ævar Arnfjörð Bjarmason
  2021-09-22 19:02     ` Junio C Hamano
  0 siblings, 2 replies; 32+ messages in thread
From: Jeff King @ 2021-09-21 14:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Hamza Mahfooz, Taylor Blau

On Tue, Sep 21, 2021 at 02:45:16PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I think that generally git's codebase could use going beyond just
> "const char *" when a "const char * const" would suffice, for some
> reason we seem to mostly use it for the static usage variables.

I didn't dig up the references in the list archive, but I feel like
we've had this discussion long ago. One of the reasons not to do so is
that it pollutes the function's interface with internal details. The
caller does not care whether the function is going to modify the pointer
itself, because it is passed by value. You could apply the same logic
that we should be passing "const int", and so on.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21 14:53   ` Jeff King
@ 2021-09-21 15:17     ` Ævar Arnfjörð Bjarmason
  2021-09-21 19:18       ` Jeff King
  2021-09-22 19:02     ` Junio C Hamano
  1 sibling, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-21 15:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Hamza Mahfooz, Taylor Blau


On Tue, Sep 21 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 02:45:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I think that generally git's codebase could use going beyond just
>> "const char *" when a "const char * const" would suffice, for some
>> reason we seem to mostly use it for the static usage variables.
>
> I didn't dig up the references in the list archive, but I feel like
> we've had this discussion long ago. One of the reasons not to do so is
> that it pollutes the function's interface with internal details.[...]

Are there cases in my conversion where the caller has to do anything
special that they didn't before? These are also all static functions, so
it's all internal details exported to nobody.

> The caller does not care whether the function is going to modify the
> pointer itself, because it is passed by value.

Sure, it's for increased clarity of reading te function in question, not
its although in one case we pass a ** so you can see what exactly we
modify both from the callers and function perspective.

> You could apply the same logic that we should be passing "const int",
> and so on.

Sure, in general I think churn like that isn't worth it, and "const char
*" is usually good enough, even if it could be "const char * const" or
whatever.

I think it makes senes in this specific case where you need to read one
function after another with "bol" and "eol" variables, with some
treating their copy as immutable, others not etc. Particularly if we'd
convert it to some other style (e.g. str/len) we can see if an entire
chain of functions can all be safely changed over.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21 15:17     ` Ævar Arnfjörð Bjarmason
@ 2021-09-21 19:18       ` Jeff King
  2021-09-23 13:56         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-21 19:18 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Hamza Mahfooz, Taylor Blau

On Tue, Sep 21, 2021 at 05:17:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Sep 21 2021, Jeff King wrote:
> 
> > On Tue, Sep 21, 2021 at 02:45:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >
> >> I think that generally git's codebase could use going beyond just
> >> "const char *" when a "const char * const" would suffice, for some
> >> reason we seem to mostly use it for the static usage variables.
> >
> > I didn't dig up the references in the list archive, but I feel like
> > we've had this discussion long ago. One of the reasons not to do so is
> > that it pollutes the function's interface with internal details.[...]
> 
> Are there cases in my conversion where the caller has to do anything
> special that they didn't before? These are also all static functions, so
> it's all internal details exported to nobody.

No, they don't have to do anything differently. I just meant that it
clutters the interface when a human is reading it.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21 14:24             ` Jeff King
@ 2021-09-21 21:02               ` Ævar Arnfjörð Bjarmason
  2021-09-22 20:20                 ` Jeff King
  0 siblings, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-21 21:02 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Carlo Marcelo Arenas Belón,
	Eric Sunshine, Git List, Hamza Mahfooz


On Tue, Sep 21 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>
>> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >  		bol += len;
>> >  		switch (p->field) {
>> >  		case GREP_HEADER_AUTHOR:
>> > -		case GREP_HEADER_COMMITTER:
>> > -			strip_timestamp(bol, &eol);
>> > +		case GREP_HEADER_COMMITTER: {
>> > +			char *em = memrchr(bol, '>', eol - bol);
>> > +			if (em)
>> > +				eol = em + 1;
>> 
>> The old code documents the intent via the function name.  The new one
>> goes into the nitty-gritty without further explanation, which I find
>> harder to read.
>
> Agreed. I do think the conversion is functionally correct, but it
> doesn't strike me as worth the change.

As far as some general improvement in thish area it seems to me that
this whole subthread is losing the forest for the trees.

We're operating a buffer that has "\n"'s in it, and then we loop and
split it up one line at a time, often just to match author OR
committer.

We then have things like commit_rewrite_person() in revision.c that's
preparing a "fake" commit just for grep.c's use, just for it to strip
those headers down again.

A real improvement in this area IMO would be trying to bridge the gap
between parse_commit_buffer() and grep.c's match_one_pattern().

I.e. we've even parsed this once before in commit.c's
parse_commit_date() by the time it gets to grep.c, now we're just doing
it again.

It probably makes sense to split up that commit.c code into something
that can give you structured output, i.e. headers with types and
start/end points for interesting data, then in grep.c we won't need a
strip_anything(), or strrchr() or memrchr() or whatever.

It would also be a lot faster for grepping if we could offload more of
this work to the regex engine, particularly if we've got a more capable
engine like PCREv2.

In many cases we're splitting lines ourselves, when we could have the
engine work in a multi-line mode, or translate the user's --author match
into something that can match the raw commit header. So have an implicit
/^author /m anchor if we're matching author headers, instead of having
grep.c string-twiddle that around one line at a time.

Just my 0.02.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/5] const-correctness in grep.c
  2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
                   ` (7 preceding siblings ...)
  2021-09-21 12:45 ` [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const" Ævar Arnfjörð Bjarmason
@ 2021-09-22 18:57 ` Junio C Hamano
  8 siblings, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2021-09-22 18:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Hamza Mahfooz

Jeff King <peff@peff.net> writes:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

Yes, the haystack has not been read-only exactly because we didn't
have <ptr,size> based regexec variant when the grep machinery was
written, and there is no reason why we want to use the "temporarily
terminate by swapping the byte with a NUL" trick.

It always is a pleasure to read such a concise and to-the-point
summary.  With a clear summary like that, a reader almost does not
have to see the patch to guess how the rest of the story goes ;-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21 14:53   ` Jeff King
  2021-09-21 15:17     ` Ævar Arnfjörð Bjarmason
@ 2021-09-22 19:02     ` Junio C Hamano
  1 sibling, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2021-09-22 19:02 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, git, Hamza Mahfooz, Taylor Blau

Jeff King <peff@peff.net> writes:

> On Tue, Sep 21, 2021 at 02:45:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I think that generally git's codebase could use going beyond just
>> "const char *" when a "const char * const" would suffice, for some
>> reason we seem to mostly use it for the static usage variables.
>
> I didn't dig up the references in the list archive, but I feel like
> we've had this discussion long ago. One of the reasons not to do so is
> that it pollutes the function's interface with internal details. The
> caller does not care whether the function is going to modify the pointer
> itself, because it is passed by value. You could apply the same logic
> that we should be passing "const int", and so on.

Yes.  "This pointer is not modified" is a good thing to have inside
an implementation, but the callers should not have to care.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-21 21:02               ` Ævar Arnfjörð Bjarmason
@ 2021-09-22 20:20                 ` Jeff King
  2021-09-23  0:53                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2021-09-22 20:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: René Scharfe, Carlo Marcelo Arenas Belón,
	Eric Sunshine, Git List, Hamza Mahfooz

On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Sep 21 2021, Jeff King wrote:
> 
> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
> >
> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >> >  		bol += len;
> >> >  		switch (p->field) {
> >> >  		case GREP_HEADER_AUTHOR:
> >> > -		case GREP_HEADER_COMMITTER:
> >> > -			strip_timestamp(bol, &eol);
> >> > +		case GREP_HEADER_COMMITTER: {
> >> > +			char *em = memrchr(bol, '>', eol - bol);
> >> > +			if (em)
> >> > +				eol = em + 1;
> >> 
> >> The old code documents the intent via the function name.  The new one
> >> goes into the nitty-gritty without further explanation, which I find
> >> harder to read.
> >
> > Agreed. I do think the conversion is functionally correct, but it
> > doesn't strike me as worth the change.
> 
> As far as some general improvement in thish area it seems to me that
> this whole subthread is losing the forest for the trees.

I'd definitely agree with that. :)

> It probably makes sense to split up that commit.c code into something
> that can give you structured output, i.e. headers with types and
> start/end points for interesting data, then in grep.c we won't need a
> strip_anything(), or strrchr() or memrchr() or whatever.

I don't disagree with any of this, either, but I think it's a separate
(and much more complicated) topic than what this series is dealing with.
So my preference would be to take this as an immediate improvement, and
let anything like that get built on top.

> It would also be a lot faster for grepping if we could offload more of
> this work to the regex engine, particularly if we've got a more capable
> engine like PCREv2.
> 
> In many cases we're splitting lines ourselves, when we could have the
> engine work in a multi-line mode, or translate the user's --author match
> into something that can match the raw commit header. So have an implicit
> /^author /m anchor if we're matching author headers, instead of having
> grep.c string-twiddle that around one line at a time.

Ditto here. Multi-line matching may make things a lot more efficient,
but I think is out of scope for this series. That seems like an
interesting area for future work.

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp
  2021-09-22 20:20                 ` Jeff King
@ 2021-09-23  0:53                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-23  0:53 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Carlo Marcelo Arenas Belón,
	Eric Sunshine, Git List, Hamza Mahfooz


On Wed, Sep 22 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> 
>> On Tue, Sep 21 2021, Jeff King wrote:
>> 
>> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>> >
>> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >> >  		bol += len;
>> >> >  		switch (p->field) {
>> >> >  		case GREP_HEADER_AUTHOR:
>> >> > -		case GREP_HEADER_COMMITTER:
>> >> > -			strip_timestamp(bol, &eol);
>> >> > +		case GREP_HEADER_COMMITTER: {
>> >> > +			char *em = memrchr(bol, '>', eol - bol);
>> >> > +			if (em)
>> >> > +				eol = em + 1;
>> >> 
>> >> The old code documents the intent via the function name.  The new one
>> >> goes into the nitty-gritty without further explanation, which I find
>> >> harder to read.
>> >
>> > Agreed. I do think the conversion is functionally correct, but it
>> > doesn't strike me as worth the change.
>> 
>> As far as some general improvement in thish area it seems to me that
>> this whole subthread is losing the forest for the trees.
>
> I'd definitely agree with that. :)
>
>> It probably makes sense to split up that commit.c code into something
>> that can give you structured output, i.e. headers with types and
>> start/end points for interesting data, then in grep.c we won't need a
>> strip_anything(), or strrchr() or memrchr() or whatever.
>
> I don't disagree with any of this, either, but I think it's a separate
> (and much more complicated) topic than what this series is dealing with.
> So my preference would be to take this as an immediate improvement, and
> let anything like that get built on top.
>
>> It would also be a lot faster for grepping if we could offload more of
>> this work to the regex engine, particularly if we've got a more capable
>> engine like PCREv2.
>> 
>> In many cases we're splitting lines ourselves, when we could have the
>> engine work in a multi-line mode, or translate the user's --author match
>> into something that can match the raw commit header. So have an implicit
>> /^author /m anchor if we're matching author headers, instead of having
>> grep.c string-twiddle that around one line at a time.
>
> Ditto here. Multi-line matching may make things a lot more efficient,
> but I think is out of scope for this series. That seems like an
> interesting area for future work.

Indeed, *way* outside this series. I'm I think the change you suggested
here makes sense for this change, just pointing out that for any
follow-up changes it's much more worthwhile to consider a few more
stackframes than just the 1-2 that involve those strings in that
particular form.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-21 19:18       ` Jeff King
@ 2021-09-23 13:56         ` Ævar Arnfjörð Bjarmason
  2021-09-24  4:22           ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-23 13:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Hamza Mahfooz, Taylor Blau


On Tue, Sep 21 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 05:17:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> 
>> On Tue, Sep 21 2021, Jeff King wrote:
>> 
>> > On Tue, Sep 21, 2021 at 02:45:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >
>> >> I think that generally git's codebase could use going beyond just
>> >> "const char *" when a "const char * const" would suffice, for some
>> >> reason we seem to mostly use it for the static usage variables.
>> >
>> > I didn't dig up the references in the list archive, but I feel like
>> > we've had this discussion long ago. One of the reasons not to do so is
>> > that it pollutes the function's interface with internal details.[...]
>> 
>> Are there cases in my conversion where the caller has to do anything
>> special that they didn't before? These are also all static functions, so
>> it's all internal details exported to nobody.
>
> No, they don't have to do anything differently. I just meant that it
> clutters the interface when a human is reading it.

Fair enough, to this human reading it (and I'm not new to the grep.c
code) I find myself trying and failing to mentally track what gets
modified where, but anyway, I think that point is understood and I won't
argue for it ad nauseum.

Just replied to say to Junio's <xmqqh7ec77s3.fsf@gitster.g> in the
side-thread:

> Yes.  "This pointer is not modified" is a good thing to have inside
> an implementation, but the callers should not have to care.

That I just don't understand, i.e. how the variable is defined in this
6/5 pertains to how it gets used /inside/ the function in this case, the
caller code doesn't have to care, but perhaps I'm missing something...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const"
  2021-09-23 13:56         ` Ævar Arnfjörð Bjarmason
@ 2021-09-24  4:22           ` Junio C Hamano
  0 siblings, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2021-09-24  4:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, git, Hamza Mahfooz, Taylor Blau

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> That I just don't understand, i.e. how the variable is defined in this
> 6/5 pertains to how it gets used /inside/ the function in this case, the
> caller code doesn't have to care, but perhaps I'm missing something...


-static struct grep_pat *create_grep_pat(const char *pat, size_t patlen,
+static struct grep_pat *create_grep_pat(const char *const pat, size_t patlen,
 					const char *origin, int no,
 					enum grep_pat_token t,
 					enum grep_header_field field)

Because the function signature is visible to the caller of this
function, it sees that the callee, as part of its implementation
detail, keeps "pat" pointer pointing at the same location.  That is
not something the caller needs to care, even though the caller may
deeply care that the memory locations "pat" points at will not be
changed via the pointer by calling this function.

On the other hand

@@ -1438,7 +1439,7 @@ static int look_ahead(struct grep_opt *opt,
 		      const char **bol_p)
 {
 	unsigned lno = *lno_p;
-	const char *bol = *bol_p;
+	const char * const bol = *bol_p;
 	struct grep_pat *p;
 	const char *sp, *last_bol;
 	regoff_t earliest = -1;

This one is totally invisible to the caller of the function, and is
a good documentation for those who read the implementation of it.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-09-24  4:22 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-21  3:45 [PATCH 0/5] const-correctness in grep.c Jeff King
2021-09-21  3:46 ` [PATCH 1/5] grep: stop modifying buffer in strip_timestamp Jeff King
2021-09-21  5:18   ` Carlo Arenas
2021-09-21  5:24     ` Eric Sunshine
2021-09-21  5:40       ` Carlo Arenas
2021-09-21  5:43       ` Jeff King
2021-09-21  6:42         ` Carlo Marcelo Arenas Belón
2021-09-21  7:37           ` René Scharfe
2021-09-21 14:24             ` Jeff King
2021-09-21 21:02               ` Ævar Arnfjörð Bjarmason
2021-09-22 20:20                 ` Jeff King
2021-09-23  0:53                   ` Ævar Arnfjörð Bjarmason
2021-09-21  3:48 ` [PATCH 2/5] grep: stop modifying buffer in show_line() Jeff King
2021-09-21  4:22   ` Taylor Blau
2021-09-21  4:42     ` Jeff King
2021-09-21  4:45       ` Taylor Blau
2021-09-21  3:48 ` [PATCH 3/5] grep: stop modifying buffer in grep_source_1() Jeff King
2021-09-21  3:49 ` [PATCH 4/5] grep: mark "haystack" buffers as const Jeff King
2021-09-21 12:04   ` Ævar Arnfjörð Bjarmason
2021-09-21 14:27     ` Jeff King
2021-09-21  3:51 ` [PATCH 5/5] grep: store grep_source buffer " Jeff King
2021-09-21  4:30 ` [PATCH 0/5] const-correctness in grep.c Taylor Blau
2021-09-21 12:07 ` Ævar Arnfjörð Bjarmason
2021-09-21 14:49   ` Jeff King
2021-09-21 12:45 ` [PATCH 6/5] grep.c: mark eol/bol and derived as "const char * const" Ævar Arnfjörð Bjarmason
2021-09-21 14:53   ` Jeff King
2021-09-21 15:17     ` Ævar Arnfjörð Bjarmason
2021-09-21 19:18       ` Jeff King
2021-09-23 13:56         ` Ævar Arnfjörð Bjarmason
2021-09-24  4:22           ` Junio C Hamano
2021-09-22 19:02     ` Junio C Hamano
2021-09-22 18:57 ` [PATCH 0/5] const-correctness in grep.c Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.