* [PATCH] fast-import: Reinitialize command_buf rather than detach it. @ 2019-08-25 4:13 Mike Hommey 2019-08-25 6:57 ` Jeff King 2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe 0 siblings, 2 replies; 15+ messages in thread From: Mike Hommey @ 2019-08-25 4:13 UTC (permalink / raw) To: git; +Cc: gitster command_buf.buf is also stored in cmd_hist, so instead of strbuf_release, the current code uses strbuf_detach in order to "leak" the buffer as far as the strbuf is concerned. However, strbuf_detach does more than "leak" the strbuf buffer: it possibly reallocates it to ensure a terminating nul character. And when that happens, what is already stored in cmd_hist may now be a free()d buffer. In practice, though, command_buf.buf is most of the time a nul terminated string, meaning command_buf.len < command_buf.alloc, and strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does allocate a 1 byte buffer to store a nul character in it, which is then leaked. Since the code using strbuf_detach is assuming it does nothing to command_buf.buf, it's overall safer to use strbuf_init, which has the same practical effect in the usual case, and works appropriately when command_buf is empty. --- fast-import.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/fast-import.c b/fast-import.c index b44d6a467e..b1d07efe8c 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1763,7 +1763,9 @@ static int read_next_command(void) } else { struct recent_command *rc; - strbuf_detach(&command_buf, NULL); + // command_buf is enther empty or also stored in cmd_hist, + // reinitialize it. + strbuf_init(&command_buf, 0); stdin_eof = strbuf_getline_lf(&command_buf, stdin); if (stdin_eof) return EOF; @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) char *term = xstrdup(data); size_t term_len = command_buf.len - (data - command_buf.buf); - strbuf_detach(&command_buf, NULL); + // command_buf is enther empty or also stored in cmd_hist, + // reinitialize it. + strbuf_init(&command_buf, 0); for (;;) { if (strbuf_getline_lf(&command_buf, stdin) == EOF) die("EOF in data (terminator '%s' not found)", term); -- 2.23.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it. 2019-08-25 4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey @ 2019-08-25 6:57 ` Jeff King 2019-08-25 7:20 ` Mike Hommey 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King 2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe 1 sibling, 2 replies; 15+ messages in thread From: Jeff King @ 2019-08-25 6:57 UTC (permalink / raw) To: Mike Hommey; +Cc: git, gitster On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote: > command_buf.buf is also stored in cmd_hist, so instead of > strbuf_release, the current code uses strbuf_detach in order to > "leak" the buffer as far as the strbuf is concerned. > > However, strbuf_detach does more than "leak" the strbuf buffer: it > possibly reallocates it to ensure a terminating nul character. And when > that happens, what is already stored in cmd_hist may now be a free()d > buffer. > > In practice, though, command_buf.buf is most of the time a nul > terminated string, meaning command_buf.len < command_buf.alloc, and > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > allocate a 1 byte buffer to store a nul character in it, which is then > leaked. I think this is stronger than just "most of the time". It's an invariant for strbufs to have a NUL, so the only case where detaching isn't a noop is the empty slopbuf case you mention. Splitting hairs, perhaps, but I think with that explanation, we could probably argue that this case will never come up: strbuf_getline will either have allocated a buffer or will have returned EOF. That said, I do think it's quite confusing and is worth fixing, both for readability and for future-proofing. But... > diff --git a/fast-import.c b/fast-import.c > index b44d6a467e..b1d07efe8c 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -1763,7 +1763,9 @@ static int read_next_command(void) > } else { > struct recent_command *rc; > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); This whole thing is a sign that the code is Doing It Wrong. Whoever is taking ownership of the buffer should be calling strbuf_detach() at that point. It's a bit tricky, though, because we take ownership and then expect people still look at command_buf.buf. Which makes me concerned that there are other operations which might modify the buffer and have the same issue. I think it would be much easier to follow if we simply used the same command_buf over and over, and just copied into the history. The cost is about the same (we still alloc once per line, though we do an extra memcpy now). So I thought something like this would work (we don't need to convert those detaches into a strbuf_reset() calls because strbuf_getline does so automatically): diff --git a/fast-import.c b/fast-import.c index b44d6a467e..31207acd61 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1763,7 +1763,6 @@ static int read_next_command(void) } else { struct recent_command *rc; - strbuf_detach(&command_buf, NULL); stdin_eof = strbuf_getline_lf(&command_buf, stdin); if (stdin_eof) return EOF; @@ -1784,7 +1783,7 @@ static int read_next_command(void) free(rc->buf); } - rc->buf = command_buf.buf; + rc->buf = xstrdup(command_buf.buf); rc->prev = cmd_tail; rc->next = cmd_hist.prev; rc->prev->next = rc; @@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) char *term = xstrdup(data); size_t term_len = command_buf.len - (data - command_buf.buf); - strbuf_detach(&command_buf, NULL); for (;;) { if (strbuf_getline_lf(&command_buf, stdin) == EOF) die("EOF in data (terminator '%s' not found)", term); But it doesn't! It turns out that there are other places in the code which assume they can call read_next_command() while hanging onto the existing buffer. Which only works in the old code because the history feature happens to hold on to the old pointer. If I do this on top, then all tests pass: diff --git a/fast-import.c b/fast-import.c index 31207acd61..1f9160b645 100644 --- a/fast-import.c +++ b/fast-import.c @@ -2586,7 +2586,7 @@ static void parse_new_commit(const char *arg) struct branch *b; char *author = NULL; char *committer = NULL; - const char *encoding = NULL; + char *encoding = NULL; struct hash_list *merge_list = NULL; unsigned int merge_count; unsigned char prev_fanout, new_fanout; @@ -2609,8 +2609,10 @@ static void parse_new_commit(const char *arg) } if (!committer) die("Expected committer but didn't get one"); - if (skip_prefix(command_buf.buf, "encoding ", &encoding)) + if (skip_prefix(command_buf.buf, "encoding ", &v)) { + encoding = xstrdup(v); read_next_command(); + } parse_data(&msg, 0, NULL); read_next_command(); parse_from(b); @@ -2684,6 +2686,7 @@ static void parse_new_commit(const char *arg) strbuf_addbuf(&new_data, &msg); free(author); free(committer); + free(encoding); if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark)) b->pack_id = pack_id; And I think this is actually a real bug in the current code! We keep a pointer to the encoding string, which survives because of the history. But that history is bounded, and we could have an indefinite number of changed files in the middle. If I modify t9300 like this: diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 141b7fa35e..d4bbe630d5 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' ' printf "Pi: \360\nCOMMIT\n" >>input && + for i in $(test_seq 200) + do + echo "M 644 $EMPTY_BLOB file-$i" + done >>input && + git fast-import <input && git cat-file -p encoding | grep $(printf "\360") && git log -1 --format=%B encoding | grep $(printf "\317\200") and run the test suite (using an unmodified git, without the earlier patches I showed) then the test fails, putting garbage into the encoding header (and when compiled with ASan, reports a use-after-free). So I think the way the string handling is currently done papers over such problems. You only see the problem if you have a hundred or more modified files, so it works _most_ of the time. That implies to me it's worth following a fix like the one I showed above. I am slightly concerned there are other similar cases to the "encoding" one lurking (and they _might_ not be bugs already, if there's a fixed number of reads between the pointer being saved and being used), but it seems worth the risk. -Peff ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it. 2019-08-25 6:57 ` Jeff King @ 2019-08-25 7:20 ` Mike Hommey 2019-08-25 7:28 ` Jeff King 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King 1 sibling, 1 reply; 15+ messages in thread From: Mike Hommey @ 2019-08-25 7:20 UTC (permalink / raw) To: Jeff King; +Cc: git, gitster On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote: > On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote: > > > command_buf.buf is also stored in cmd_hist, so instead of > > strbuf_release, the current code uses strbuf_detach in order to > > "leak" the buffer as far as the strbuf is concerned. > > > > However, strbuf_detach does more than "leak" the strbuf buffer: it > > possibly reallocates it to ensure a terminating nul character. And when > > that happens, what is already stored in cmd_hist may now be a free()d > > buffer. > > > > In practice, though, command_buf.buf is most of the time a nul > > terminated string, meaning command_buf.len < command_buf.alloc, and > > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > > allocate a 1 byte buffer to store a nul character in it, which is then > > leaked. > > I think this is stronger than just "most of the time". It's an invariant > for strbufs to have a NUL, so the only case where detaching isn't a noop > is the empty slopbuf case you mention. If it's an invariant, why does detach actively tries to realloc and set the nul terminator as if it can happen in more cases than when using the slopbuf? > Splitting hairs, perhaps, but I think with that explanation, we could > probably argue that this case will never come up: strbuf_getline will > either have allocated a buffer or will have returned EOF. Note that the slopbuf case _does_ come up, and we always leak a 1 byte buffer. > That said, I do think it's quite confusing and is worth fixing, both for > readability and for future-proofing. But... (...) I do agree the way fast-import works between cmd_hist and command_buf is very brittle, as you've shown. I didn't feel like digging into it though. Thanks for having gone further than I did. Mike ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it. 2019-08-25 7:20 ` Mike Hommey @ 2019-08-25 7:28 ` Jeff King 0 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2019-08-25 7:28 UTC (permalink / raw) To: Mike Hommey; +Cc: git, gitster On Sun, Aug 25, 2019 at 04:20:31PM +0900, Mike Hommey wrote: > > I think this is stronger than just "most of the time". It's an invariant > > for strbufs to have a NUL, so the only case where detaching isn't a noop > > is the empty slopbuf case you mention. > > If it's an invariant, why does detach actively tries to realloc and set > the nul terminator as if it can happen in more cases than when using the > slopbuf? It calls strbuf_grow() to handle the slopbuf case (we can't hand off the slopbuf, since the caller expects an allocated buffer). It just doesn't bother to distinguish that case itself, and lets strbuf_grow() handle it. I think it would be equally correct for strbuf_detach() to do: if (!sb->alloc) strbuf_grow(0); > > Splitting hairs, perhaps, but I think with that explanation, we could > > probably argue that this case will never come up: strbuf_getline will > > either have allocated a buffer or will have returned EOF. > > Note that the slopbuf case _does_ come up, and we always leak a 1 byte > buffer. Hmm, I suppose so, on the very first call before we've read anything (and likewise if parse_data() reset it then got an EOF, and we then tried to read another command). > I do agree the way fast-import works between cmd_hist and command_buf is > very brittle, as you've shown. I didn't feel like digging into it > though. Thanks for having gone further than I did. I'll see if I can shape my rambling into a patch. -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 0/2] fast-import input string handling bugs 2019-08-25 6:57 ` Jeff King 2019-08-25 7:20 ` Mike Hommey @ 2019-08-25 8:06 ` Jeff King 2019-08-25 8:08 ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King ` (3 more replies) 1 sibling, 4 replies; 15+ messages in thread From: Jeff King @ 2019-08-25 8:06 UTC (permalink / raw) To: Mike Hommey; +Cc: Elijah Newren, git, gitster On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote: > And I think this is actually a real bug in the current code! We keep a > pointer to the encoding string, which survives because of the history. > But that history is bounded, and we could have an indefinite number of > changed files in the middle. If I modify t9300 like this: Here are two patches. The first fixes the existing bug with "encoding", and the second uses the approach I suggested to fix the leak you noticed. The second one does carry a greater risk of regression than your patch, but I think it's worth it for the fact that it makes any other bugs (like the "encoding" one) more obvious. [1/2]: fast-import: duplicate parsed encoding string [2/2]: fast-import: duplicate into history rather than passing ownership fast-import.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/2] fast-import: duplicate parsed encoding string 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King @ 2019-08-25 8:08 ` Jeff King 2019-08-26 18:28 ` Elijah Newren 2019-08-25 8:10 ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King ` (2 subsequent siblings) 3 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2019-08-25 8:08 UTC (permalink / raw) To: Mike Hommey; +Cc: Elijah Newren, git, gitster We read each line of the fast-import stream into the command_buf strbuf. When reading a commit, we parse a line like "encoding foo" by storing a pointer to "foo", but not making a copy. We may then read an unbounded number of other lines (e.g., one for each modified file in the commit), each of which writes into command_buf. This works out in practice for small cases, because we hand off ownership of the heap buffer from command_buf to the cmd_hist array, and read new commands into a fresh heap buffer. And thus the pointer to "foo" remains valid as long as there aren't so many intermediate lines that we end up dropping the original "encoding" line from the history. But as the test modification shows, if we go over our default of 100 lines, we end up with our encoding string pointing into freed heap memory. This seems to fail reliably by writing garbage into the output, but running under ASan definitely detects this as a user-after-free. We can fix it by duplicating the encoding value, just as we do for other parsed lines (e.g., an author line ends up in parse_ident, which copies it to a new string). Signed-off-by: Jeff King <peff@peff.net> --- fast-import.c | 7 +++++-- t/t9300-fast-import.sh | 5 +++++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/fast-import.c b/fast-import.c index b44d6a467e..ee7258037a 100644 --- a/fast-import.c +++ b/fast-import.c @@ -2588,7 +2588,7 @@ static void parse_new_commit(const char *arg) struct branch *b; char *author = NULL; char *committer = NULL; - const char *encoding = NULL; + char *encoding = NULL; struct hash_list *merge_list = NULL; unsigned int merge_count; unsigned char prev_fanout, new_fanout; @@ -2611,8 +2611,10 @@ static void parse_new_commit(const char *arg) } if (!committer) die("Expected committer but didn't get one"); - if (skip_prefix(command_buf.buf, "encoding ", &encoding)) + if (skip_prefix(command_buf.buf, "encoding ", &v)) { + encoding = xstrdup(v); read_next_command(); + } parse_data(&msg, 0, NULL); read_next_command(); parse_from(b); @@ -2686,6 +2688,7 @@ static void parse_new_commit(const char *arg) strbuf_addbuf(&new_data, &msg); free(author); free(committer); + free(encoding); if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark)) b->pack_id = pack_id; diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 141b7fa35e..cf66b40ebc 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' ' printf "Pi: \360\nCOMMIT\n" >>input && + for i in $(test_seq 100) + do + echo "M 644 $EMPTY_BLOB file-$i" + done >>input && + git fast-import <input && git cat-file -p encoding | grep $(printf "\360") && git log -1 --format=%B encoding | grep $(printf "\317\200") -- 2.23.0.478.g23872bed7d ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fast-import: duplicate parsed encoding string 2019-08-25 8:08 ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King @ 2019-08-26 18:28 ` Elijah Newren 2019-08-26 18:44 ` Jeff King 0 siblings, 1 reply; 15+ messages in thread From: Elijah Newren @ 2019-08-26 18:28 UTC (permalink / raw) To: Jeff King; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote: > > We read each line of the fast-import stream into the command_buf strbuf. > When reading a commit, we parse a line like "encoding foo" by storing a > pointer to "foo", but not making a copy. We may then read an unbounded > number of other lines (e.g., one for each modified file in the commit), > each of which writes into command_buf. > > This works out in practice for small cases, because we hand off > ownership of the heap buffer from command_buf to the cmd_hist array, and > read new commands into a fresh heap buffer. And thus the pointer to > "foo" remains valid as long as there aren't so many intermediate lines > that we end up dropping the original "encoding" line from the history. > > But as the test modification shows, if we go over our default of 100 > lines, we end up with our encoding string pointing into freed heap > memory. This seems to fail reliably by writing garbage into the output, > but running under ASan definitely detects this as a user-after-free. s/user-after-free/use-after-free/ > We can fix it by duplicating the encoding value, just as we do for other > parsed lines (e.g., an author line ends up in parse_ident, which copies > it to a new string). Eek! Thanks for fixing this up for me; patch looks good. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fast-import: duplicate parsed encoding string 2019-08-26 18:28 ` Elijah Newren @ 2019-08-26 18:44 ` Jeff King 0 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2019-08-26 18:44 UTC (permalink / raw) To: Elijah Newren; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano On Mon, Aug 26, 2019 at 11:28:58AM -0700, Elijah Newren wrote: > On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote: > > > > We read each line of the fast-import stream into the command_buf strbuf. > > When reading a commit, we parse a line like "encoding foo" by storing a > > pointer to "foo", but not making a copy. We may then read an unbounded > > number of other lines (e.g., one for each modified file in the commit), > > each of which writes into command_buf. > > > > This works out in practice for small cases, because we hand off > > ownership of the heap buffer from command_buf to the cmd_hist array, and > > read new commands into a fresh heap buffer. And thus the pointer to > > "foo" remains valid as long as there aren't so many intermediate lines > > that we end up dropping the original "encoding" line from the history. > > > > But as the test modification shows, if we go over our default of 100 > > lines, we end up with our encoding string pointing into freed heap > > memory. This seems to fail reliably by writing garbage into the output, > > but running under ASan definitely detects this as a user-after-free. > > s/user-after-free/use-after-free/ Wow. I self-corrected "user-after-free" at least three other times while writing this thread. I clearly have a problem. :) > > We can fix it by duplicating the encoding value, just as we do for other > > parsed lines (e.g., an author line ends up in parse_ident, which copies > > it to a new string). > > Eek! Thanks for fixing this up for me; patch looks good. No problem. On the plus side, finding your bug made me think much harder about the implications of patch 2 (because it's quite subtle and an easy mistake to make). -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/2] fast-import: duplicate into history rather than passing ownership 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King 2019-08-25 8:08 ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King @ 2019-08-25 8:10 ` Jeff King 2019-08-25 10:02 ` Mike Hommey 2019-08-26 15:36 ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano 2019-08-26 19:18 ` Elijah Newren 3 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2019-08-25 8:10 UTC (permalink / raw) To: Mike Hommey; +Cc: Elijah Newren, git, gitster Fast-import's read_next_command() has somewhat odd memory ownership semantics for the command_buf strbuf. After reading a command, we copy the strbuf's pointer (without duplicating the string) into our cmd_hist array of recent commands. And then when we're about to read a new command, we clear the strbuf by calling strbuf_detach(), dropping ownership from the strbuf (leaving the cmd_hist reference as the remaining owner). This has a few surprising implications: - if the strbuf hasn't been copied into cmd_hist (e.g., because we haven't ready any commands yet), then the strbuf_detach() will leak the resulting string - any modification to command_buf risks invalidating the pointer held by cmd_hist. There doesn't seem to be any way to trigger this currently (since we tend to modify it only by detaching and reading in a new value), but it's subtly dangerous. - any pointers into an input string will remain valid as long as cmd_hist points to them. So in general, you can point into command_buf.buf and call read_next_command() up to 100 times before your string is cycled out and freed, leaving you with a dangling pointer. This makes it easy to miss bugs during testing, as they might trigger only for a sufficiently large commit (e.g., the bug fixed in the previous commit). Instead, let's make a new string to copy the command into the history array, rather than having dual ownership with the old. Then we can drop the strbuf_detach() calls entirely, and just reuse the same buffer within command_buf over and over. We'd normally have to strbuf_reset() it before using it again, but in both cases here we're using strbuf_getline(), which does it automatically for us. This fixes the leak, and it means that even a single call to read_next_command() will invalidate any held pointers, making it easier to find bugs. In fact, we can drop the extra input lines added to the test case by the previous commit, as the unfixed bug would now trigger just from reading the commit message, even without any modified files in the commit. Reported-by: Mike Hommey <mh@glandium.org> Signed-off-by: Jeff King <peff@peff.net> --- fast-import.c | 4 +--- t/t9300-fast-import.sh | 5 ----- 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/fast-import.c b/fast-import.c index ee7258037a..1f9160b645 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1763,7 +1763,6 @@ static int read_next_command(void) } else { struct recent_command *rc; - strbuf_detach(&command_buf, NULL); stdin_eof = strbuf_getline_lf(&command_buf, stdin); if (stdin_eof) return EOF; @@ -1784,7 +1783,7 @@ static int read_next_command(void) free(rc->buf); } - rc->buf = command_buf.buf; + rc->buf = xstrdup(command_buf.buf); rc->prev = cmd_tail; rc->next = cmd_hist.prev; rc->prev->next = rc; @@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) char *term = xstrdup(data); size_t term_len = command_buf.len - (data - command_buf.buf); - strbuf_detach(&command_buf, NULL); for (;;) { if (strbuf_getline_lf(&command_buf, stdin) == EOF) die("EOF in data (terminator '%s' not found)", term); diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index cf66b40ebc..141b7fa35e 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -3314,11 +3314,6 @@ test_expect_success 'X: handling encoding' ' printf "Pi: \360\nCOMMIT\n" >>input && - for i in $(test_seq 100) - do - echo "M 644 $EMPTY_BLOB file-$i" - done >>input && - git fast-import <input && git cat-file -p encoding | grep $(printf "\360") && git log -1 --format=%B encoding | grep $(printf "\317\200") -- 2.23.0.478.g23872bed7d ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership 2019-08-25 8:10 ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King @ 2019-08-25 10:02 ` Mike Hommey 2019-08-25 14:21 ` René Scharfe 0 siblings, 1 reply; 15+ messages in thread From: Mike Hommey @ 2019-08-25 10:02 UTC (permalink / raw) To: Jeff King; +Cc: Elijah Newren, git, gitster On Sun, Aug 25, 2019 at 04:10:55AM -0400, Jeff King wrote: > Fast-import's read_next_command() has somewhat odd memory ownership > semantics for the command_buf strbuf. After reading a command, we copy > the strbuf's pointer (without duplicating the string) into our cmd_hist > array of recent commands. And then when we're about to read a new > command, we clear the strbuf by calling strbuf_detach(), dropping > ownership from the strbuf (leaving the cmd_hist reference as the > remaining owner). > > This has a few surprising implications: > > - if the strbuf hasn't been copied into cmd_hist (e.g., because we > haven't ready any commands yet), then the strbuf_detach() will leak > the resulting string > > - any modification to command_buf risks invalidating the pointer held > by cmd_hist. There doesn't seem to be any way to trigger this > currently (since we tend to modify it only by detaching and reading > in a new value), but it's subtly dangerous. > > - any pointers into an input string will remain valid as long as > cmd_hist points to them. So in general, you can point into > command_buf.buf and call read_next_command() up to 100 times before > your string is cycled out and freed, leaving you with a dangling > pointer. This makes it easy to miss bugs during testing, as they > might trigger only for a sufficiently large commit (e.g., the bug > fixed in the previous commit). > > Instead, let's make a new string to copy the command into the history > array, rather than having dual ownership with the old. Then we can drop > the strbuf_detach() calls entirely, and just reuse the same buffer > within command_buf over and over. We'd normally have to strbuf_reset() > it before using it again, but in both cases here we're using > strbuf_getline(), which does it automatically for us. > > This fixes the leak, and it means that even a single call to > read_next_command() will invalidate any held pointers, making it easier > to find bugs. In fact, we can drop the extra input lines added to the > test case by the previous commit, as the unfixed bug would now trigger > just from reading the commit message, even without any modified files in > the commit. > > Reported-by: Mike Hommey <mh@glandium.org> > Signed-off-by: Jeff King <peff@peff.net> > --- > fast-import.c | 4 +--- > t/t9300-fast-import.sh | 5 ----- > 2 files changed, 1 insertion(+), 8 deletions(-) > > diff --git a/fast-import.c b/fast-import.c > index ee7258037a..1f9160b645 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -1763,7 +1763,6 @@ static int read_next_command(void) > } else { > struct recent_command *rc; > > - strbuf_detach(&command_buf, NULL); > stdin_eof = strbuf_getline_lf(&command_buf, stdin); > if (stdin_eof) > return EOF; > @@ -1784,7 +1783,7 @@ static int read_next_command(void) > free(rc->buf); > } > > - rc->buf = command_buf.buf; > + rc->buf = xstrdup(command_buf.buf); You could xstrndup(command_buf.buf, command_buf.len), which would avoid a hidden strlen. Mike ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership 2019-08-25 10:02 ` Mike Hommey @ 2019-08-25 14:21 ` René Scharfe 2019-08-26 18:42 ` Jeff King 0 siblings, 1 reply; 15+ messages in thread From: René Scharfe @ 2019-08-25 14:21 UTC (permalink / raw) To: Mike Hommey, Jeff King; +Cc: Elijah Newren, git, gitster Am 25.08.19 um 12:02 schrieb Mike Hommey: > On Sun, Aug 25, 2019 at 04:10:55AM -0400, Jeff King wrote: >> diff --git a/fast-import.c b/fast-import.c >> index ee7258037a..1f9160b645 100644 >> --- a/fast-import.c >> +++ b/fast-import.c >> @@ -1763,7 +1763,6 @@ static int read_next_command(void) >> } else { >> struct recent_command *rc; >> >> - strbuf_detach(&command_buf, NULL); >> stdin_eof = strbuf_getline_lf(&command_buf, stdin); >> if (stdin_eof) >> return EOF; >> @@ -1784,7 +1783,7 @@ static int read_next_command(void) >> free(rc->buf); >> } >> >> - rc->buf = command_buf.buf; >> + rc->buf = xstrdup(command_buf.buf); > > You could xstrndup(command_buf.buf, command_buf.len), which would avoid > a hidden strlen. xstrndup() also searches for NUL, albeit with memchr(3). xmemdupz() would copy without checking. I suspect the simplicity of xstrdup() outweighs the benefits of the alternatives, but didn't do any measurements.. René ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership 2019-08-25 14:21 ` René Scharfe @ 2019-08-26 18:42 ` Jeff King 0 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2019-08-26 18:42 UTC (permalink / raw) To: René Scharfe; +Cc: Mike Hommey, Elijah Newren, git, gitster On Sun, Aug 25, 2019 at 04:21:54PM +0200, René Scharfe wrote: > > You could xstrndup(command_buf.buf, command_buf.len), which would avoid > > a hidden strlen. > > xstrndup() also searches for NUL, albeit with memchr(3). xmemdupz() > would copy without checking. > > I suspect the simplicity of xstrdup() outweighs the benefits of the > alternatives, but didn't do any measurements.. Yep. I actually started to write xmemdupz() originally then decided it was unnecessarily verbose and a premature optimization. I wondered after this exchange whether something like: char *strbuf_dup(const struct strbuf *sb) { return xmemdupz(sb->buf, sb->len); } would be a useful general helper. Grepping around it doesn't seem like there are a lot of candidates. If we really wanted to micro-optimize, we could have cmd_hist store strbufs, and then we could reuse the same buffers over and over without re-allocating. And use strbuf_addbuf(&cmd_hist.buf, &command_buf). :) -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] fast-import input string handling bugs 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King 2019-08-25 8:08 ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King 2019-08-25 8:10 ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King @ 2019-08-26 15:36 ` Junio C Hamano 2019-08-26 19:18 ` Elijah Newren 3 siblings, 0 replies; 15+ messages in thread From: Junio C Hamano @ 2019-08-26 15:36 UTC (permalink / raw) To: Jeff King; +Cc: Mike Hommey, Elijah Newren, git Jeff King <peff@peff.net> writes: > On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote: > >> And I think this is actually a real bug in the current code! We keep a >> pointer to the encoding string, which survives because of the history. >> But that history is bounded, and we could have an indefinite number of >> changed files in the middle. If I modify t9300 like this: > > Here are two patches. The first fixes the existing bug with "encoding", > and the second uses the approach I suggested to fix the leak you > noticed. > > The second one does carry a greater risk of regression than your patch, > but I think it's worth it for the fact that it makes any other bugs > (like the "encoding" one) more obvious. Yeah, it may be worth the risk, given that this is quite early in the cycle, so we have enough time to cook it in 'next' to see if somebody screams ;-) > > [1/2]: fast-import: duplicate parsed encoding string > [2/2]: fast-import: duplicate into history rather than passing ownership > > fast-import.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/2] fast-import input string handling bugs 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King ` (2 preceding siblings ...) 2019-08-26 15:36 ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano @ 2019-08-26 19:18 ` Elijah Newren 3 siblings, 0 replies; 15+ messages in thread From: Elijah Newren @ 2019-08-26 19:18 UTC (permalink / raw) To: Jeff King; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano On Sun, Aug 25, 2019 at 1:06 AM Jeff King <peff@peff.net> wrote: > > On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote: > > > And I think this is actually a real bug in the current code! We keep a > > pointer to the encoding string, which survives because of the history. > > But that history is bounded, and we could have an indefinite number of > > changed files in the middle. If I modify t9300 like this: > > Here are two patches. The first fixes the existing bug with "encoding", > and the second uses the approach I suggested to fix the leak you > noticed. > > The second one does carry a greater risk of regression than your patch, > but I think it's worth it for the fact that it makes any other bugs > (like the "encoding" one) more obvious. I agree, both patches look good to me, and I particularly appreciate some extra help to avoid making the same mistake again. :-) Just for good measure, I also went and tested these patches by running the git filter-repo testsuite and by re-running the filter-repo timing cases at https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/. While filter-repo only uses a subset of fast-export functionality, it tests it with a variety of weird and unusual tiny test repos. And the timing cases run on three real-world repositories (git, rails, and linux). Everything looks good on all of these testcases. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it. 2019-08-25 4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey 2019-08-25 6:57 ` Jeff King @ 2019-08-25 12:35 ` René Scharfe 1 sibling, 0 replies; 15+ messages in thread From: René Scharfe @ 2019-08-25 12:35 UTC (permalink / raw) To: Mike Hommey; +Cc: git, gitster Am 25.08.19 um 06:13 schrieb Mike Hommey: > command_buf.buf is also stored in cmd_hist, so instead of > strbuf_release, the current code uses strbuf_detach in order to > "leak" the buffer as far as the strbuf is concerned. Hmm. > However, strbuf_detach does more than "leak" the strbuf buffer: it > possibly reallocates it to ensure a terminating nul character. And when > that happens, what is already stored in cmd_hist may now be a free()d > buffer. > > In practice, though, command_buf.buf is most of the time a nul > terminated string, meaning command_buf.len < command_buf.alloc, and > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call), > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does > allocate a 1 byte buffer to store a nul character in it, which is then > leaked. And that's why a pointer to a strbuf buf is valid until the next strbuf_ function call. > > Since the code using strbuf_detach is assuming it does nothing to > command_buf.buf, it's overall safer to use strbuf_init, which has > the same practical effect in the usual case, and works appropriately > when command_buf is empty. > --- > fast-import.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/fast-import.c b/fast-import.c > index b44d6a467e..b1d07efe8c 100644 > --- a/fast-import.c > +++ b/fast-import.c > @@ -1763,7 +1763,9 @@ static int read_next_command(void) > } else { > struct recent_command *rc; > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); This is a no-op; strbuf_detach already re-initializes the strbuf. (And double-slash comments are avoided in Git code..) ((s/enther/either/)) > stdin_eof = strbuf_getline_lf(&command_buf, stdin); > if (stdin_eof) > return EOF; > @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) > char *term = xstrdup(data); > size_t term_len = command_buf.len - (data - command_buf.buf); > > - strbuf_detach(&command_buf, NULL); > + // command_buf is enther empty or also stored in cmd_hist, > + // reinitialize it. > + strbuf_init(&command_buf, 0); Same here. > for (;;) { > if (strbuf_getline_lf(&command_buf, stdin) == EOF) > die("EOF in data (terminator '%s' not found)", term); > strbuf_detach() is handing over ownership of the buffer. Its return value should be stored; grabbing the pointer beforehand is naughty because it can get stale, as you noted. I doubt there is a good reason for ignoring the return value of strbuf_detach(), ever. René ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2019-08-26 19:19 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-25 4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey 2019-08-25 6:57 ` Jeff King 2019-08-25 7:20 ` Mike Hommey 2019-08-25 7:28 ` Jeff King 2019-08-25 8:06 ` [PATCH 0/2] fast-import input string handling bugs Jeff King 2019-08-25 8:08 ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King 2019-08-26 18:28 ` Elijah Newren 2019-08-26 18:44 ` Jeff King 2019-08-25 8:10 ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King 2019-08-25 10:02 ` Mike Hommey 2019-08-25 14:21 ` René Scharfe 2019-08-26 18:42 ` Jeff King 2019-08-26 15:36 ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano 2019-08-26 19:18 ` Elijah Newren 2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).