git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Phillip Wood" <phillip.wood123@gmail.com>,
	"Thomas Bock" <bockthom@cs.uni-saarland.de>,
	"Derrick Stolee" <derrickstolee@github.com>,
	git@vger.kernel.org, "René Scharfe" <l.s.r@web.de>
Subject: [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases
Date: Thu, 27 Apr 2023 04:13:30 -0400
Message-ID: <20230427081330.GA1461786@coredump.intra.peff.net>
In-Reply-To: <xmqqildiveu6.fsf@gitster.g>

On Wed, Apr 26, 2023 at 08:32:49AM -0700, Junio C Hamano wrote:

> > Note that will exclude a few cases that we do allow now, like:
> >
> >   committer name <email> \v123456 +0000\n
> >
> > Right now that parses as "123456", but we'd reject it as "0" after such
> > a patch.
> 
> I would say that it is a good thing.
> 
> The only (somewhat) end-user controlled things on the line are the
> name and email, and even there name is sanitized to remove "crud".
> The user-supplied timestamp goes through date.c::parse_date(),
> ending up with what date.c::date_string() formats, so there will not
> be syntactically incorrect timestamp there.  So we can be strict
> format-wise on the timestamp field, once we identify where it begins,
> which is the point of scanning backwards for '>'.
> 
> Unless the user does "hash-object" and deliberately creates a
> malformed commit object---they can keep both halves just fine in
> such a case as long as we do reject such a timestamp correctly.

I think we'd ideally consider the behavior against hypothetical bugs in
other implementations (including us in the future!). So yeah, I don't
think we ever generated a syntactically incorrect timestamp, and it
would be hard for a user to create one. But all things being equal, I'd
prefer to keep parsing something like:

  committer name <email> 123456\n

which is missing its timezone (and seems like a plausible sort of bug).
But I'm OK to draw the line at "if your implementation is sticking
control characters in the header, then tough luck".
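
To make the accepted and rejected inputs concrete, here is a rough
standalone sketch of the rule described above. It is not the code from
the patch: the sketch_* names are hypothetical, sketch_isspace() is only
an assumed stand-in for our locale-independent isspace() (which does not
match vertical tab or formfeed), and strtoumax() stands in for
parse_timestamp().

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for git's isspace(): it deliberately does not
 * match vertical tab or formfeed, unlike the system strtoumax().
 */
static int sketch_isspace(char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

static int sketch_isdigit(char c)
{
	return c >= '0' && c <= '9';
}

/*
 * Parse the timestamp of a single ident line such as
 * "committer name <email> 123456 +0000\n". Returns 0 for anything
 * that does not start with a digit (or dash) after the whitespace.
 */
static uintmax_t sketch_parse_ident_date(const char *line)
{
	const char *eol = strchr(line, '\n');
	const char *dateptr;

	if (!eol)
		return 0;

	/* Scan backwards from the end of the line for the closing '>'. */
	for (dateptr = eol; dateptr > line; dateptr--)
		if (dateptr[-1] == '>')
			break;
	if (dateptr == line)
		return 0;

	/* Trim leading whitespace, never walking past the newline. */
	while (dateptr < eol && sketch_isspace(*dateptr))
		dateptr++;

	/* Insist on a number so the parser cannot skip past eol. */
	if (!sketch_isdigit(*dateptr) && *dateptr != '-')
		return 0;

	return strtoumax(dateptr, NULL, 10);
}
```

Under those assumptions, a line that is merely missing its timezone
still parses as 123456, while both the whitespace-only case and the
"\v123456" case come back as 0, even when a numeric subject line
follows the header.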

So here's a v3. I was tempted to add the fix on top of the existing
patch, since it's somewhat its own case, and could be explained
separately. But they really are two versions of the same problem, so I
just rolled it all into patch 3.

Patch 4 needed small updates to its comment to match. The first two
patches are the same.

  [1/4]: t4212: avoid putting git on left-hand side of pipe
  [2/4]: parse_commit(): parse timestamp from end of line
  [3/4]: parse_commit(): handle broken whitespace-only timestamp
  [4/4]: parse_commit(): describe more date-parsing failure modes

 commit.c               | 57 ++++++++++++++++++++++++++++++++++++------
 t/t4212-log-corrupt.sh | 51 +++++++++++++++++++++++++++++++++++--
 2 files changed, 98 insertions(+), 10 deletions(-)

1:  7a2fa8daac = 1:  57401571b6 t4212: avoid putting git on left-hand side of pipe
2:  d90c720075 = 2:  54fa983d66 parse_commit(): parse timestamp from end of line
3:  1a47c87c07 ! 3:  894815d82d parse_commit(): handle broken whitespace-only timestamp
    @@ Commit message
         line, as well as the "\n\n" separator, and mistake the subject for the
         timestamp.
     
    -    The new test demonstrates such a case. I also added a test to check this
    -    case against the pretty-print formatter, which uses split_ident_line().
    -    It's not subject to the same bug, because it insists that there be one
    -    or more digits in the timestamp.
    +    We can solve this by trimming the whitespace ourselves, making sure that
    +    it has some non-whitespace to parse. Note that we need to be a bit
    +    careful about the definition of "whitespace" here, as our isspace()
    +    doesn't match exotic characters like vertical tab or formfeed. We can
    +    work around that by checking for an actual number (see the in-code
    +    comment). This is slightly more restrictive than the current code, but
    +    in practice the results are either the same (we reject "foo" as "0", but
    +    so would parse_timestamp()) or extremely unlikely even for broken
    +    commits (parse_timestamp() would allow "\v123" as "123", but we'll now
    +    make it "0").
     
    +    I did also allow "-" here, which may be controversial, as we don't
    +    currently support negative timestamps. My reasoning was two-fold. One,
    +    the design of parse_timestamp() is such that we should be able to easily
    +    switch it to handling signed values, and this otherwise creates a
    +    hard-to-find gotcha that anybody doing that work would get tripped up
    +    on. And two, the status quo is that we currently parse them, though the
    +    result of course ends up as a very large unsigned value (which is likely
    +    to just get clamped to "0" for display anyway, since our date routines
    +    can't handle it).
    +
    +    The new test checks the commit parser (via "--until") for both vanilla
    +    spaces and the vertical-tab case. I also added a test to check these
    +    against the pretty-print formatter, which uses split_ident_line().  It's
    +    not subject to the same bug, because it already insists that there be
    +    one or more digits in the timestamp.
    +
    +    Helped-by: Phillip Wood <phillip.wood123@gmail.com>
         Signed-off-by: Jeff King <peff@peff.net>
     
      ## commit.c ##
    @@ commit.c: static timestamp_t parse_commit_date(const char *buf, const char *tail
      
     -	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
     +	/*
    -+	 * Trim leading whitespace; parse_timestamp() will do this itself, but
    -+	 * if we have _only_ whitespace, it will walk right past the newline
    -+	 * while doing so.
    ++	 * Trim leading whitespace, but make sure we have at least one
    ++	 * non-whitespace character, as parse_timestamp() will otherwise walk
    ++	 * right past the newline we found in "eol" when skipping whitespace
    ++	 * itself.
    ++	 *
    ++	 * In theory it would be sufficient to allow any character not matched
    ++	 * by isspace(), but there's a catch: our isspace() does not
    ++	 * necessarily match the behavior of parse_timestamp(), as the latter
    ++	 * is implemented by system routines which match more exotic control
    ++	 * codes, or even locale-dependent sequences.
    ++	 *
    ++	 * Since we expect the timestamp to be a number, we can check for that.
    ++	 * Anything else (e.g., a non-numeric token like "foo") would just
    ++	 * cause parse_timestamp() to return 0 anyway.
     +	 */
     +	while (dateptr < eol && isspace(*dateptr))
     +		dateptr++;
    -+	if (dateptr == eol)
    ++	if (!isdigit(*dateptr) && *dateptr != '-')
     +		return 0;
     +
     +	/*
    -+	 * We know there is at least one non-whitespace character, so we'll
    -+	 * begin parsing there and stop at worst case at eol.
    ++	 * We know there is at least one digit (or dash), so we'll begin
    ++	 * parsing there and stop at worst case at eol.
     +	 */
      	return parse_timestamp(dateptr, NULL, 10);
      }
    @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
      	git log -1 --format=%ad $commit
      '
      
    -+test_expect_success 'create commit with whitespace committer date' '
    ++test_expect_success 'create commits with whitespace committer dates' '
     +	# It is important that this subject line is numeric, since we want to
     +	# be sure we are not confused by skipping whitespace and accidentally
     +	# parsing the subject as a timestamp.
     +	#
     +	# Do not use munge_author_date here. Besides not hitting the committer
     +	# line, it leaves the timezone intact, and we want nothing but
     +	# whitespace.
    ++	#
    ++	# We will make two munged commits here. The first, ws_commit, will
    ++	# be purely spaces. The second contains a vertical tab, which is
    ++	# considered a space by strtoumax(), but not by our isspace().
     +	test_commit 1234567890 &&
     +	git cat-file commit HEAD >commit.orig &&
     +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
    -+	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
    ++	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
    ++	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
    ++	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
     +'
     +
     +test_expect_success '--until treats whitespace date as sentinel' '
     +	echo $ws_commit >expect &&
     +	git rev-list --until=1980-01-01 $ws_commit >actual &&
    ++	test_cmp expect actual &&
    ++
    ++	echo $vt_commit >expect &&
    ++	git rev-list --until=1980-01-01 $vt_commit >actual &&
     +	test_cmp expect actual
     +'
     +
    @@ t/t4212-log-corrupt.sh: test_expect_success 'absurdly far-in-future date' '
     +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
     +	echo : >expect &&
     +	git log -1 --format="%at:%ct" $ws_commit >actual &&
    ++	test_cmp expect actual &&
     ++	git log -1 --format="%at:%ct" $vt_commit >actual &&
     +	test_cmp expect actual
     +'
     +
4:  193a01a32a < -:  ---------- parse_commit(): describe more date-parsing failure modes
-:  ---------- > 4:  ff7e9ddc7c parse_commit(): describe more date-parsing failure modes
