git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/14] numparse module: systematically tighten up integer parsing
@ 2015-03-17 16:00 Michael Haggerty
  2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
                   ` (17 more replies)
  0 siblings, 18 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Sorry for the long email. Feel free to skip over the background info
and continue reading at "My solution" below.

This odyssey started with a typo:

    $ git shortlog -snl v2.2.0..v2.3.0
     14795      Junio C Hamano
      1658      Jeff King
      1400      Shawn O. Pearce
      1109      Linus Torvalds
       760      Jonathan Nieder
    [...]

It took me a minute to realize why so many commits are listed. It
turns out that "-l", which I added by accident, requires an integer
argument, but it happily swallows "v2.2.0..v2.3.0" without error. This
leaves no range argument, so the command traverses the entire history
of HEAD.

The relevant code is

        else if ((argcount = short_opt('l', av, &optarg))) {
                options->rename_limit = strtoul(optarg, NULL, 10);
                return argcount;
        }

, which is broken in many ways. First of all, strtoul() is way too
permissive for simple tasks like this:

* It allows leading whitespace.

* It allows arbitrary trailing characters.

* It allows a leading sign character ('+' or '-'), even though the
  result is unsigned.

* If the string doesn't contain a number at all, it sets its "endptr"
  argument to point at the start of the string but doesn't set errno.

* If the value is larger than fits in an unsigned long, it returns the
  value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).

* If the value is between -ULONG_MAX and 0, it returns the positive
  integer with the same bit pattern, without setting errno(!) (I can
  hardly think of a scenario where this would be useful.)

* For base=0 (autodetect base), it allows a base specifier prefix "0x"
  or "0" and parses the number accordingly. For base=16 it also allows
  a "0x" prefix.

strtol() has similar foibles.

The caller compounds the permissivity of strtoul() with further sins:

* It does absolutely no error detection.

* It assigns the return value, which is an unsigned long, to an int
  value. This could truncate the result, perhaps even resulting in
  rename_limit being set to a negative value.

When I looked around, I found scores of sites that call atoi(),
strtoul(), and strtol() carelessly. And it's understandable. Calling
any of these functions safely is so much work as to be completely
impractical in day-to-day code.

git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
try to make parsing integers a little bit easier. But they are only
used in a few places, they hardly restrict the pathological
flexibility of strtoul()/strtol(), strtoul_ui() doesn't implement
clamping consistently when converting from unsigned long to unsigned
int, and neither function can be used when the integer value *is*
followed by other characters.


My solution: the numparse module

So I hereby propose a new module, numparse, to make it easier to parse
integral values from strings in a convenient, safe, and flexible way.
The first commit introduces the new module, and subsequent commits
change a sample (a small fraction!) of call sites to use the new
functions. Consider it a proof-of-concept. If people are OK with this
approach, I will continue sending patches to fix other call sites. (I
already have another two dozen such patches in my repo).

Please see the docstrings in numparse.h in the first commit for
detailed API docs. Briefly, there are eight new functions:

    parse_{l,ul,i,ui}(const char *s, unsigned int flags,
                      T *result, char **endptr);
    convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);

The parse_*() functions are for parsing a number from the start of a
string; the convert_*() functions are for parsing a string that
consists of a single number. The flags argument selects not only the
base of the number, but also which of strtol()/strtoul()'s many
features should be allowed. The *_i() and *.ui() functions are for
parsing int and unsigned int values; they are careful about how they
truncate the corresponding long values. The functions all return 0 on
success and a negative number on error.

Here are a few examples of how these functions can be used (taken from
the header of numparse.h):

* Convert hexadecimal string s into an unsigned int. Die if there are
  any characters in s besides hexadecimal digits, or if the result
  exceeds the range of an unsigned int:

    if (convert_ui(s, 16, &result))
            die("...");

* Read a base-ten long number from the front of a string, allowing
  sign characters and setting endptr to point at any trailing
  characters:

    if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, &result, &endptr))
            die("...");

* Convert decimal string s into a signed int, but not allowing the
  string to contain a '+' or '-' prefix (and thereby indirectly
  ensuring that the result will be non-negative):

    if (convert_i(s, 10, &result))
            die("...");

* Convert s into a signed int, interpreting prefix "0x" to mean
  hexadecimal and "0" to mean octal. If the value doesn't fit in an
  unsigned int, set result to INT_MIN or INT_MAX.

    if (convert_i(s, NUM_SLOPPY, &result))
            die("...");

My main questions:

* Do people like the API? My main goal was to make these functions as
  painless as possible to use correctly, because there are so many
  call sites.

* Is it too gimmicky to encode the base together with other options in
  `flags`? (I think it is worth it to avoid the need for another
  parameter, which callers could easily put in the wrong order.)

* Am I making callers too strict? In many cases where a positive
  integer is expected (e.g., "--abbrev=<num>"), I have been replacing
  code like

      result = strtoul(s, NULL, 10);

  with

      if (convert_i(s, 10, &result))
              die("...");

  Strictly speaking, this is backwards incompatible, because the
  former allows things like

      --abbrev="+10"
      --abbrev="    10"
      --abbrev="10 bananas"
      --abbrev=-18446744073709551606

  (all equivalent to "--abbrev=10"), whereas the new code rejects all
  of them. It would be easy to change the new code to allow the first
  three, by adding flags NUM_PLUS, NUM_LEADING_WHITESPACE, or
  NUM_TRAILING respectively. But my inclination is to be strict
  (though I could easily be persuaded to permit the first one).

* Should I submit tiny patches, each rewriting a single call site, or
  make changes in larger chunks (say, file by file)? For example, in
  revision.c there is a function handle_revision_opt() that contains
  about 10 call sites that should be rewritten, for 10 different
  options that take integers. My gut feeling is that the rewrite of
  "--min-age=<value>" should be done in a different patch than that of
  "--min-parents=<value>", especially since the invocations might use
  different levels of parsing strictness that might be topics of
  discussion on the ML. On the other hand, I feel silly bombarding the
  list with tons of tiny patches.

* I couldn't think of any places where today's sloppy parsing could
  result in security vulnerabilities, but people should think about
  this, too. I would be especially wary of sites that call strtoul()
  and assign the result to an int, seemingly not considering that the
  result could end up negative.

These patches apply to "master". They are also available on my GitHub
repo [1].

Michael

[1] https://github.com/mhagger/git.git, branch "numparse1"

Michael Haggerty (14):
  numparse: new module for parsing integral numbers
  cacheinfo_callback(): use convert_ui() when handling "--cacheinfo"
  write_subdirectory(): use convert_ui() for parsing mode
  handle_revision_opt(): use skip_prefix() in many places
  handle_revision_opt(): use convert_i() when handling "-<digit>"
  strtoul_ui(), strtol_i(): remove functions
  handle_revision_opt(): use convert_ui() when handling "--abbrev="
  builtin_diff(): detect errors when parsing --unified argument
  opt_arg(): val is always non-NULL
  opt_arg(): use convert_i() in implementation
  opt_arg(): report errors parsing option values
  opt_arg(): simplify pointer handling
  diff_opt_parse(): use convert_i() when handling "-l<num>"
  diff_opt_parse(): use convert_i() when handling --abbrev=<num>

 Makefile                                  |   1 +
 builtin/update-index.c                    |   3 +-
 contrib/convert-objects/convert-objects.c |   3 +-
 diff.c                                    |  55 ++++----
 git-compat-util.h                         |  26 ----
 numparse.c                                | 180 ++++++++++++++++++++++++++
 numparse.h                                | 207 ++++++++++++++++++++++++++++++
 revision.c                                |  64 ++++-----
 8 files changed, 447 insertions(+), 92 deletions(-)
 create mode 100644 numparse.c
 create mode 100644 numparse.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 01/14] numparse: new module for parsing integral numbers
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-18 18:27   ` Eric Sunshine
  2015-03-20 17:51   ` Junio C Hamano
  2015-03-17 16:00 ` [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" Michael Haggerty
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Implement wrappers for strtol() and strtoul() that are safer and more
convenient to use.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 Makefile   |   1 +
 numparse.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 numparse.h | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 388 insertions(+)
 create mode 100644 numparse.c
 create mode 100644 numparse.h

diff --git a/Makefile b/Makefile
index 44f1dd1..6c0cfcc 100644
--- a/Makefile
+++ b/Makefile
@@ -732,6 +732,7 @@ LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
+LIB_OBJS += numparse.o
 LIB_OBJS += object.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
diff --git a/numparse.c b/numparse.c
new file mode 100644
index 0000000..90b44ce
--- /dev/null
+++ b/numparse.c
@@ -0,0 +1,180 @@
+#include "git-compat-util.h"
+#include "numparse.h"
+
+#define NUM_NEGATIVE (1 << 16)
+
+
+static int parse_precheck(const char *s, unsigned int *flags)
+{
+	const char *number;
+
+	if (isspace(*s)) {
+		if (!(*flags & NUM_LEADING_WHITESPACE))
+			return -NUM_LEADING_WHITESPACE;
+		do {
+			s++;
+		} while (isspace(*s));
+	}
+
+	if (*s == '+') {
+		if (!(*flags & NUM_PLUS))
+			return -NUM_PLUS;
+		number = s + 1;
+		*flags &= ~NUM_NEGATIVE;
+	} else if (*s == '-') {
+		if (!(*flags & NUM_MINUS))
+			return -NUM_MINUS;
+		number = s + 1;
+		*flags |= NUM_NEGATIVE;
+	} else {
+		number = s;
+		*flags &= ~NUM_NEGATIVE;
+	}
+
+	if (!(*flags & NUM_BASE_SPECIFIER)) {
+		int base = *flags & NUM_BASE_MASK;
+		if (base == 0) {
+			/* This is a pointless combination of options. */
+			die("BUG: base=0 specified without NUM_BASE_SPECIFIER");
+		} else if (base == 16 && starts_with(number, "0x")) {
+			/*
+			 * We want to treat this as zero terminated by
+			 * an 'x', whereas strtol()/strtoul() would
+			 * silently eat the "0x". We accomplish this
+			 * by treating it as a base 10 number:
+			 */
+			*flags = (*flags & ~NUM_BASE_MASK) | 10;
+		}
+	}
+	return 0;
+}
+
+int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
+{
+	long l;
+	const char *end;
+	int err = 0;
+
+	err = parse_precheck(s, &flags);
+	if (err)
+		return err;
+
+	/*
+	 * Now let strtol() do the heavy lifting:
+	 */
+	errno = 0;
+	l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
+	if (errno) {
+		if (errno == ERANGE) {
+			if (!(flags & NUM_SATURATE))
+				return -NUM_SATURATE;
+		} else {
+			return -NUM_OTHER_ERROR;
+		}
+	}
+	if (end == s)
+		return -NUM_NO_DIGITS;
+
+	if (*end && !(flags & NUM_TRAILING))
+		return -NUM_TRAILING;
+
+	/* Everything was OK */
+	*result = l;
+	if (endptr)
+		*endptr = (char *)end;
+	return 0;
+}
+
+int parse_ul(const char *s, unsigned int flags,
+	     unsigned long *result, char **endptr)
+{
+	unsigned long ul;
+	const char *end;
+	int err = 0;
+
+	err = parse_precheck(s, &flags);
+	if (err)
+		return err;
+
+	/*
+	 * Now let strtoul() do the heavy lifting:
+	 */
+	errno = 0;
+	ul = strtoul(s, (char **)&end, flags & NUM_BASE_MASK);
+	if (errno) {
+		if (errno == ERANGE) {
+			if (!(flags & NUM_SATURATE))
+				return -NUM_SATURATE;
+		} else {
+			return -NUM_OTHER_ERROR;
+		}
+	}
+	if (end == s)
+		return -NUM_NO_DIGITS;
+
+	/*
+	 * strtoul(), perversely, accepts negative numbers, converting
+	 * them to the positive number with the same bit pattern. We
+	 * don't ever want that.
+	 */
+	if ((flags & NUM_NEGATIVE) && ul) {
+		if (!(flags & NUM_SATURATE))
+			return -NUM_SATURATE;
+		ul = 0;
+	}
+
+	if (*end && !(flags & NUM_TRAILING))
+		return -NUM_TRAILING;
+
+	/* Everything was OK */
+	*result = ul;
+	if (endptr)
+		*endptr = (char *)end;
+	return 0;
+}
+
+int parse_i(const char *s, unsigned int flags, int *result, char **endptr)
+{
+	long l;
+	int err;
+	char *end;
+
+	err = parse_l(s, flags, &l, &end);
+	if (err)
+		return err;
+
+	if ((int)l == l)
+		*result = l;
+	else if (!(flags & NUM_SATURATE))
+		return -NUM_SATURATE;
+	else
+		*result = (l <= 0) ? INT_MIN : INT_MAX;
+
+	if (endptr)
+		*endptr = end;
+
+	return 0;
+}
+
+int parse_ui(const char *s, unsigned int flags, unsigned int *result, char **endptr)
+{
+	unsigned long ul;
+	int err;
+	char *end;
+
+	err = parse_ul(s, flags, &ul, &end);
+	if (err)
+		return err;
+
+	if ((unsigned int)ul == ul)
+		*result = ul;
+	else if (!(flags & NUM_SATURATE))
+		return -NUM_SATURATE;
+	else
+		*result = UINT_MAX;
+
+	if (endptr)
+		*endptr = end;
+
+	return 0;
+}
diff --git a/numparse.h b/numparse.h
new file mode 100644
index 0000000..4de5e10
--- /dev/null
+++ b/numparse.h
@@ -0,0 +1,207 @@
+#ifndef NUMPARSE_H
+#define NUMPARSE_H
+
+/*
+ * Functions for parsing integral numbers.
+ *
+ * strtol() and strtoul() are very flexible, in fact too flexible for
+ * many purposes. These functions wrap them to make them easier to use
+ * in a stricter way.
+ *
+ * There are two classes of function, parse_*() and convert_*(). The
+ * former try to read a number from the front of a string and report a
+ * pointer to the character following the number. The latter don't
+ * report the end of the number, and are meant to be used when the
+ * input string should contain only a single number, with no trailing
+ * characters.
+ *
+ * Each class of functions has four variants:
+ *
+ * - parse_l(), convert_l() -- parse long ints
+ * - parse_ul(), convert_ul() -- parse unsigned long ints
+ * - parse_i(), convert_i() -- parse ints
+ * - parse_ui(), convert_ui() -- parse unsigned ints
+ *
+ * The style of parsing is controlled by a flags argument which
+ * encodes both the base of the number and many other options. The
+ * base is encoded by its numerical value (2 <= base <= 36), or zero
+ * if it should be determined automatically based on whether the
+ * number has a "0x" or "0" prefix.
+ *
+ * The functions all return zero on success. On error, they return a
+ * negative integer indicating the first error that was detected. For
+ * example, if no sign characters were allowed but the string
+ * contained a '-', the function will return -NUM_MINUS. If there is
+ * any kind of error, *result and *endptr are unchanged.
+ *
+ * Examples:
+ *
+ * - Convert hexadecimal string s into an unsigned int. Die if there
+ *   are any characters in s besides hexadecimal digits, or if the
+ *   result exceeds the range of an unsigned int:
+ *
+ *     if (convert_ui(s, 16, &result))
+ *             die("...");
+ *
+ * - Read a base-ten long number from the front of a string, allowing
+ *   sign characters and setting endptr to point at any trailing
+ *   characters:
+ *
+ *     if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, &result, &endptr))
+ *             die("...");
+ *
+ * - Convert decimal string s into a signed int, but not allowing the
+ *   string to contain a '+' or '-' prefix (and thereby indirectly
+ *   ensuring that the result will be non-negative):
+ *
+ *     if (convert_i(s, 10, &result))
+ *             die("...");
+ *
+ * - Convert s into a signed int, interpreting prefix "0x" to mean
+ *   hexadecimal and "0" to mean octal. If the value doesn't fit in an
+ *   unsigned int, set result to INT_MIN or INT_MAX.
+ *
+ *     if (convert_i(s, NUM_SLOPPY, &result))
+ *             die("...");
+ */
+
+
+/*
+ * Constants for parsing numbers.
+ *
+ * These can be passed in flags to allow the specified features. Also,
+ * if there is an error parsing a number, the parsing functions return
+ * the negated value of one of these constants (or NUM_NO_DIGITS or
+ * NUM_OTHER_ERROR) to indicate the first error detected.
+ */
+
+/*
+ * The lowest 6 bits of flags hold the numerical base that should be
+ * used to parse the number, 2 <= base <= 36. If base is set to 0,
+ * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
+ * detected automatically from the string's prefix.
+ */
+#define NUM_BASE_MASK 0x3f
+
+/* Skip any whitespace before the number. */
+#define NUM_LEADING_WHITESPACE (1 << 8)
+
+/* Allow a leading '+'. */
+#define NUM_PLUS               (1 << 9)
+
+/* Allow a leading '-'. */
+#define NUM_MINUS              (1 << 10)
+
+/*
+ * Allow a leading base specifier:
+ * - If base is 0: a leading "0x" indicates base 16; a leading "0"
+ *   indicates base 8; otherwise, assume base 10.
+ * - If base is 16: a leading "0x" is allowed and skipped over.
+ */
+#define NUM_BASE_SPECIFIER     (1 << 11)
+
+/*
+ * If the number is not in the allowed range, return the smallest or
+ * largest representable value instead.
+ */
+#define NUM_SATURATE           (1 << 12)
+
+/*
+ * Just parse until the end of the number, ignoring any subsequent
+ * characters. If this option is not specified, then it is an error if
+ * the whole string cannot be parsed.
+ */
+#define NUM_TRAILING           (1 << 13)
+
+
+/* Additional errors that can come from parsing numbers: */
+
+/* There were no valid digits */
+#define NUM_NO_DIGITS          (1 << 14)
+/* There was some other error reported by strtol()/strtoul(): */
+#define NUM_OTHER_ERROR        (1 << 15)
+
+/*
+ * Please note that there is also a NUM_NEGATIVE, which is used
+ * internally.
+ */
+
+/*
+ * Now define some useful combinations of parsing options:
+ */
+
+/* A bunch of digits with an optional sign. */
+#define NUM_SIGN (NUM_PLUS | NUM_MINUS)
+
+/*
+ * Be as liberal as possible with the form of the number itself
+ * (though if you also want to allow leading whitespace and/or
+ * trailing characters, you should combine this with
+ * NUM_LEADING_WHITESPACE and/or NUM_TRAILING).
+ */
+#define NUM_SLOPPY (NUM_SIGN | NUM_SATURATE | NUM_BASE_SPECIFIER)
+
+
+/*
+ * Number parsing functions:
+ *
+ * The following functions parse a number (long, unsigned long, int,
+ * or unsigned int respectively) from the front of s, storing the
+ * value to *result and storing a pointer to the first character after
+ * the number to *endptr. flags specifies how the number should be
+ * parsed, including which base should be used. flags is a combination
+ * of the numerical base (2-36) and the NUM_* constants above (see).
+ * Return 0 on success or a negative value if there was an error. On
+ * failure, *result and *entptr are left unchanged.
+ *
+ * Please note that if NUM_TRAILING is not set, then it is
+ * nevertheless an error if there are any characters between the end
+ * of the number and the end of the string.
+ */
+
+int parse_l(const char *s, unsigned int flags,
+	    long *result, char **endptr);
+
+int parse_ul(const char *s, unsigned int flags,
+	     unsigned long *result, char **endptr);
+
+int parse_i(const char *s, unsigned int flags,
+	    int *result, char **endptr);
+
+int parse_ui(const char *s, unsigned int flags,
+	     unsigned int *result, char **endptr);
+
+
+/*
+ * Number conversion functions:
+ *
+ * The following functions parse a string into a number. They are
+ * identical to the parse_*() functions above, except that the endptr
+ * is not returned. These are most useful when parsing a whole string
+ * into a number; i.e., when (flags & NUM_TRAILING) is unset.
+ */
+static inline int convert_l(const char *s, unsigned int flags,
+			    long *result)
+{
+	return parse_l(s, flags, result, NULL);
+}
+
+static inline int convert_ul(const char *s, unsigned int flags,
+			     unsigned long *result)
+{
+	return parse_ul(s, flags, result, NULL);
+}
+
+static inline int convert_i(const char *s, unsigned int flags,
+			    int *result)
+{
+	return parse_i(s, flags, result, NULL);
+}
+
+static inline int convert_ui(const char *s, unsigned int flags,
+			     unsigned int *result)
+{
+	return parse_ui(s, flags, result, NULL);
+}
+
+#endif /* NUMPARSE_H */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo"
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
  2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode Michael Haggerty
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Use convert_ui() instead of strtoul_ui() to parse the <mode> argument.
This tightens up the parsing a bit:

* Leading whitespace is no longer allowed
* '+' and '-' are no longer allowed

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/update-index.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 5878986..845e944 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -4,6 +4,7 @@
  * Copyright (C) Linus Torvalds, 2005
  */
 #include "cache.h"
+#include "numparse.h"
 #include "lockfile.h"
 #include "quote.h"
 #include "cache-tree.h"
@@ -672,7 +673,7 @@ static int cacheinfo_callback(struct parse_opt_ctx_t *ctx,
 	}
 	if (ctx->argc <= 3)
 		return error("option 'cacheinfo' expects <mode>,<sha1>,<path>");
-	if (strtoul_ui(*++ctx->argv, 8, &mode) ||
+	if (convert_ui(*++ctx->argv, 8, &mode) ||
 	    get_sha1_hex(*++ctx->argv, sha1) ||
 	    add_cacheinfo(mode, sha1, *++ctx->argv, 0))
 		die("git update-index: --cacheinfo cannot add %s", *ctx->argv);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
  2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
  2015-03-17 16:00 ` [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places Michael Haggerty
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Use convert_ui() instead of strtoul_ui() when parsing tree entries'
modes. This tightens up the parsing a bit:

* Leading whitespace is no longer allowed
* '+' and '-' are no longer allowed

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 contrib/convert-objects/convert-objects.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/convert-objects/convert-objects.c b/contrib/convert-objects/convert-objects.c
index f3b57bf..4f484a4 100644
--- a/contrib/convert-objects/convert-objects.c
+++ b/contrib/convert-objects/convert-objects.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "numparse.h"
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
@@ -88,7 +89,7 @@ static int write_subdirectory(void *buffer, unsigned long size, const char *base
 		unsigned int mode;
 		char *slash, *origpath;
 
-		if (!path || strtoul_ui(buffer, 8, &mode))
+		if (!path || convert_ui(buffer, 8, &mode))
 			die("bad tree conversion");
 		mode = convert_mode(mode);
 		path++;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (2 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

This reduces the need for "magic numbers".

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 59 ++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/revision.c b/revision.c
index 66520c6..25838fc 100644
--- a/revision.c
+++ b/revision.c
@@ -1719,8 +1719,8 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->max_count = atoi(argv[1]);
 		revs->no_walk = 0;
 		return 2;
-	} else if (starts_with(arg, "-n")) {
-		revs->max_count = atoi(arg + 2);
+	} else if (skip_prefix(arg, "-n", &arg)) {
+		revs->max_count = atoi(arg);
 		revs->no_walk = 0;
 	} else if ((argcount = parse_long_opt("max-age", argv, &optarg))) {
 		revs->max_age = atoi(optarg);
@@ -1779,15 +1779,15 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 	} else if (!strcmp(arg, "--author-date-order")) {
 		revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
 		revs->topo_order = 1;
-	} else if (starts_with(arg, "--early-output")) {
+	} else if (skip_prefix(arg, "--early-output", &arg)) {
 		int count = 100;
-		switch (arg[14]) {
+		switch (*arg++) {
 		case '=':
-			count = atoi(arg+15);
+			count = atoi(arg);
 			/* Fallthrough */
 		case 0:
 			revs->topo_order = 1;
-		       revs->early_output = count;
+			revs->early_output = count;
 		}
 	} else if (!strcmp(arg, "--parents")) {
 		revs->rewrite_parents = 1;
@@ -1804,12 +1804,12 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->min_parents = 2;
 	} else if (!strcmp(arg, "--no-merges")) {
 		revs->max_parents = 1;
-	} else if (starts_with(arg, "--min-parents=")) {
-		revs->min_parents = atoi(arg+14);
+	} else if (skip_prefix(arg, "--min-parents=", &arg)) {
+		revs->min_parents = atoi(arg);
 	} else if (starts_with(arg, "--no-min-parents")) {
 		revs->min_parents = 0;
-	} else if (starts_with(arg, "--max-parents=")) {
-		revs->max_parents = atoi(arg+14);
+	} else if (skip_prefix(arg, "--max-parents=", &arg)) {
+		revs->max_parents = atoi(arg);
 	} else if (starts_with(arg, "--no-max-parents")) {
 		revs->max_parents = -1;
 	} else if (!strcmp(arg, "--boundary")) {
@@ -1891,26 +1891,27 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->verbose_header = 1;
 		revs->pretty_given = 1;
 		get_commit_format(NULL, revs);
-	} else if (starts_with(arg, "--pretty=") || starts_with(arg, "--format=")) {
+	} else if (skip_prefix(arg, "--pretty=", &arg) ||
+		   skip_prefix(arg, "--format=", &arg)) {
 		/*
 		 * Detached form ("--pretty X" as opposed to "--pretty=X")
 		 * not allowed, since the argument is optional.
 		 */
 		revs->verbose_header = 1;
 		revs->pretty_given = 1;
-		get_commit_format(arg+9, revs);
+		get_commit_format(arg, revs);
 	} else if (!strcmp(arg, "--show-notes") || !strcmp(arg, "--notes")) {
 		revs->show_notes = 1;
 		revs->show_notes_given = 1;
 		revs->notes_opt.use_default_notes = 1;
 	} else if (!strcmp(arg, "--show-signature")) {
 		revs->show_signature = 1;
-	} else if (!strcmp(arg, "--show-linear-break") ||
-		   starts_with(arg, "--show-linear-break=")) {
-		if (starts_with(arg, "--show-linear-break="))
-			revs->break_bar = xstrdup(arg + 20);
-		else
-			revs->break_bar = "                    ..........";
+	} else if (!strcmp(arg, "--show-linear-break")) {
+		revs->break_bar = "                    ..........";
+		revs->track_linear = 1;
+		revs->track_first_time = 1;
+	} else if (skip_prefix(arg, "--show-linear-break=", &arg)) {
+		revs->break_bar = xstrdup(arg);
 		revs->track_linear = 1;
 		revs->track_first_time = 1;
 	} else if (starts_with(arg, "--show-notes=") ||
@@ -1961,8 +1962,8 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->abbrev = 0;
 	} else if (!strcmp(arg, "--abbrev")) {
 		revs->abbrev = DEFAULT_ABBREV;
-	} else if (starts_with(arg, "--abbrev=")) {
-		revs->abbrev = strtoul(arg + 9, NULL, 10);
+	} else if (skip_prefix(arg, "--abbrev=", &arg)) {
+		revs->abbrev = strtoul(arg, NULL, 10);
 		if (revs->abbrev < MINIMUM_ABBREV)
 			revs->abbrev = MINIMUM_ABBREV;
 		else if (revs->abbrev > 40)
@@ -2112,20 +2113,20 @@ static int handle_revision_pseudo_opt(const char *submodule,
 	} else if ((argcount = parse_long_opt("exclude", argv, &optarg))) {
 		add_ref_exclusion(&revs->ref_excludes, optarg);
 		return argcount;
-	} else if (starts_with(arg, "--branches=")) {
+	} else if (skip_prefix(arg, "--branches=", &arg)) {
 		struct all_refs_cb cb;
 		init_all_refs_cb(&cb, revs, *flags);
-		for_each_glob_ref_in(handle_one_ref, arg + 11, "refs/heads/", &cb);
+		for_each_glob_ref_in(handle_one_ref, arg, "refs/heads/", &cb);
 		clear_ref_exclusion(&revs->ref_excludes);
-	} else if (starts_with(arg, "--tags=")) {
+	} else if (skip_prefix(arg, "--tags=", &arg)) {
 		struct all_refs_cb cb;
 		init_all_refs_cb(&cb, revs, *flags);
-		for_each_glob_ref_in(handle_one_ref, arg + 7, "refs/tags/", &cb);
+		for_each_glob_ref_in(handle_one_ref, arg, "refs/tags/", &cb);
 		clear_ref_exclusion(&revs->ref_excludes);
-	} else if (starts_with(arg, "--remotes=")) {
+	} else if (skip_prefix(arg, "--remotes=", &arg)) {
 		struct all_refs_cb cb;
 		init_all_refs_cb(&cb, revs, *flags);
-		for_each_glob_ref_in(handle_one_ref, arg + 10, "refs/remotes/", &cb);
+		for_each_glob_ref_in(handle_one_ref, arg, "refs/remotes/", &cb);
 		clear_ref_exclusion(&revs->ref_excludes);
 	} else if (!strcmp(arg, "--reflog")) {
 		add_reflogs_to_pending(revs, *flags);
@@ -2135,14 +2136,14 @@ static int handle_revision_pseudo_opt(const char *submodule,
 		*flags ^= UNINTERESTING | BOTTOM;
 	} else if (!strcmp(arg, "--no-walk")) {
 		revs->no_walk = REVISION_WALK_NO_WALK_SORTED;
-	} else if (starts_with(arg, "--no-walk=")) {
+	} else if (skip_prefix(arg, "--no-walk=", &arg)) {
 		/*
 		 * Detached form ("--no-walk X" as opposed to "--no-walk=X")
 		 * not allowed, since the argument is optional.
 		 */
-		if (!strcmp(arg + 10, "sorted"))
+		if (!strcmp(arg, "sorted"))
 			revs->no_walk = REVISION_WALK_NO_WALK_SORTED;
-		else if (!strcmp(arg + 10, "unsorted"))
+		else if (!strcmp(arg, "unsorted"))
 			revs->no_walk = REVISION_WALK_NO_WALK_UNSORTED;
 		else
 			return error("invalid argument to --no-walk");
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>"
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (3 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-19  6:34   ` Junio C Hamano
  2015-03-17 16:00 ` [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions Michael Haggerty
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

This tightens up the parsing a bit:

* Leading whitespace is no longer allowed
* '+' and '-' are no longer allowed

It also removes the need to check separately that max_count is
non-negative.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/revision.c b/revision.c
index 25838fc..4908e66 100644
--- a/revision.c
+++ b/revision.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "numparse.h"
 #include "tag.h"
 #include "blob.h"
 #include "tree.h"
@@ -1709,8 +1710,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		return argcount;
 	} else if ((*arg == '-') && isdigit(arg[1])) {
 		/* accept -<digit>, like traditional "head" */
-		if (strtol_i(arg + 1, 10, &revs->max_count) < 0 ||
-		    revs->max_count < 0)
+		if (convert_i(arg + 1, 10, &revs->max_count))
 			die("'%s': not a non-negative integer", arg + 1);
 		revs->no_walk = 0;
 	} else if (!strcmp(arg, "-n")) {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (4 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev=" Michael Haggerty
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Their callers have been changed to use the numparse module.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 git-compat-util.h | 26 --------------------------
 1 file changed, 26 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index a3095be..cbe7f16 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -751,32 +751,6 @@ static inline int sane_iscase(int x, int is_lower)
 		return (x & 0x20) == 0;
 }
 
-static inline int strtoul_ui(char const *s, int base, unsigned int *result)
-{
-	unsigned long ul;
-	char *p;
-
-	errno = 0;
-	ul = strtoul(s, &p, base);
-	if (errno || *p || p == s || (unsigned int) ul != ul)
-		return -1;
-	*result = ul;
-	return 0;
-}
-
-static inline int strtol_i(char const *s, int base, int *result)
-{
-	long ul;
-	char *p;
-
-	errno = 0;
-	ul = strtol(s, &p, base);
-	if (errno || *p || p == s || (int) ul != ul)
-		return -1;
-	*result = ul;
-	return 0;
-}
-
 #ifdef INTERNAL_QSORT
 void git_qsort(void *base, size_t nmemb, size_t size,
 	       int(*compar)(const void *, const void *));
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev="
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (5 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument Michael Haggerty
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

This adds error checking, where previously there was none. It also
disallows '+' and '-', leading whitespace, and trailing junk.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 4908e66..a6f7c2e 100644
--- a/revision.c
+++ b/revision.c
@@ -1963,7 +1963,8 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 	} else if (!strcmp(arg, "--abbrev")) {
 		revs->abbrev = DEFAULT_ABBREV;
 	} else if (skip_prefix(arg, "--abbrev=", &arg)) {
-		revs->abbrev = strtoul(arg, NULL, 10);
+		if (convert_ui(arg, 10, &revs->abbrev))
+			die("--abbrev requires a non-negative integer argument");
 		if (revs->abbrev < MINIMUM_ABBREV)
 			revs->abbrev = MINIMUM_ABBREV;
 		else if (revs->abbrev > 40)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (6 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev=" Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 09/14] opt_arg(): val is always non-NULL Michael Haggerty
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

The previous code used strtoul() without any checks that it succeeded.
Instead use convert_l(), in strict mode, and die() if there is an
error. This tightens up the parsing:

* Leading whitespace is no longer allowed
* '+' and '-' are no longer allowed
* Trailing junk is not allowed

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/diff.c b/diff.c
index abc32c8..a350677 100644
--- a/diff.c
+++ b/diff.c
@@ -2,6 +2,7 @@
  * Copyright (C) 2005 Junio C Hamano
  */
 #include "cache.h"
+#include "numparse.h"
 #include "quote.h"
 #include "diff.h"
 #include "diffcore.h"
@@ -2393,12 +2394,12 @@ static void builtin_diff(const char *name_a,
 			xecfg.flags |= XDL_EMIT_FUNCCONTEXT;
 		if (pe)
 			xdiff_set_find_func(&xecfg, pe->pattern, pe->cflags);
-		if (!diffopts)
-			;
-		else if (skip_prefix(diffopts, "--unified=", &v))
-			xecfg.ctxlen = strtoul(v, NULL, 10);
-		else if (skip_prefix(diffopts, "-u", &v))
-			xecfg.ctxlen = strtoul(v, NULL, 10);
+		if (diffopts
+		    && (skip_prefix(diffopts, "--unified=", &v) ||
+			skip_prefix(diffopts, "-u", &v))) {
+			if (convert_l(v, 10, &xecfg.ctxlen))
+				die("--unified argument must be a non-negative integer");
+		}
 		if (o->word_diff)
 			init_diff_words_data(&ecbdata, o, one, two);
 		xdi_diff_outf(&mf1, &mf2, fn_out_consume, &ecbdata,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 09/14] opt_arg(): val is always non-NULL
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (7 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 10/14] opt_arg(): use convert_i() in implementation Michael Haggerty
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

opt_arg() is never called with val set to NULL, so remove the code for
handling that eventuality (which anyway wasn't handled consistently in
the function).

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index a350677..6e3f498 100644
--- a/diff.c
+++ b/diff.c
@@ -3367,7 +3367,7 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 		c = *++arg;
 		if (!c)
 			return 1;
-		if (val && isdigit(c)) {
+		if (isdigit(c)) {
 			char *end;
 			int n = strtoul(arg, &end, 10);
 			if (*end)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/14] opt_arg(): use convert_i() in implementation
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (8 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 09/14] opt_arg(): val is always non-NULL Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 11/14] opt_arg(): report errors parsing option values Michael Haggerty
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

This shortens the code and avoids the old code's careless truncation
from unsigned long to int.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/diff.c b/diff.c
index 6e3f498..77c7acb 100644
--- a/diff.c
+++ b/diff.c
@@ -3366,16 +3366,10 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 	if (c == arg_short) {
 		c = *++arg;
 		if (!c)
-			return 1;
-		if (isdigit(c)) {
-			char *end;
-			int n = strtoul(arg, &end, 10);
-			if (*end)
-				return 0;
-			*val = n;
-			return 1;
-		}
-		return 0;
+			return 1; /* optional argument was missing */
+		if (convert_i(arg, 10, val))
+			return 0;
+		return 1;
 	}
 	if (c != '-')
 		return 0;
@@ -3384,16 +3378,10 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 	len = eq - arg;
 	if (!len || strncmp(arg, arg_long, len))
 		return 0;
-	if (*eq) {
-		int n;
-		char *end;
-		if (!isdigit(*++eq))
-			return 0;
-		n = strtoul(eq, &end, 10);
-		if (*end)
-			return 0;
-		*val = n;
-	}
+	if (!*eq)
+		return 1; /* '=' and optional argument were missing */
+	if (convert_i(eq + 1, 10, val))
+		return 0;
 	return 1;
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 11/14] opt_arg(): report errors parsing option values
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (9 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 10/14] opt_arg(): use convert_i() in implementation Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 12/14] opt_arg(): simplify pointer handling Michael Haggerty
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

If an argument is there, but it can't be parsed as a non-positive
number, then die() rather than returning 0.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/diff.c b/diff.c
index 77c7acb..03cdabf 100644
--- a/diff.c
+++ b/diff.c
@@ -3368,7 +3368,7 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 		if (!c)
 			return 1; /* optional argument was missing */
 		if (convert_i(arg, 10, val))
-			return 0;
+			die("The value for -%c must be a non-negative integer", arg_short);
 		return 1;
 	}
 	if (c != '-')
@@ -3381,7 +3381,7 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 	if (!*eq)
 		return 1; /* '=' and optional argument were missing */
 	if (convert_i(eq + 1, 10, val))
-		return 0;
+		die("The value for --%s must be a non-negative integer", arg_long);
 	return 1;
 }
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 12/14] opt_arg(): simplify pointer handling
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (10 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 11/14] opt_arg(): report errors parsing option values Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>" Michael Haggerty
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

Increment "arg" when a character is consumed, not just before consuming
the next character.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/diff.c b/diff.c
index 03cdabf..bc1e3c3 100644
--- a/diff.c
+++ b/diff.c
@@ -3358,14 +3358,13 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 	char c, *eq;
 	int len;
 
-	if (*arg != '-')
+	if (*arg++ != '-')
 		return 0;
-	c = *++arg;
+	c = *arg++;
 	if (!c)
 		return 0;
 	if (c == arg_short) {
-		c = *++arg;
-		if (!c)
+		if (!*arg)
 			return 1; /* optional argument was missing */
 		if (convert_i(arg, 10, val))
 			die("The value for -%c must be a non-negative integer", arg_short);
@@ -3373,7 +3372,6 @@ static int opt_arg(const char *arg, int arg_short, const char *arg_long, int *va
 	}
 	if (c != '-')
 		return 0;
-	arg++;
 	eq = strchrnul(arg, '=');
 	len = eq - arg;
 	if (!len || strncmp(arg, arg_long, len))
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>"
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (11 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 12/14] opt_arg(): simplify pointer handling Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

die() with an error message if the argument is not a non-negative
integer. This change tightens up parsing: '+' and '-', leading
whitespace, and trailing junk are all disallowed now.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index bc1e3c3..be389ae 100644
--- a/diff.c
+++ b/diff.c
@@ -3799,7 +3799,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 	else if (!strcmp(arg, "-z"))
 		options->line_termination = 0;
 	else if ((argcount = short_opt('l', av, &optarg))) {
-		options->rename_limit = strtoul(optarg, NULL, 10);
+		if (convert_i(optarg, 10, &options->rename_limit))
+			die("-l requires a non-negative integer argument");
 		return argcount;
 	}
 	else if ((argcount = short_opt('S', av, &optarg))) {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num>
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (12 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>" Michael Haggerty
@ 2015-03-17 16:00 ` Michael Haggerty
  2015-03-19  6:37   ` Junio C Hamano
  2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 16:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Michael Haggerty

die() with an error message if the argument is not a non-negative
integer. This change tightens up parsing: '+' and '-', leading
whitespace, and trailing junk are all disallowed now.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 diff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index be389ae..1cc5428 100644
--- a/diff.c
+++ b/diff.c
@@ -3830,7 +3830,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 	else if (!strcmp(arg, "--abbrev"))
 		options->abbrev = DEFAULT_ABBREV;
 	else if (skip_prefix(arg, "--abbrev=", &arg)) {
-		options->abbrev = strtoul(arg, NULL, 10);
+		if (convert_i(arg, 10, &options->abbrev))
+			die("--abbrev requires an integer argument");
 		if (options->abbrev < MINIMUM_ABBREV)
 			options->abbrev = MINIMUM_ABBREV;
 		else if (40 < options->abbrev)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (13 preceding siblings ...)
  2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
@ 2015-03-17 18:48 ` Junio C Hamano
  2015-03-17 19:46   ` Michael Haggerty
  2015-03-17 23:05 ` Duy Nguyen
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2015-03-17 18:48 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> When I looked around, I found scores of sites that call atoi(),
> strtoul(), and strtol() carelessly. And it's understandable. Calling
> any of these functions safely is so much work as to be completely
> impractical in day-to-day code.
>
> git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
> try to make parsing integers a little bit easier.

Exactly.  They were introduced to prevent sloppy callers from
passing NULL to the &end parameter to underlying strtoul(3).  The
first thing that came to my mind while reading your message up to
this point was "why not use them more, tighten them, adding
different variants if necessary, instead of introducing an entirely
new set of functions in a new namespace?"

For example:

> * Am I making callers too strict? In many cases where a positive
>   integer is expected (e.g., "--abbrev=<num>"), I have been replacing
>   code like
>
>       result = strtoul(s, NULL, 10);
>
>   with
>
>       if (convert_i(s, 10, &result))
>               die("...");

I think strictness would be good and those who did "--abbrev='  20'"
deserve what they get from the additional strictness, but 

	if (strtol_i(s, 10, &result))

would have been more familiar.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
@ 2015-03-17 19:46   ` Michael Haggerty
  2015-03-19  6:31     ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-17 19:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On 03/17/2015 07:48 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> When I looked around, I found scores of sites that call atoi(),
>> strtoul(), and strtol() carelessly. And it's understandable. Calling
>> any of these functions safely is so much work as to be completely
>> impractical in day-to-day code.
>>
>> git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
>> try to make parsing integers a little bit easier.
> 
> Exactly.  They were introduced to prevent sloppy callers from
> passing NULL to the &end parameter to underlying strtoul(3).  The
> first thing that came to my mind while reading your message up to
> this point was "why not use them more, tighten them, adding
> different variants if necessary, instead of introducing an entirely
> new set of functions in a new namespace?"

Here were my thoughts:

* I wanted to change the interface to look less like
  strtol()/strtoul(), so it seemed appropriate to make the names
  more dissimilar.

* The functions seemed long enough that they shouldn't be inline,
  and I wanted to put them in their own module rather than put
  them in git-compat-util.h.

* It wasn't obvious to me how to generalize the names, strtoul_ui()
  and strtol_i(), to the eight functions that I wanted.

That being said, I'm not married to the names. Suggestions are
welcome--but we would need names for 8 functions, not 4 [1].

Michael

> For example:
> 
>> * Am I making callers too strict? In many cases where a positive
>>   integer is expected (e.g., "--abbrev=<num>"), I have been replacing
>>   code like
>>
>>       result = strtoul(s, NULL, 10);
>>
>>   with
>>
>>       if (convert_i(s, 10, &result))
>>               die("...");
> 
> I think strictness would be good and those who did "--abbrev='  20'"
> deserve what they get from the additional strictness, but 
> 
> 	if (strtol_i(s, 10, &result))
> 
> would have been more familiar.

[1] It could be that when we're done, it will turn out that some of the
eight variants are not needed. If so, we can delete them then.

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (14 preceding siblings ...)
  2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
@ 2015-03-17 23:05 ` Duy Nguyen
  2015-03-18  9:47   ` Michael Haggerty
  2015-03-19  5:26 ` Jeff King
  2015-03-19  6:22 ` Junio C Hamano
  17 siblings, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2015-03-17 23:05 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Jeff King, Git Mailing List

On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Michael Haggerty (14):
>   numparse: new module for parsing integral numbers
>   cacheinfo_callback(): use convert_ui() when handling "--cacheinfo"
>   write_subdirectory(): use convert_ui() for parsing mode
>   handle_revision_opt(): use skip_prefix() in many places
>   handle_revision_opt(): use convert_i() when handling "-<digit>"
>   strtoul_ui(), strtol_i(): remove functions
>   handle_revision_opt(): use convert_ui() when handling "--abbrev="
>   builtin_diff(): detect errors when parsing --unified argument
>   opt_arg(): val is always non-NULL
>   opt_arg(): use convert_i() in implementation
>   opt_arg(): report errors parsing option values
>   opt_arg(): simplify pointer handling
>   diff_opt_parse(): use convert_i() when handling "-l<num>"
>   diff_opt_parse(): use convert_i() when handling --abbrev=<num>

Thank you for doing it. I was about to write another number parser and
you did it :D Maybe you can add another patch to convert the only
strtol in upload-pack.c to parse_ui. This place should accept positive
number in base 10, plus sign is not accepted.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 23:05 ` Duy Nguyen
@ 2015-03-18  9:47   ` Michael Haggerty
  2015-03-18  9:58     ` Duy Nguyen
  2015-03-18 10:03     ` Jeff King
  0 siblings, 2 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-18  9:47 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Jeff King, Git Mailing List

On 03/18/2015 12:05 AM, Duy Nguyen wrote:
> On Tue, Mar 17, 2015 at 11:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Michael Haggerty (14):
>>   numparse: new module for parsing integral numbers
>>   cacheinfo_callback(): use convert_ui() when handling "--cacheinfo"
>>   write_subdirectory(): use convert_ui() for parsing mode
>>   handle_revision_opt(): use skip_prefix() in many places
>>   handle_revision_opt(): use convert_i() when handling "-<digit>"
>>   strtoul_ui(), strtol_i(): remove functions
>>   handle_revision_opt(): use convert_ui() when handling "--abbrev="
>>   builtin_diff(): detect errors when parsing --unified argument
>>   opt_arg(): val is always non-NULL
>>   opt_arg(): use convert_i() in implementation
>>   opt_arg(): report errors parsing option values
>>   opt_arg(): simplify pointer handling
>>   diff_opt_parse(): use convert_i() when handling "-l<num>"
>>   diff_opt_parse(): use convert_i() when handling --abbrev=<num>
> 
> Thank you for doing it. I was about to write another number parser and
> you did it :D Maybe you can add another patch to convert the only
> strtol in upload-pack.c to parse_ui. This place should accept positive
> number in base 10, plus sign is not accepted.

If the general direction of this patch series is accepted, I'll
gradually try to go through the codebase, replacing *all* integer
parsing with these functions. So there's no need to request particular
callers of strtol()/strtoul() to be converted; I'll get to them all
sooner or later (I hope).

But in case you have some reason that you want upload-pack.c to be
converted right away, I just pushed that change (plus some related
cleanups) to my GitHub repo [1]. The branch depends only on the first
patch of the "numparse" patch series.

By the way, some other packet line parsing code in that file doesn't
verify that there are no trailing characters on the lines that they
process. That might be another thing that should be tightened up.

Michael

[1] https://github.com/mhagger/git.git, branch "upload-pack-numparse"

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-18  9:47   ` Michael Haggerty
@ 2015-03-18  9:58     ` Duy Nguyen
  2015-03-18 10:03     ` Jeff King
  1 sibling, 0 replies; 43+ messages in thread
From: Duy Nguyen @ 2015-03-18  9:58 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Jeff King, Git Mailing List

On Wed, Mar 18, 2015 at 4:47 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Thank you for doing it. I was about to write another number parser and
>> you did it :D Maybe you can add another patch to convert the only
>> strtol in upload-pack.c to parse_ui. This place should accept positive
>> number in base 10, plus sign is not accepted.
>
> If the general direction of this patch series is accepted, I'll
> gradually try to go through the codebase, replacing *all* integer
> parsing with these functions. So there's no need to request particular
> callers of strtol()/strtoul() to be converted; I'll get to them all
> sooner or later (I hope).

Good to know. No there's no hurry in converting this particular place.
It's just tightening things up, not bug fixing or anything.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-18  9:47   ` Michael Haggerty
  2015-03-18  9:58     ` Duy Nguyen
@ 2015-03-18 10:03     ` Jeff King
  2015-03-18 10:20       ` Michael Haggerty
  1 sibling, 1 reply; 43+ messages in thread
From: Jeff King @ 2015-03-18 10:03 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote:

> But in case you have some reason that you want upload-pack.c to be
> converted right away, I just pushed that change (plus some related
> cleanups) to my GitHub repo [1]. The branch depends only on the first
> patch of the "numparse" patch series.
> 
> By the way, some other packet line parsing code in that file doesn't
> verify that there are no trailing characters on the lines that they
> process. That might be another thing that should be tightened up.

Do you mean that upload-pack gets a pkt-line of length N that contains a
line of length M, and then doesn't check that M==N?  We use the space
between M and N for passing capabilities and other metadata around.

Or do you mean that we see lines like:

  want [0-9a-f]{40} ...\n

and do not bother looking at the "..." that comes after the data we
expect? That I can believe, and I don't think it would hurt to tighten
up (we shouldn't need it for extensibility, as anybody trying to stick
extra data there should do so only after using a capability flag earlier
in the protocol).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-18 10:03     ` Jeff King
@ 2015-03-18 10:20       ` Michael Haggerty
  0 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-18 10:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

On 03/18/2015 11:03 AM, Jeff King wrote:
> On Wed, Mar 18, 2015 at 10:47:40AM +0100, Michael Haggerty wrote:
> 
>> But in case you have some reason that you want upload-pack.c to be
>> converted right away, I just pushed that change (plus some related
>> cleanups) to my GitHub repo [1]. The branch depends only on the first
>> patch of the "numparse" patch series.
>>
>> By the way, some other packet line parsing code in that file doesn't
>> verify that there are no trailing characters on the lines that they
>> process. That might be another thing that should be tightened up.
> 
> Do you mean that upload-pack gets a pkt-line of length N that contains a
> line of length M, and then doesn't check that M==N?  We use the space
> between M and N for passing capabilities and other metadata around.
> 
> Or do you mean that we see lines like:
> 
>   want [0-9a-f]{40} ...\n
> 
> and do not bother looking at the "..." that comes after the data we
> expect? That I can believe, and I don't think it would hurt to tighten
> up (we shouldn't need it for extensibility, as anybody trying to stick
> extra data there should do so only after using a capability flag earlier
> in the protocol).

The latter. For example here [1], the "have" command and its SHA-1 are
read from the line, but I don't see a check that there are no characters
after the SHA-1. The same here [2].

Michael

[1]
https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L404-L429
[2]
https://github.com/gitster/git/blob/9ab698f4000a736864c41f57fbae1e021ac27799/upload-pack.c#L550-L565

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] numparse: new module for parsing integral numbers
  2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
@ 2015-03-18 18:27   ` Eric Sunshine
  2015-03-18 22:47     ` Michael Haggerty
  2015-03-20 17:51   ` Junio C Hamano
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Sunshine @ 2015-03-18 18:27 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Jeff King, Git List

On Tuesday, March 17, 2015, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Implement wrappers for strtol() and strtoul() that are safer and more
> convenient to use.
>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
> diff --git a/numparse.c b/numparse.c
> new file mode 100644
> index 0000000..90b44ce
> --- /dev/null
> +++ b/numparse.c
> @@ -0,0 +1,180 @@
> +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
> +{
> +       long l;
> +       const char *end;
> +       int err = 0;
> +
> +       err = parse_precheck(s, &flags);
> +       if (err)
> +               return err;
> +       /*
> +        * Now let strtol() do the heavy lifting:
> +        */

Perhaps use a /* one-line style comment */ to reduce vertical space
consumption a bit, thus make it (very slightly) easier to run the eye
over the code.

> +       errno = 0;
> +       l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
> +       if (errno) {
> +               if (errno == ERANGE) {
> +                       if (!(flags & NUM_SATURATE))
> +                               return -NUM_SATURATE;
> +               } else {
> +                       return -NUM_OTHER_ERROR;
> +               }
> +       }

Would it reduce cognitive load slightly (and reduce vertical space
consumption) to rephrase the conditionals as:

    if (errno == ERANGE && !(flags & NUM_SATURATE))
        return -NUM_SATURATE;

    if (errno && errno != ERANGE)
        return -NUM_OTHER_ERROR;

or something similar?

More below.

> +       if (end == s)
> +               return -NUM_NO_DIGITS;
> +
> +       if (*end && !(flags & NUM_TRAILING))
> +               return -NUM_TRAILING;
> +
> +       /* Everything was OK */
> +       *result = l;
> +       if (endptr)
> +               *endptr = (char *)end;
> +       return 0;
> +}
> diff --git a/numparse.h b/numparse.h
> new file mode 100644
> index 0000000..4de5e10
> --- /dev/null
> +++ b/numparse.h
> @@ -0,0 +1,207 @@
> +#ifndef NUMPARSE_H
> +#define NUMPARSE_H
> +
> +/*
> + * Functions for parsing integral numbers.
> + *
> + * Examples:
> + *
> + * - Convert s into a signed int, interpreting prefix "0x" to mean
> + *   hexadecimal and "0" to mean octal. If the value doesn't fit in an
> + *   unsigned int, set result to INT_MIN or INT_MAX.

Did you mean s/unsigned int/signed int/ ?

> + *
> + *     if (convert_i(s, NUM_SLOPPY, &result))
> + *             die("...");
> + */
> +
> +/*
> + * The lowest 6 bits of flags hold the numerical base that should be
> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
> + * detected automatically from the string's prefix.

Does this restriction go against the goal of making these functions
convenient, even while remaining strict? Is there a strong reason for
not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
make it consistent with strto*l() without (I think) introducing any
ambiguity.

> + */
> +/*
> + * Number parsing functions:
> + *
> + * The following functions parse a number (long, unsigned long, int,
> + * or unsigned int respectively) from the front of s, storing the
> + * value to *result and storing a pointer to the first character after
> + * the number to *endptr. flags specifies how the number should be
> + * parsed, including which base should be used. flags is a combination
> + * of the numerical base (2-36) and the NUM_* constants above (see).

"(see)" what?

> + * Return 0 on success or a negative value if there was an error. On
> + * failure, *result and *entptr are left unchanged.
> + *
> + * Please note that if NUM_TRAILING is not set, then it is
> + * nevertheless an error if there are any characters between the end
> + * of the number and the end of the string.

Again, on the subject of convenience, why this restriction? The stated
purpose of the parse_*() functions is to parse the number from the
front of the string and return a pointer to the first non-numeric
character following. As  a reader of this API, I interpret that as
meaning that NUM_TRAILING is implied. Is there a strong reason for not
inferring NUM_TRAILING for the parse_*() functions at the API level?
(I realize that the convert_*() functions are built atop parse_*(),
but that's an implementation detail.)

> + */
> +
> +int parse_l(const char *s, unsigned int flags,
> +           long *result, char **endptr);

Do we want to perpetuate the ugly (char **) convention for 'endptr'
from strto*l()? Considering that the incoming string is const, it
seems undesirable to return a non-const pointer to some place inside
that string.

> +/*
> + * Number conversion functions:
> + *
> + * The following functions parse a string into a number. They are
> + * identical to the parse_*() functions above, except that the endptr
> + * is not returned. These are most useful when parsing a whole string
> + * into a number; i.e., when (flags & NUM_TRAILING) is unset.

I can formulate arguments for allowing or disallowing NUM_TRAILING
with convert_*(), however, given their purpose of parsing the entire
string into a number, as a reader of the API, I would kind of expect
the convert_*() functions to ensure that NUM_TRAILING is not set
(either by forcibly clearing it or erroring out as an inconsistent
request if it is set).

> + */
> +static inline int convert_l(const char *s, unsigned int flags,
> +                           long *result)
> +{
> +       return parse_l(s, flags, result, NULL);
> +}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] numparse: new module for parsing integral numbers
  2015-03-18 18:27   ` Eric Sunshine
@ 2015-03-18 22:47     ` Michael Haggerty
  2015-03-20  8:54       ` Eric Sunshine
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-18 22:47 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, Jeff King, Git List

On 03/18/2015 07:27 PM, Eric Sunshine wrote:
> On Tuesday, March 17, 2015, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Implement wrappers for strtol() and strtoul() that are safer and more
>> convenient to use.
>>
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
>> diff --git a/numparse.c b/numparse.c
>> new file mode 100644
>> index 0000000..90b44ce
>> --- /dev/null
>> +++ b/numparse.c
>> @@ -0,0 +1,180 @@
>> +int parse_l(const char *s, unsigned int flags, long *result, char **endptr)
>> +{
>> +       long l;
>> +       const char *end;
>> +       int err = 0;
>> +
>> +       err = parse_precheck(s, &flags);
>> +       if (err)
>> +               return err;
>> +       /*
>> +        * Now let strtol() do the heavy lifting:
>> +        */
> 
> Perhaps use a /* one-line style comment */ to reduce vertical space
> consumption a bit, thus make it (very slightly) easier to run the eye
> over the code.

Yes, will change.

>> +       errno = 0;
>> +       l = strtol(s, (char **)&end, flags & NUM_BASE_MASK);
>> +       if (errno) {
>> +               if (errno == ERANGE) {
>> +                       if (!(flags & NUM_SATURATE))
>> +                               return -NUM_SATURATE;
>> +               } else {
>> +                       return -NUM_OTHER_ERROR;
>> +               }
>> +       }
> 
> Would it reduce cognitive load slightly (and reduce vertical space
> consumption) to rephrase the conditionals as:
> 
>     if (errno == ERANGE && !(flags & NUM_SATURATE))
>         return -NUM_SATURATE;
> 
>     if (errno && errno != ERANGE)
>         return -NUM_OTHER_ERROR;
> 
> or something similar?

The most common case by far should be that errno is zero. The code as
written only needs one test for that case, whereas your code needs two
tests. I think it is worth compromising on code clarity a tiny bit to
avoid the extra test.

> More below.
> 
>> +       if (end == s)
>> +               return -NUM_NO_DIGITS;
>> +
>> +       if (*end && !(flags & NUM_TRAILING))
>> +               return -NUM_TRAILING;
>> +
>> +       /* Everything was OK */
>> +       *result = l;
>> +       if (endptr)
>> +               *endptr = (char *)end;
>> +       return 0;
>> +}
>> diff --git a/numparse.h b/numparse.h
>> new file mode 100644
>> index 0000000..4de5e10
>> --- /dev/null
>> +++ b/numparse.h
>> @@ -0,0 +1,207 @@
>> +#ifndef NUMPARSE_H
>> +#define NUMPARSE_H
>> +
>> +/*
>> + * Functions for parsing integral numbers.
>> + *
>> + * Examples:
>> + *
>> + * - Convert s into a signed int, interpreting prefix "0x" to mean
>> + *   hexadecimal and "0" to mean octal. If the value doesn't fit in an
>> + *   unsigned int, set result to INT_MIN or INT_MAX.
> 
> Did you mean s/unsigned int/signed int/ ?

Thanks for catching this. I will fix it.

>> + *
>> + *     if (convert_i(s, NUM_SLOPPY, &result))
>> + *             die("...");
>> + */
>> +
>> +/*
>> + * The lowest 6 bits of flags hold the numerical base that should be
>> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
>> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
>> + * detected automatically from the string's prefix.
> 
> Does this restriction go against the goal of making these functions
> convenient, even while remaining strict? Is there a strong reason for
> not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
> make it consistent with strto*l() without (I think) introducing any
> ambiguity.

I thought about doing this. If it were possible to eliminate
NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
namely, that an "0x" prefix, if present, is consumed. So

    parse_i("0xff", 16 | NUM_BASE_SPECIFIER, &result, &endptr)

gives result==255 and endptr==s+4, whereas

    parse_i("0xff", 16, &result, &endptr)

gives result==0 and entptr==s+1 (it treats the "x" as the end of the
string).

We could forgo that feature, effectively allowing a base specifier if
and only if base==0. But I didn't want to rule out allowing an optional
base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
dispensed with entirely.

If you agree with that, then the remaining question is: which policy is
less error-prone? My thinking was that forcing the caller to specify
NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
reminder that the two features are intertwined. Because another
imaginable policy (arguably more consistent with the policy for base!=0)
would be that

    convert_i(s, 0, &result)

, because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
prefix, and therefore indirectly only allows base-10 numbers.

But I don't feel strongly about this.

>> + */
>> +/*
>> + * Number parsing functions:
>> + *
>> + * The following functions parse a number (long, unsigned long, int,
>> + * or unsigned int respectively) from the front of s, storing the
>> + * value to *result and storing a pointer to the first character after
>> + * the number to *endptr. flags specifies how the number should be
>> + * parsed, including which base should be used. flags is a combination
>> + * of the numerical base (2-36) and the NUM_* constants above (see).
> 
> "(see)" what?

That was meant to be a pointer to the documentation for the NUM_*
constants. But obviously it was too oblique, so I will make it more
explicit, maybe

    ... and the NUM_* constants documented above.

>> + * Return 0 on success or a negative value if there was an error. On
>> + * failure, *result and *entptr are left unchanged.
>> + *
>> + * Please note that if NUM_TRAILING is not set, then it is
>> + * nevertheless an error if there are any characters between the end
>> + * of the number and the end of the string.
> 
> Again, on the subject of convenience, why this restriction? The stated
> purpose of the parse_*() functions is to parse the number from the
> front of the string and return a pointer to the first non-numeric
> character following. As  a reader of this API, I interpret that as
> meaning that NUM_TRAILING is implied. Is there a strong reason for not
> inferring NUM_TRAILING for the parse_*() functions at the API level?
> (I realize that the convert_*() functions are built atop parse_*(),
> but that's an implementation detail.)

Yes, I'd also thought about that change:

* Make NUM_TRAILING private.
* Make the current parse_*() functions private.
* Add new parse_*(s, flags, result, endptr) functions that imply
NUM_TRAILING (and maybe they should *require* a non-null endptr argument).
* Change the convert_*() to not allow the NUM_TRAILING flag.

This would add a little bit of code, so I didn't do it originally. But
since you seem to like the idea too, I guess I will make the change.

>> + */
>> +
>> +int parse_l(const char *s, unsigned int flags,
>> +           long *result, char **endptr);
> 
> Do we want to perpetuate the ugly (char **) convention for 'endptr'
> from strto*l()? Considering that the incoming string is const, it
> seems undesirable to return a non-const pointer to some place inside
> that string.

Yes, I guess we could do some internal casting and expose endptr as
(const char **). It would make it a bit harder if the user actually
wants to pass a non-const string to the function and then modify the
string via the returned endptr, but I haven't run into that pattern yet
so let's make the change you suggested.

>> +/*
>> + * Number conversion functions:
>> + *
>> + * The following functions parse a string into a number. They are
>> + * identical to the parse_*() functions above, except that the endptr
>> + * is not returned. These are most useful when parsing a whole string
>> + * into a number; i.e., when (flags & NUM_TRAILING) is unset.
> 
> I can formulate arguments for allowing or disallowing NUM_TRAILING
> with convert_*(), however, given their purpose of parsing the entire
> string into a number, as a reader of the API, I would kind of expect
> the convert_*() functions to ensure that NUM_TRAILING is not set
> (either by forcibly clearing it or erroring out as an inconsistent
> request if it is set).

Yes, I think I will change this, as discussed above.

>> + */
>> +static inline int convert_l(const char *s, unsigned int flags,
>> +                           long *result)
>> +{
>> +       return parse_l(s, flags, result, NULL);
>> +}

Thanks for your careful and thoughtful comments!
Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (15 preceding siblings ...)
  2015-03-17 23:05 ` Duy Nguyen
@ 2015-03-19  5:26 ` Jeff King
  2015-03-19  6:41   ` Junio C Hamano
                     ` (2 more replies)
  2015-03-19  6:22 ` Junio C Hamano
  17 siblings, 3 replies; 43+ messages in thread
From: Jeff King @ 2015-03-19  5:26 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Jeff King, git

On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote:

> My main questions:
> 
> * Do people like the API? My main goal was to make these functions as
>   painless as possible to use correctly, because there are so many
>   call sites.
> 
> * Is it too gimmicky to encode the base together with other options in
>   `flags`? (I think it is worth it to avoid the need for another
>   parameter, which callers could easily put in the wrong order.)

I definitely like the overall direction of this. My first thought was
that it does seem like there are a lot of possible options to the
functions (and OR-ing the flags with the base does seem weird, though I
can't think of a plausible case where it would actually cause errors).
Many of those options don't seem used in the example conversions (I'm
not clear who would want NUM_SATURATE, for example).

I wondered if we could do away with the radix entirely. Wouldn't we be
asking for base 10 most of the time? Of course, your first few patches
show octal parsing, but I wonder if we should actually have a separate
parse_mode() for that, since that seems to be the general reason for
doing octal parsing. 100000644 does not overflow an int, but it is
hardly a reasonable mode.

I also wondered if we could get rid of NUM_SIGN in favor of just having
the type imply it (e.g., convert_l would always allow negative numbers,
whereas convert_ul would not). But I suppose there are times when we end
up using an "int" to store an unsigned value for a good reason (e.g.,
"-1" is a sentinel value, but we expect only positive values from the
user). So that might be a bad idea.

I notice that you go up to "unsigned long" here for sizes. If we want to
use this everywhere, we'll need something larger for parsing off_t
values on 32-bit machines (though I would not at all be surprised with
the current code if 32-bit machines have a hard time configuring a
pack.packSizeLimit above 4G).

I wonder how much of the boilerplate in the parse_* functions could be
factored out to use a uintmax_t, with the caller just providing the
range. That would make it easier to add new types like off_t, and
possibly even constrained types (e.g., an integer from 0 to 100). On the
other hand, you mentioned to me elsewhere that there may be some bugs in
the range-checks of config.c's integer parsing. I suspect they are
related to exactly this kind of refactoring, so perhaps writing things
out is best.

> * Am I making callers too strict? In many cases where a positive
>   integer is expected (e.g., "--abbrev=<num>"), I have been replacing
>   code like
> [...]

IMHO most of the tightening happening here is a good thing, and means we
are more likely to notice mistakes rather than silently doing something
stupid.

For sites that currently allow it, I could imagine people using hex
notation for some values, though, depending on the context. It looks
there aren't many of them ((it is just when the radix is "0", right?).
Some of them look to be accidental (does anybody really ask for
--threads=0x10?), but others might not be (the pack index-version
contains an offset field that might be quite big).


Feel free to ignore any or all of that. It is not so much criticism as
a dump of thoughts I had while reading the patches. Perhaps you can pick
something useful out of that, and perhaps not. :)

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
                   ` (16 preceding siblings ...)
  2015-03-19  5:26 ` Jeff King
@ 2015-03-19  6:22 ` Junio C Hamano
  2015-03-24 15:42   ` Michael Haggerty
  17 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  6:22 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> * It allows leading whitespace.

This might be blessing in disguise.  Our friends on MacOS may be
relying on that

    git cmd --abbrev="$(wc -c <foo)"

to work "as expected", even though their "wc" gives leading spaces,
for example.

> * It allows arbitrary trailing characters.

Which is why we have introduced strtoul_ui() and such.

> * It allows a leading sign character ('+' or '-'), even though the
>   result is unsigned.

I do not particularly see it a bad thing to accept "--abbrev=+7".
Using unsigned type to accept a width and parsing "--abbrev=-7" into
a large positive integer _is_ a problem, and we may want to fix it,
but I think that is still within the scope of the original "better
strtol/strtoul", and I do not see it as a justification to introduce
a set of functions with completely unfamiliar name.

> * If the string doesn't contain a number at all, it sets its "endptr"
>   argument to point at the start of the string but doesn't set errno.

Why is that a problem?  A caller that cares is supposed to check
endptr and the string it gave, no?  Now, if strtoul_ui() and friends
do not do so, I again think that is still within the scope of the
original "better strtol/strtoul".

> * If the value is larger than fits in an unsigned long, it returns the
>   value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).

Ditto.

> * If the value is between -ULONG_MAX and 0, it returns the positive
>   integer with the same bit pattern, without setting errno(!) (I can
>   hardly think of a scenario where this would be useful.)

Ditto.

> * For base=0 (autodetect base), it allows a base specifier prefix "0x"
>   or "0" and parses the number accordingly. For base=16 it also allows
>   a "0x" prefix.

Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
cmd --hexval=1234" would suffice?

> When I looked around, I found scores of sites that call atoi(),
> strtoul(), and strtol() carelessly. And it's understandable. Calling
> any of these functions safely is so much work as to be completely
> impractical in day-to-day code.

Yes, these burdens on the caller were exactly why strtoul_ui()
etc. wanted to reduce---and it will be a welcome change to do
necessary checks that are currently not done.

> Please see the docstrings in numparse.h in the first commit for
> detailed API docs. Briefly, there are eight new functions:
>
>     parse_{l,ul,i,ui}(const char *s, unsigned int flags,
>                       T *result, char **endptr);
>     convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);
>
> The parse_*() functions are for parsing a number from the start of a
> string; the convert_*() functions are for parsing a string that
> consists of a single number.

I am not sure if I follow.  Why not unify them into one 4-function
set, with optional endptr that could be NULL?

While we are on the topic of improving number parsing, one thing
that I found somewhat frustrating is that config.c layer knows the
scaling suffix but option parsing layer does not.  These functions
should offer an option (via their "flags") to say "I allow scaled
numbers like 2k, 4M, etc.".

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-17 19:46   ` Michael Haggerty
@ 2015-03-19  6:31     ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  6:31 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Here were my thoughts:
>
> * I wanted to change the interface to look less like
>   strtol()/strtoul(), so it seemed appropriate to make the names
>   more dissimilar.

One reason I prefer the names in the compat-util is that it makes it
clear that what is converted into what (e.g. "str" to "ul") and what
type the result is stored ("i" or "ui"), and as an added bonus, with
a short-and-sweet "to" it is already so clear we are "converting"
that we do not have to use verbs like "parse_" or "convert_".  From
parse_i() and convert_i(), I cannot tell which one lacks the endptr
and parses to the end and which one takes the endptr and tells the
caller where the conversion ended (and it also does not tell us that
they are converting from a string).

> * The functions seemed long enough that they shouldn't be inline,
>   and I wanted to put them in their own module rather than put
>   them in git-compat-util.h.

I actually agree on this, but that is not a justification why they
have to be named very differently.

Having said that, I would say that it will be futile to discuss this
further, as we seem to agree to disagree.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>"
  2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
@ 2015-03-19  6:34   ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  6:34 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> This tightens up the parsing a bit:
>
> * Leading whitespace is no longer allowed
> * '+' and '-' are no longer allowed
>
> It also removes the need to check separately that max_count is
> non-negative.

Hmmm, on the surface this sound nice, but I am not sure if it is a
good idea.  "git cmd-in-log-family -3 --max-count=-1" would have
been a nice way to flip a max-count given earlier on the command
line back to the default (unlimited) and we are removing it without
giving any escape hatch?

>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  revision.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 25838fc..4908e66 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -1,4 +1,5 @@
>  #include "cache.h"
> +#include "numparse.h"
>  #include "tag.h"
>  #include "blob.h"
>  #include "tree.h"
> @@ -1709,8 +1710,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
>  		return argcount;
>  	} else if ((*arg == '-') && isdigit(arg[1])) {
>  		/* accept -<digit>, like traditional "head" */
> -		if (strtol_i(arg + 1, 10, &revs->max_count) < 0 ||
> -		    revs->max_count < 0)
> +		if (convert_i(arg + 1, 10, &revs->max_count))
>  			die("'%s': not a non-negative integer", arg + 1);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num>
  2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
@ 2015-03-19  6:37   ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  6:37 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> die() with an error message if the argument is not a non-negative
> integer. This change tightens up parsing: '+' and '-', leading
> whitespace, and trailing junk are all disallowed now.
>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  diff.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/diff.c b/diff.c
> index be389ae..1cc5428 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3830,7 +3830,8 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
>  	else if (!strcmp(arg, "--abbrev"))
>  		options->abbrev = DEFAULT_ABBREV;
>  	else if (skip_prefix(arg, "--abbrev=", &arg)) {
> -		options->abbrev = strtoul(arg, NULL, 10);
> +		if (convert_i(arg, 10, &options->abbrev))
> +			die("--abbrev requires an integer argument");

Everybody leading up to this step said "a non-negative integer", but
this one is different?

>  		if (options->abbrev < MINIMUM_ABBREV)
>  			options->abbrev = MINIMUM_ABBREV;
>  		else if (40 < options->abbrev)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-19  5:26 ` Jeff King
@ 2015-03-19  6:41   ` Junio C Hamano
  2015-03-19  7:32   ` Junio C Hamano
  2015-03-24 15:05   ` Michael Haggerty
  2 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  6:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, Jeff King, git

Jeff King <peff@peff.net> writes:

> I wondered if we could do away with the radix entirely. Wouldn't we be
> asking for base 10 most of the time? Of course, your first few patches
> show octal parsing, but I wonder if we should actually have a separate
> parse_mode() for that, since that seems to be the general reason for
> doing octal parsing. 100000644 does not overflow an int, but it is
> hardly a reasonable mode.

True.

> I also wondered if we could get rid of NUM_SIGN in favor of just having
> the type imply it (e.g., convert_l would always allow negative numbers,
> whereas convert_ul would not). But I suppose there are times when we end
> up using an "int" to store an unsigned value for a good reason (e.g.,
> "-1" is a sentinel value, but we expect only positive values from the
> user). So that might be a bad idea.

Yeah, there is a reason why ssize_t exists.

> IMHO most of the tightening happening here is a good thing, and means we
> are more likely to notice mistakes rather than silently doing something
> stupid.

I do like the general idea of making sure we avoid use of bare
atoi() and unchecked use of strtol() and such, and have a set of
well defined helper functions to perform the common checks, exactly
for the reason you said.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-19  5:26 ` Jeff King
  2015-03-19  6:41   ` Junio C Hamano
@ 2015-03-19  7:32   ` Junio C Hamano
  2015-03-24 16:06     ` Michael Haggerty
  2015-03-24 15:05   ` Michael Haggerty
  2 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2015-03-19  7:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, Jeff King, git

Jeff King <peff@peff.net> writes:

> I wonder how much of the boilerplate in the parse_* functions could be
> factored out to use a uintmax_t, with the caller just providing the
> range. That would make it easier to add new types like off_t, and
> possibly even constrained types (e.g., an integer from 0 to 100). On the
> other hand, you mentioned to me elsewhere that there may be some bugs in
> the range-checks of config.c's integer parsing. I suspect they are
> related to exactly this kind of refactoring, so perhaps writing things
> out is best.

I like this idea very well.  I wonder if we can implement the family
of

    parse_{type}(const char *, unsigned int flags,
    		 const char **endptr, {type} *result)

functions by calls a helper that internally deals with the numbers
in uintmax_t, and then checks if the value fits within the possible
range of the *result before returning.

    int parse_i(const char *str, unsigned flags,
		const char **endptr, int *result) {
	uintmax_t val;
        int sign = 1, status;
        if (*str == '-') {
		sign = -1; 
                str++;
	}
        status = parse_num(str, flags, endptr, &val, INT_MAX);
        if (status < 0)
        	return status;
	*result = sign < 0 ? -val : val;
        return 0;
    }

(assuming the last parameter to parse_num is used to check if the
result fits within that range).  Or that may be easier and cleaner
to be done in the caller with or something like that:

	status = parse_num(str, flags, endptr, &val);
        if (status < 0)
        	return status;
	if (INT_MAX <= val * sign || val * sign <= INT_MIN) {
        	errno = ERANGE;
                return -1;
	}

If we wanted to, we may even be able to avoid duplication of
boilerplate by wrapping the above pattern within a macro,
parameterizing the TYPE_{MIN,MAX} and using token pasting, to
expand to four necessary result types.

There is no reason for the implementation of the parse_num() helper
to be using strtoul(3) or strtoull(3); its behaviour will be under
our total control.  It can become fancier by enriching the flags
bits (e.g. allow scaling factor, etc.) only once and all variants
for various result types will get the same feature.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] numparse: new module for parsing integral numbers
  2015-03-18 22:47     ` Michael Haggerty
@ 2015-03-20  8:54       ` Eric Sunshine
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Sunshine @ 2015-03-20  8:54 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Junio C Hamano, Jeff King, Git List

On Wed, Mar 18, 2015 at 6:47 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 03/18/2015 07:27 PM, Eric Sunshine wrote:
>> On Tuesday, March 17, 2015, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> Implement wrappers for strtol() and strtoul() that are safer and more
>>> convenient to use.
>>> + * The lowest 6 bits of flags hold the numerical base that should be
>>> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
>>> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
>>> + * detected automatically from the string's prefix.
>>
>> Does this restriction go against the goal of making these functions
>> convenient, even while remaining strict? Is there a strong reason for
>> not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
>> make it consistent with strto*l() without (I think) introducing any
>> ambiguity.
>
> I thought about doing this. If it were possible to eliminate
> NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
> good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
> namely, that an "0x" prefix, if present, is consumed. So
>
>     parse_i("0xff", 16 | NUM_BASE_SPECIFIER, &result, &endptr)
>
> gives result==255 and endptr==s+4, whereas
>
>     parse_i("0xff", 16, &result, &endptr)
>
> gives result==0 and entptr==s+1 (it treats the "x" as the end of the
> string).
>
> We could forgo that feature, effectively allowing a base specifier if
> and only if base==0. But I didn't want to rule out allowing an optional
> base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
> dispensed with entirely.

Making base==0 a special case doesn't have to mean ruling out or
eliminating NUM_BASE_SPECIFIER for the base==16 case. However, a
special case base==0 would make the API a bit non-orthogonal and
require extra documentation. But for those familiar with how strto*l()
treats base==0 specially, the rule of least surprise may apply.

> If you agree with that, then the remaining question is: which policy is
> less error-prone? My thinking was that forcing the caller to specify
> NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
> reminder that the two features are intertwined.

Since base==0 would unambiguously imply NUM_BASE_SPECIFIER, being able
to tersely say convert_i(s, 0, &result) would be a win from a
convenience perspective. However...

> Because another
> imaginable policy (arguably more consistent with the policy for base!=0)
> would be that
>
>     convert_i(s, 0, &result)
>
> , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
> prefix, and therefore indirectly only allows base-10 numbers.
>
> But I don't feel strongly about this.

I don't feel strongly about it either, and can formulate arguments
either way. Assuming you stick with your current design, if the
strictness of having to specify NUM_BASE_SPECIFIER for base==0 proves
too burdensome, it can be loosened later by making base==0 a special
case.

On the other hand, if you go with Junio's suggestion of choosing names
for these functions which more closely mirror strto*l() names, then
the rule of least surprise might suggest that a special case for
base==0 has merit.

>>> + * Return 0 on success or a negative value if there was an error. On
>>> + * failure, *result and *entptr are left unchanged.
>>> + *
>>> + * Please note that if NUM_TRAILING is not set, then it is
>>> + * nevertheless an error if there are any characters between the end
>>> + * of the number and the end of the string.
>>
>> Again, on the subject of convenience, why this restriction? The stated
>> purpose of the parse_*() functions is to parse the number from the
>> front of the string and return a pointer to the first non-numeric
>> character following. As  a reader of this API, I interpret that as
>> meaning that NUM_TRAILING is implied. Is there a strong reason for not
>> inferring NUM_TRAILING for the parse_*() functions at the API level?
>> (I realize that the convert_*() functions are built atop parse_*(),
>> but that's an implementation detail.)
>
> Yes, I'd also thought about that change:
>
> * Make NUM_TRAILING private.
> * Make the current parse_*() functions private.
> * Add new parse_*(s, flags, result, endptr) functions that imply
> NUM_TRAILING (and maybe they should *require* a non-null endptr argument).
> * Change the convert_*() to not allow the NUM_TRAILING flag.
>
> This would add a little bit of code, so I didn't do it originally. But
> since you seem to like the idea too, I guess I will make the change.

The other option (as Junio also suggested) is to collapse this API
into just the four parse_*() functions. All of the call-sites you
converted in this series invoked convert_*(), but it's not clear that
having two sets of functions which do almost the same thing is much of
a win. If the default is for NUM_TRAILING to be off, and if NULL
'endptr' is allowed, then:

    parse_ul(s, 10, &result, NULL);

doesn't seem particularly burdensome compared with:

    convert_ul(s, 10, &result);

A more compact API, with only the parse_*() functions, means less to
remember and less to document, and often is more orthogonal.

The argument for dropping the convert_*() functions is strengthened if
you follow Junio's suggestion to name these functions after the
strto*l() variations.

>>> +int parse_l(const char *s, unsigned int flags,
>>> +           long *result, char **endptr);
>>
>> Do we want to perpetuate the ugly (char **) convention for 'endptr'
>> from strto*l()? Considering that the incoming string is const, it
>> seems undesirable to return a non-const pointer to some place inside
>> that string.
>
> Yes, I guess we could do some internal casting and expose endptr as
> (const char **). It would make it a bit harder if the user actually
> wants to pass a non-const string to the function and then modify the
> string via the returned endptr, but I haven't run into that pattern yet
> so let's make the change you suggested.

I don't feel strongly about this either. It always struck me as a bit
of an API wart, but I can see the convenience value of non-const
'endptr' if the caller intends to modify the string. If you rename
these functions to mirror strto*l(), then the rule of least surprise
would suggest that 'endptr' should remain non-const.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] numparse: new module for parsing integral numbers
  2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
  2015-03-18 18:27   ` Eric Sunshine
@ 2015-03-20 17:51   ` Junio C Hamano
  1 sibling, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-20 17:51 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> +static int parse_precheck(const char *s, unsigned int *flags)
> +{
> +	const char *number;
> +
> +	if (isspace(*s)) {
> +		if (!(*flags & NUM_LEADING_WHITESPACE))
> +			return -NUM_LEADING_WHITESPACE;
> +		do {
> +			s++;
> +		} while (isspace(*s));
> +	}
> +
> +	if (*s == '+') {
> +		if (!(*flags & NUM_PLUS))
> +			return -NUM_PLUS;
> +		number = s + 1;
> +		*flags &= ~NUM_NEGATIVE;
> +	} else if (*s == '-') {
> +		if (!(*flags & NUM_MINUS))
> +			return -NUM_MINUS;
> +		number = s + 1;
> +		*flags |= NUM_NEGATIVE;
> +	} else {
> +		number = s;
> +		*flags &= ~NUM_NEGATIVE;
> +	}
> +
> +	if (!(*flags & NUM_BASE_SPECIFIER)) {
> +		int base = *flags & NUM_BASE_MASK;
> +		if (base == 0) {
> +			/* This is a pointless combination of options. */
> +			die("BUG: base=0 specified without NUM_BASE_SPECIFIER");
> +		} else if (base == 16 && starts_with(number, "0x")) {
> +			/*
> +			 * We want to treat this as zero terminated by
> +			 * an 'x', whereas strtol()/strtoul() would
> +			 * silently eat the "0x". We accomplish this
> +			 * by treating it as a base 10 number:
> +			 */
> +			*flags = (*flags & ~NUM_BASE_MASK) | 10;
> +		}
> +	}
> +	return 0;
> +}

I somehow feel that a pre-processing that only _inspects_ part of
the string, without munging that string (e.g. notice '-' but feed
that to underlying strtol(3)) somewhat a brittle approach.  When I
read the above for the first time, I somehow expected that the code
would notice leading '-', strip that leading '-' and remember the
fact that it did so in the *flags, let the strtol(3) to parse the
remainder _and_ always make sure the returned result is not negative
(because that would imply that the original input had two leading
minuses and digits), and give the sign based on what this preprocess
found out in *flags, and then seeing that there is no sign of such
processing in the caller I scratched my head.

I still have not convinced myself that what I am seeing in the
base==16 part in the above is correct.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-19  5:26 ` Jeff King
  2015-03-19  6:41   ` Junio C Hamano
  2015-03-19  7:32   ` Junio C Hamano
@ 2015-03-24 15:05   ` Michael Haggerty
  2 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2015-03-24 15:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Jeff King, git

On 03/19/2015 06:26 AM, Jeff King wrote:
> On Tue, Mar 17, 2015 at 05:00:02PM +0100, Michael Haggerty wrote:
> 
>> My main questions:
>>
>> * Do people like the API? My main goal was to make these functions as
>>   painless as possible to use correctly, because there are so many
>>   call sites.
>>
>> * Is it too gimmicky to encode the base together with other options in
>>   `flags`? (I think it is worth it to avoid the need for another
>>   parameter, which callers could easily put in the wrong order.)
> 
> I definitely like the overall direction of this. My first thought was
> that it does seem like there are a lot of possible options to the
> functions (and OR-ing the flags with the base does seem weird, though I
> can't think of a plausible case where it would actually cause errors).
> Many of those options don't seem used in the example conversions (I'm
> not clear who would want NUM_SATURATE, for example).

There are a lot of options, but so far only few of them have been used.
As the call sites are rewritten we will see which of these features are
useful, and which can be dispensed with. If groups of options tend to be
used together, we can define constants for them like I've done with
NUM_SLOPPY and NUM_SIGN.

Regarding NUM_SATURATE: I'm not sure who would want it either, but I
thought there might be places where the user wants to specify
(effectively) "infinity", and it might be convenient to let him specify
something easy to type like "--max=999999999999" rather than
"--max=2147483647".

> I wondered if we could do away with the radix entirely. Wouldn't we be
> asking for base 10 most of the time? Of course, your first few patches
> show octal parsing, but I wonder if we should actually have a separate
> parse_mode() for that, since that seems to be the general reason for
> doing octal parsing. 100000644 does not overflow an int, but it is
> hardly a reasonable mode.

Again, as a first pass I wanted to just have a really flexible API so
that call sites can be rewritten without a lot of extra thought. If
somebody wants to add a parse_mode() function, it will be easy to build
on top of convert_ui(). But that change can be done after this one.

> I also wondered if we could get rid of NUM_SIGN in favor of just having
> the type imply it (e.g., convert_l would always allow negative numbers,
> whereas convert_ul would not). But I suppose there are times when we end
> up using an "int" to store an unsigned value for a good reason (e.g.,
> "-1" is a sentinel value, but we expect only positive values from the
> user). So that might be a bad idea.

Yes, as I was rewriting call sites, I found many that used the unsigned
variants of the parsing functions but stored the result in an int.
Probably some of these use -1 to denote "unset"; it might be that there
are other cases where the variable could actually be declared to be
"unsigned int".

Prohibiting signs when parsing signed quantities isn't really elegant
from an API purity point of view, but it sure is handy!

> I notice that you go up to "unsigned long" here for sizes. If we want to
> use this everywhere, we'll need something larger for parsing off_t
> values on 32-bit machines (though I would not at all be surprised with
> the current code if 32-bit machines have a hard time configuring a
> pack.packSizeLimit above 4G).

Yes, probably. I haven't run into such call sites yet.

> I wonder how much of the boilerplate in the parse_* functions could be
> factored out to use a uintmax_t, with the caller just providing the
> range. That would make it easier to add new types like off_t, and
> possibly even constrained types (e.g., an integer from 0 to 100). On the
> other hand, you mentioned to me elsewhere that there may be some bugs in
> the range-checks of config.c's integer parsing. I suspect they are
> related to exactly this kind of refactoring, so perhaps writing things
> out is best.

It's not a lot of code yet. If we find out we need variants for size_t
and off_t and uintmax_t and intmax_t then such a refactoring would
definitely be worth considering.

>> * Am I making callers too strict? In many cases where a positive
>>   integer is expected (e.g., "--abbrev=<num>"), I have been replacing
>>   code like
>> [...]
> 
> IMHO most of the tightening happening here is a good thing, and means we
> are more likely to notice mistakes rather than silently doing something
> stupid.
> 
> For sites that currently allow it, I could imagine people using hex
> notation for some values, though, depending on the context. It looks
> there aren't many of them ((it is just when the radix is "0", right?).
> Some of them look to be accidental (does anybody really ask for
> --threads=0x10?), but others might not be (the pack index-version
> contains an offset field that might be quite big).

Yes, we can even debate whether we want to implement a general policy
that user-entered integers can be specified in any radix. Probably
nobody will ever specify "--threads=0x10", but is there harm in allowing it?

Against the gain in flexibility, I see the following potential
disadvantages:

* Added cognitive burden for rarely-used flexibility.
* Other Git clients might feel obliged to be just as flexible, causing
their implementers extra work.
* Users might see commands written in unfamiliar ways (e.g. in scripts
or stackoverflow) and get confused.
* Octal is ambiguous with decimal. What to make of "--count=010"? People
who expect octal numbers to be allowed will think it means 8, whereas
people who don't expect octal numbers to be allowed (or don't even know
what an octal number is!) will think that it means 10. Such uses might
even already exist and have their meaning changed if we start allowing
octal.

All in all, I thought it is less error-prone to default to allowing only
decimal, except in selected situations where hex and octal are
traditionally used (e.g., file modes, memory limits).

Most of the call sites so far have explicitly specified decimal parsing,
and I have left them unchanged.

> Feel free to ignore any or all of that. It is not so much criticism as
> a dump of thoughts I had while reading the patches. Perhaps you can pick
> something useful out of that, and perhaps not. :)

Thanks very much for the feedback!

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-19  6:22 ` Junio C Hamano
@ 2015-03-24 15:42   ` Michael Haggerty
  2015-03-24 15:58     ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-24 15:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On 03/19/2015 07:22 AM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> * It allows leading whitespace.
> 
> This might be blessing in disguise.  Our friends on MacOS may be
> relying on that
> 
>     git cmd --abbrev="$(wc -c <foo)"
> 
> to work "as expected", even though their "wc" gives leading spaces,
> for example.

Yuck. If you'd like, I can make sure to continue allowing leading
whitespace everywhere that it is allowed now. Alternatively, we could
make the parsing tighter and see if anybody squeals. What do you think?

>> * It allows arbitrary trailing characters.
> 
> Which is why we have introduced strtoul_ui() and such.
> 
>> * It allows a leading sign character ('+' or '-'), even though the
>>   result is unsigned.
> 
> I do not particularly see it a bad thing to accept "--abbrev=+7".
> Using unsigned type to accept a width and parsing "--abbrev=-7" into
> a large positive integer _is_ a problem, and we may want to fix it,
> but I think that is still within the scope of the original "better
> strtol/strtoul", and I do not see it as a justification to introduce
> a set of functions with completely unfamiliar name.

It is easy to allow "--abbrev=+7"; I would just need to add NUM_PLUS to
those call sites. Should I do so?

It would even be possible to allow "--abbrev=-0", which is also a
non-negative integer. But it's more work and it seems pretty silly to me.

>> * If the string doesn't contain a number at all, it sets its "endptr"
>>   argument to point at the start of the string but doesn't set errno.
> 
> Why is that a problem?  A caller that cares is supposed to check
> endptr and the string it gave, no?  Now, if strtoul_ui() and friends
> do not do so, I again think that is still within the scope of the
> original "better strtol/strtoul".

The text you are quoting was my argument for why we need wrappers around
strtol() and strtoul(), not for how the wrappers should be named.

The "endptr" convention for detecting errors is fine in theory, but in
practice a large majority of callers didn't check for errors correctly.
This is partly because the "endptr" convention is so awkward.

The existing strtoul_ui() and strtol_i() did check endptr correctly, but
there were only int-sized variants of the functions, and they didn't
give the caller the chance to capture endptr if the caller wanted to
allow trailing characters.

Why did I rename the wrapper functions?

* The old names, I think, emphasize that they take the long-sized
results of strtou?l() and convert them to integer size, which I think is
an implementation detail.
* The new functions will also have long versions. The obvious way to
generalize the existing function names for long variants (srtoul_ul()
and strtol_l()) seem kindof redundant.
* I wanted to change the signature and behavior of the functions, so
knowledge of the existing functions wouldn't be super helpful in
understanding the new ones.

>> * If the value is larger than fits in an unsigned long, it returns the
>>   value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).
> 
> Ditto.
> 
>> * If the value is between -ULONG_MAX and 0, it returns the positive
>>   integer with the same bit pattern, without setting errno(!) (I can
>>   hardly think of a scenario where this would be useful.)
> 
> Ditto.
> 
>> * For base=0 (autodetect base), it allows a base specifier prefix "0x"
>>   or "0" and parses the number accordingly. For base=16 it also allows
>>   a "0x" prefix.
> 
> Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
> cmd --hexval=1234" would suffice?

In some cases we would like to allow that flexibility; in some cases
not. But the strtol()/strtoul() functions *always* allow it.

>> When I looked around, I found scores of sites that call atoi(),
>> strtoul(), and strtol() carelessly. And it's understandable. Calling
>> any of these functions safely is so much work as to be completely
>> impractical in day-to-day code.
> 
> Yes, these burdens on the caller were exactly why strtoul_ui()
> etc. wanted to reduce---and it will be a welcome change to do
> necessary checks that are currently not done.
> 
>> Please see the docstrings in numparse.h in the first commit for
>> detailed API docs. Briefly, there are eight new functions:
>>
>>     parse_{l,ul,i,ui}(const char *s, unsigned int flags,
>>                       T *result, char **endptr);
>>     convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);
>>
>> The parse_*() functions are for parsing a number from the start of a
>> string; the convert_*() functions are for parsing a string that
>> consists of a single number.
> 
> I am not sure if I follow.  Why not unify them into one 4-function
> set, with optional endptr that could be NULL?

In the next version of this patch series I plan to include only four
functions, str_to_{i,ui,l,ul}(), named that way to make them more
reminiscent of the functions that they replace. They will take an entptr
argument, but if that argument is set to NULL, then the function will
error out if there are any trailing characters.

> While we are on the topic of improving number parsing, one thing
> that I found somewhat frustrating is that config.c layer knows the
> scaling suffix but option parsing layer does not.  These functions
> should offer an option (via their "flags") to say "I allow scaled
> numbers like 2k, 4M, etc.".

That's a good idea. But I think I will leave that for a later patch
series, if that is OK with you.

Thanks for the great feedback!
Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 15:42   ` Michael Haggerty
@ 2015-03-24 15:58     ` Junio C Hamano
  2015-03-24 16:09       ` Junio C Hamano
  2015-03-24 17:39       ` Michael Haggerty
  0 siblings, 2 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-24 15:58 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> It is easy to allow "--abbrev=+7"; I would just need to add NUM_PLUS to
> those call sites. Should I do so?

The more relevant question to ask from my point of view is why you
need to "add" NUM_PLUS to "enable" it.  What valid reason do you
have to forbid it anywhere?  Only because you do not accept it by
default, you need to "add" to "enable".

>> Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
>> cmd --hexval=1234" would suffice?
>
> In some cases we would like to allow that flexibility; in some cases
> not. But the strtol()/strtoul() functions *always* allow it.

The same issue.  Whare are these "some cases"?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-19  7:32   ` Junio C Hamano
@ 2015-03-24 16:06     ` Michael Haggerty
  2015-03-24 16:49       ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-24 16:06 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King; +Cc: Jeff King, git

On 03/19/2015 08:32 AM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> I wonder how much of the boilerplate in the parse_* functions could be
>> factored out to use a uintmax_t, with the caller just providing the
>> range. That would make it easier to add new types like off_t, and
>> possibly even constrained types (e.g., an integer from 0 to 100). On the
>> other hand, you mentioned to me elsewhere that there may be some bugs in
>> the range-checks of config.c's integer parsing. I suspect they are
>> related to exactly this kind of refactoring, so perhaps writing things
>> out is best.
> 
> I like this idea very well.  I wonder if we can implement the family
> of
> 
>     parse_{type}(const char *, unsigned int flags,
>     		 const char **endptr, {type} *result)
> 
> functions by calls a helper that internally deals with the numbers
> in uintmax_t, and then checks if the value fits within the possible
> range of the *result before returning.
> 
>     int parse_i(const char *str, unsigned flags,
> 		const char **endptr, int *result) {
> 	uintmax_t val;
>         int sign = 1, status;
>         if (*str == '-') {
> 		sign = -1; 
>                 str++;
> 	}
>         status = parse_num(str, flags, endptr, &val, INT_MAX);
>         if (status < 0)
>         	return status;
> 	*result = sign < 0 ? -val : val;
>         return 0;
>     }
> 
> (assuming the last parameter to parse_num is used to check if the
> result fits within that range).  Or that may be easier and cleaner
> to be done in the caller with or something like that:
> 
> 	status = parse_num(str, flags, endptr, &val);
>         if (status < 0)
>         	return status;
> 	if (INT_MAX <= val * sign || val * sign <= INT_MIN) {
>         	errno = ERANGE;
>                 return -1;
> 	}
> 
> If we wanted to, we may even be able to avoid duplication of
> boilerplate by wrapping the above pattern within a macro,
> parameterizing the TYPE_{MIN,MAX} and using token pasting, to
> expand to four necessary result types.
> 
> There is no reason for the implementation of the parse_num() helper
> to be using strtoul(3) or strtoull(3); its behaviour will be under
> our total control.  It can become fancier by enriching the flags
> bits (e.g. allow scaling factor, etc.) only once and all variants
> for various result types will get the same feature.

Parsing numbers is not rocket science, but there are a lot of pitfalls,
especially around overflow. It's even harder to write such code via
macros and the result is less readable.

This patch series is mostly about finding a reasonable API and whipping
the callers into shape. That seems ambitious enough for me. I'd rather
stick with boring wrappers for now and lean on strtol()/strtoul() to do
the dirty work. It will be easy for a motivated person to change the
implementation later.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 15:58     ` Junio C Hamano
@ 2015-03-24 16:09       ` Junio C Hamano
  2015-03-24 17:39       ` Michael Haggerty
  1 sibling, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-24 16:09 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Junio C Hamano <gitster@pobox.com> writes:

> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>> It is easy to allow "--abbrev=+7"; I would just need to add NUM_PLUS to
>> those call sites. Should I do so?
>
> The more relevant question to ask from my point of view is why you
> need to "add" NUM_PLUS to "enable" it.  What valid reason do you
> have to forbid it anywhere?  Only because you do not accept it by
> default, you need to "add" to "enable".
>
>>> Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
>>> cmd --hexval=1234" would suffice?
>>
>> In some cases we would like to allow that flexibility; in some cases
>> not. But the strtol()/strtoul() functions *always* allow it.
>
> The same issue.  Whare are these "some cases"?

And the same issue appears in the "leading whitespace" thing I did
not mention in the earlier part of your message I responded to. I
also notice you answered yourself that there may not be a valid
reason to forbid end-user supplied "0x" prefix to arguments we
expect an integer for in your other message.

In short, if it is not a clearly bogus input that indicates a typo
or something (e.g.  "--size=48l? did the user meant 48, 48k, or
48m?"), and if it is clear we can tell the user meant what the code
would naturally interpret as (e.g. "--hexval=0x1234"), why forbid
it?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 16:06     ` Michael Haggerty
@ 2015-03-24 16:49       ` René Scharfe
  2015-03-25 21:14         ` Michael Haggerty
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2015-03-24 16:49 UTC (permalink / raw)
  To: Michael Haggerty, Junio C Hamano, Jeff King; +Cc: Jeff King, git

Am 24.03.2015 um 17:06 schrieb Michael Haggerty:
> Parsing numbers is not rocket science, but there are a lot of pitfalls,
> especially around overflow. It's even harder to write such code via
> macros and the result is less readable.
>
> This patch series is mostly about finding a reasonable API and whipping
> the callers into shape. That seems ambitious enough for me. I'd rather
> stick with boring wrappers for now and lean on strtol()/strtoul() to do
> the dirty work. It will be easy for a motivated person to change the
> implementation later.

The OpenBSD developers invented strtonum for that.  Are you aware of it? 
  Would it fit?  This discussion may be useful:

	http://www.tedunangst.com/flak/post/the-design-of-strtonum

René

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 15:58     ` Junio C Hamano
  2015-03-24 16:09       ` Junio C Hamano
@ 2015-03-24 17:39       ` Michael Haggerty
  2015-03-24 18:08         ` Junio C Hamano
  1 sibling, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-24 17:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On 03/24/2015 04:58 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> It is easy to allow "--abbrev=+7"; I would just need to add NUM_PLUS to
>> those call sites. Should I do so?
> 
> The more relevant question to ask from my point of view is why you
> need to "add" NUM_PLUS to "enable" it.  What valid reason do you
> have to forbid it anywhere?  Only because you do not accept it by
> default, you need to "add" to "enable".

I want to be able to plunge into this project without first auditing all
call sites to see which features will turn out to be needed. So I'm
erring on the side of flexibility. For now, I want to be able to
prohibit '+' signs.

Currently all of the flags cause additional features to be enabled. My
guess was that most callers *won't* need most features, so it seemed
easiest and most consistent to have all features be turned off by
default and let the caller add the features that it wants to allow.

Regarding specifically allowing/disallowing a leading '+': I saw a
couple of callsites that explicitly check that the first character is a
digit before calling strtol(). I assumed that is to disallow sign
characters [1]. For example,

    diff.c: optarg()
    builtin/apply.c: parse_num()
    maybe date.c: match_multi_number()

There are other callsites that call strtoul(), but store the result in a
signed variable. Those would presumably not want to allow leading '-',
but I'm not sure.

I also imagine that there are places in protocols or file formats where
signs should not be allowed (e.g., timestamps in commits?).

>>> Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
>>> cmd --hexval=1234" would suffice?
>>
>> In some cases we would like to allow that flexibility; in some cases
>> not. But the strtol()/strtoul() functions *always* allow it.
> 
> The same issue.  Whare are these "some cases"?

I admit I'm not sure there are such places for hexadecimal numbers.

I'm coming around to an alternate plan:

Step 1: write a NUM_DEFAULT combination-of-options that gives the new
functions behavior very like strtol()/strtoul() but without their insane
features.

Step 2: rewrite all callers to use that option (and usually endptr=NULL,
meaning no trailing characters) unless it is manifestly clear that they
are already trying to forbid some other features. This will already
produce the largest benefit: avoiding overflows, missing error checking,
etc.

Steps 3 through ∞: as time allows, rewrite individual callsites to be
stricter where appropriate.

Hopefully steps 1 and 2 will not be too controversial.

Michael

[1] That assumption is based on a rather quick look over the code,
because with well over 100 callsites, it is not practical to study each
callsite carefully.

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 17:39       ` Michael Haggerty
@ 2015-03-24 18:08         ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-24 18:08 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Regarding specifically allowing/disallowing a leading '+': I saw a
> couple of callsites that explicitly check that the first character is a
> digit before calling strtol(). I assumed that is to disallow sign
> characters [1]. For example,
>
>     diff.c: optarg()

This one I know is from "if not a digit we know it is not a number";
it is not an attempt to say "we must forbid numbers to be spelled
with '+'", but more about "we do not need it and this is easier to
code without a full fledged str-to-num helper API" sloppiness.

>     builtin/apply.c: parse_num()

This parses "@@ -l,k +m,n @@@" after stripping the punctuation
around the numbers, so this is a valid reason why you would want an
optional feature "CANNOT_HAVE_SIGN" in the str-to-num parser and use
it.

>     maybe date.c: match_multi_number()

The approxidate callers parse random garbage input and attempt to
make best sense out of it, so they tend to try limitting the damage
caused by incorrect guesses by insisting "we do not consider +0 to
be zero" and such.  So this is a valid reason why you would want an
optional feature "CANNOT_HAVE_SIGN" in the str-to-num parser and use
it.

> I'm coming around to an alternate plan:
>
> Step 1: write a NUM_DEFAULT combination-of-options that gives the new
> functions behavior very like strtol()/strtoul() but without their insane
> features.
>
> Step 2: rewrite all callers to use that option (and usually endptr=NULL,
> meaning no trailing characters) unless it is manifestly clear that they
> are already trying to forbid some other features. This will already
> produce the largest benefit: avoiding overflows, missing error checking,
> etc.
>
> Steps 3 through ∞: as time allows, rewrite individual callsites to be
> stricter where appropriate.
>
> Hopefully steps 1 and 2 will not be too controversial.

All sounds sensible to me.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-24 16:49       ` René Scharfe
@ 2015-03-25 21:14         ` Michael Haggerty
  2015-03-25 21:59           ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2015-03-25 21:14 UTC (permalink / raw)
  To: René Scharfe, Junio C Hamano, Jeff King; +Cc: Jeff King, git

On 03/24/2015 05:49 PM, René Scharfe wrote:
> Am 24.03.2015 um 17:06 schrieb Michael Haggerty:
>> Parsing numbers is not rocket science, but there are a lot of pitfalls,
>> especially around overflow. It's even harder to write such code via
>> macros and the result is less readable.
>>
>> This patch series is mostly about finding a reasonable API and whipping
>> the callers into shape. That seems ambitious enough for me. I'd rather
>> stick with boring wrappers for now and lean on strtol()/strtoul() to do
>> the dirty work. It will be easy for a motivated person to change the
>> implementation later.
> 
> The OpenBSD developers invented strtonum for that.  Are you aware of it?
>  Would it fit?  This discussion may be useful:
> 
>     http://www.tedunangst.com/flak/post/the-design-of-strtonum

I wasn't aware of strtonum; thanks for the reference. It has an
untraditional interface. Their willingness to sacrifice half of the
unsigned range and requirement that the user specify minval and maxval
have the nice effect of permitting one function to be used for all
integer types.

I think git will need more flexibility, for example to support other
radixes and to allow trailing characters. So personally I don't think we
should use (or imitate) strtonum().

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
  2015-03-25 21:14         ` Michael Haggerty
@ 2015-03-25 21:59           ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2015-03-25 21:59 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: René Scharfe, Jeff King, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

[jc: dropped a non-working address from Cc list]

> I wasn't aware of strtonum; thanks for the reference. It has an
> untraditional interface. Their willingness to sacrifice half of the
> unsigned range and requirement that the user specify minval and maxval
> have the nice effect of permitting one function to be used for all
> integer types.
> ...
> I think git will need more flexibility, for example to support other
> radixes and to allow trailing characters. So personally I don't think we
> should use (or imitate) strtonum().

I had the same impression.  An earlier suggestion by Peff about
uintmax_t may be something we might want to keep in mind; we might
want to use strtoll/strtoull as the underlying implementation if we
wanted to avoid rolling our own.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-03-25 21:59 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
2015-03-18 18:27   ` Eric Sunshine
2015-03-18 22:47     ` Michael Haggerty
2015-03-20  8:54       ` Eric Sunshine
2015-03-20 17:51   ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" Michael Haggerty
2015-03-17 16:00 ` [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode Michael Haggerty
2015-03-17 16:00 ` [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places Michael Haggerty
2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
2015-03-19  6:34   ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions Michael Haggerty
2015-03-17 16:00 ` [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev=" Michael Haggerty
2015-03-17 16:00 ` [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument Michael Haggerty
2015-03-17 16:00 ` [PATCH 09/14] opt_arg(): val is always non-NULL Michael Haggerty
2015-03-17 16:00 ` [PATCH 10/14] opt_arg(): use convert_i() in implementation Michael Haggerty
2015-03-17 16:00 ` [PATCH 11/14] opt_arg(): report errors parsing option values Michael Haggerty
2015-03-17 16:00 ` [PATCH 12/14] opt_arg(): simplify pointer handling Michael Haggerty
2015-03-17 16:00 ` [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>" Michael Haggerty
2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
2015-03-19  6:37   ` Junio C Hamano
2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
2015-03-17 19:46   ` Michael Haggerty
2015-03-19  6:31     ` Junio C Hamano
2015-03-17 23:05 ` Duy Nguyen
2015-03-18  9:47   ` Michael Haggerty
2015-03-18  9:58     ` Duy Nguyen
2015-03-18 10:03     ` Jeff King
2015-03-18 10:20       ` Michael Haggerty
2015-03-19  5:26 ` Jeff King
2015-03-19  6:41   ` Junio C Hamano
2015-03-19  7:32   ` Junio C Hamano
2015-03-24 16:06     ` Michael Haggerty
2015-03-24 16:49       ` René Scharfe
2015-03-25 21:14         ` Michael Haggerty
2015-03-25 21:59           ` Junio C Hamano
2015-03-24 15:05   ` Michael Haggerty
2015-03-19  6:22 ` Junio C Hamano
2015-03-24 15:42   ` Michael Haggerty
2015-03-24 15:58     ` Junio C Hamano
2015-03-24 16:09       ` Junio C Hamano
2015-03-24 17:39       ` Michael Haggerty
2015-03-24 18:08         ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).