All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] 'unsigned long' to 'size_t' conversion
@ 2017-12-06 15:08 Derrick Stolee
  2017-12-06 16:48 ` Thomas Braun
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Derrick Stolee @ 2017-12-06 15:08 UTC (permalink / raw)
  To: git; +Cc: Jeff Hostetler, Jeff King, Junio C Hamano

There are several places in Git where we refer to the size of an object 
by an 'unsigned long' instead of a 'size_t'. In 64-bit Linux, 'unsigned 
long' is 8 bytes, but in 64-bit Windows it is 4 bytes.

The main issue with this conversion is that large objects fail to load 
(they seem to hash and store just fine). For example, the following 
'blob8gb' is an 8 GB file where the ith byte is equal to i % 256:

$ git hash-object -w --no-filters blob8gb
5391939346b98600acc0283dda24649450cec51f

$ git cat-file -s 5391939346b98600acc0283dda24649450cec51f
error: bad object header
fatal: git cat-file: could not get object info

An existing discussion can be found here: 
https://github.com/git-for-windows/git/issues/1063

The error message results from unpack_object_header_buffer() which had 
its most-recent meaningful change in 'ea4f9685: 
unpack_object_header_buffer(): clear the size field upon error' (in 2011).

In my opinion, the correct thing to do would be to replace all 'unsigned 
long's that refer to an object size and replace them with 'size_t'. 
However, a simple "git grep 'unsigned long size'" reveals 194 results, 
and there are other permutations of names and pointer types all over.

This conversion would be a significant patch, so I wanted to get the 
community's thoughts on this conversion.

If there are small, isolated chunks that can be done safely, then this 
may be a good target for a first patch.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] 'unsigned long' to 'size_t' conversion
  2017-12-06 15:08 [RFC] 'unsigned long' to 'size_t' conversion Derrick Stolee
@ 2017-12-06 16:48 ` Thomas Braun
  2017-12-06 18:10 ` Brandon Williams
  2017-12-06 21:43 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Braun @ 2017-12-06 16:48 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: Jeff Hostetler, Jeff King, Junio C Hamano

Am 06.12.2017 um 16:08 schrieb Derrick Stolee:

Hi Derrick,

> If there are small, isolated chunks that can be done safely, then this
> may be a good target for a first patch.

Here are some pointers to past discussions:
-
https://public-inbox.org/git/trinity-9f703269-6f73-4f6d-b90b-45e09e1c094c-1489582854278@3capp-gmx-bs66/
-
https://public-inbox.org/git/1502527643-21944-1-git-send-email-martin@mail.zuhause/

I'm posting this mainly so that we can avoid, if possible, duplicated work.

Hope that helps,
Thomas

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] 'unsigned long' to 'size_t' conversion
  2017-12-06 15:08 [RFC] 'unsigned long' to 'size_t' conversion Derrick Stolee
  2017-12-06 16:48 ` Thomas Braun
@ 2017-12-06 18:10 ` Brandon Williams
  2017-12-06 21:43 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Brandon Williams @ 2017-12-06 18:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff Hostetler, Jeff King, Junio C Hamano

On 12/06, Derrick Stolee wrote:
> There are several places in Git where we refer to the size of an
> object by an 'unsigned long' instead of a 'size_t'. In 64-bit Linux,
> 'unsigned long' is 8 bytes, but in 64-bit Windows it is 4 bytes.
> 
> The main issue with this conversion is that large objects fail to
> load (they seem to hash and store just fine). For example, the
> following 'blob8gb' is an 8 GB file where the ith byte is equal to i
> % 256:
> 
> $ git hash-object -w --no-filters blob8gb
> 5391939346b98600acc0283dda24649450cec51f
> 
> $ git cat-file -s 5391939346b98600acc0283dda24649450cec51f
> error: bad object header
> fatal: git cat-file: could not get object info
> 
> An existing discussion can be found here:
> https://github.com/git-for-windows/git/issues/1063
> 
> The error message results from unpack_object_header_buffer() which
> had its most-recent meaningful change in 'ea4f9685:
> unpack_object_header_buffer(): clear the size field upon error' (in
> 2011).
> 
> In my opinion, the correct thing to do would be to replace all
> 'unsigned long's that refer to an object size and replace them with
> 'size_t'. However, a simple "git grep 'unsigned long size'" reveals
> 194 results, and there are other permutations of names and pointer
> types all over.
> 
> This conversion would be a significant patch, so I wanted to get the
> community's thoughts on this conversion.
> 
> If there are small, isolated chunks that can be done safely, then
> this may be a good target for a first patch.

I think that an effort like this would definitely be worthwhile.  Much
like the unsigned char[20] -> struct object_id conversion I would think
that the best way to go about such a conversion would be to do it in
small chunks as you've mentioned.  That way you are only causing churn
in hopefully small parts of the code base at a time instead of one
monolithic change that is sure to cause conflicts.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] 'unsigned long' to 'size_t' conversion
  2017-12-06 15:08 [RFC] 'unsigned long' to 'size_t' conversion Derrick Stolee
  2017-12-06 16:48 ` Thomas Braun
  2017-12-06 18:10 ` Brandon Williams
@ 2017-12-06 21:43 ` Jeff King
  2 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2017-12-06 21:43 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff Hostetler, Junio C Hamano

On Wed, Dec 06, 2017 at 10:08:23AM -0500, Derrick Stolee wrote:

> There are several places in Git where we refer to the size of an object by
> an 'unsigned long' instead of a 'size_t'. In 64-bit Linux, 'unsigned long'
> is 8 bytes, but in 64-bit Windows it is 4 bytes.
> 
> The main issue with this conversion is that large objects fail to load (they
> seem to hash and store just fine). For example, the following 'blob8gb' is
> an 8 GB file where the ith byte is equal to i % 256:

Yeah, I think there's widespread agreement that this needs fixing. It's
just big enough that nobody has tackled it. If you're willing, that
sounds great to me. ;)

> In my opinion, the correct thing to do would be to replace all 'unsigned
> long's that refer to an object size and replace them with 'size_t'. However,
> a simple "git grep 'unsigned long size'" reveals 194 results, and there are
> other permutations of names and pointer types all over.

Yep. I think it's actually even trickier than that, though. Something
like off_t is probably the right type for the abstract concept of an
object size, since objects can be larger than the address space (in
which case we'd generally hit streaming code paths when possible).

But then of course we'd want to use size_t for objects that we've
actually placed into a buffers. At first glance it might be OK to just
use a larger type everywhere, but I think that runs into some funny
situations. For instance, you would not want to pass an off_t to
xmalloc(), as it is truncated (which may lead to a too-small buffer and
then a heap-smashing overflow).

So in an ideal world it is something like:

  - abstract object types are always off_t

  - lower-level code that takes an object in a buffer always uses size_t

  - at the conversion point, we must always use xsize_t() to catch
    conversion problems.

  - Any time that xsize_t() might die is a place where the caller should
    probably figure out if it can use a streaming code path for large
    objects.

> This conversion would be a significant patch, so I wanted to get the
> community's thoughts on this conversion.
> 
> If there are small, isolated chunks that can be done safely, then this may
> be a good target for a first patch.

Yes, I agree this probably has to be done incrementally. The threads
Thomas linked might be good starting points. But do be careful with
incremental changes that you don't introduce any spots where we might
get a value mismatch that used to be consistently truncated. E.g., this
code is wrong but not a security problem:

  size_t full_len = strlen(src); /* ok */
  unsigned long len = full_len; /* possible truncation! */
  char *dst = xmalloc(len); /* buffer possibly smaller than full_len */
  memcpy(dst, src, len); /* copies truncated value */

But this has a vulnerability:

  memcpy(dst, src, full_len); /* whoops, buffer is too small! */

Obviously this is a stupid toy example, but in the real world things
like that "unsigned long" assignment happen via implicit conversion in
function parameters. Or we never have "full_len" in the first place, and
instead build it up over a series of operations that write into the
buffer. Etc.

I made a pass over the whole code-base a while back to fix any existing
problems and introduce helpers to make it harder to get this wrong
(e.g., by making sure a consistent value is used for allocating and
populating a buffer). So I _hope_ that it should be hard to accidentally
regress any code by switching out types. But it's something to be aware
of.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-12-06 21:43 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-06 15:08 [RFC] 'unsigned long' to 'size_t' conversion Derrick Stolee
2017-12-06 16:48 ` Thomas Braun
2017-12-06 18:10 ` Brandon Williams
2017-12-06 21:43 ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.