git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Changed path filter hash differs from murmur3 if char is signed
@ 2023-05-11 22:40 Jonathan Tan
  2023-05-11 22:51 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Tan @ 2023-05-11 22:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, derrickstolee

This issue arose while I was implementing changed path filters (the
bloom filters in the commit graph file) for JGit [1].

It turns out that if a repository contains paths that have at least one
character greater than 0x7f, Git writes these filters and interprets
them (when read from the commit graph file) differently depending on
whether "char" is signed or unsigned, and when "char" is signed (this
can be controlled by a compiler flag), the implementation of the murmur3
hash differs from the one in Wikipedia and elsewhere. (I looked into
Git's Makefile and didn't see anything that controlled whether "char"
is signed, so I presume that the default of the compiler used applies.)
This can be seen from the murmur3_seeded() function in bloom.c:

> uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
> {
[snip]
> 		uint32_t byte1 = (uint32_t)data[4*i];

Note that data[4*i] is of type "char". When I first read this, I thought
that no sign extension would occur regardless of the signedness of char
(that is, (uint32_t)(signed char)-1 == (uint32_t)(unsigned char)-1)
but it turns out that sign extension does occur if the thing being cast
is signed (that is, (uint32_t)(signed char)-1 != (uint32_t)(unsigned
char)-1) [2]. The implementation of the murmur3 hash in Wikipedia (and
presumably the intention of the author here, since the variable is named
"byte1") expects this value to not exceed 0xff, but this is not the case
if sign extension occurs.

I looked if this was discussed at the time this patch was presented [3]
but couldn't find anything related.

So...how do we proceed? I can see at least 2 ways:

 - Decide that we're going to stick with the details of the existing
   implementation and declare that "data" is always interpreted as signed.
   In that case, I would put "signed" wherever necessary, rename the
   function to something that is not "murmur3", and change the names of
   byte1 etc. to indicate that they are not constrained to be less than or
   equal to 0xff.

 - Bump the version number to 2 and correct the implementation to
   match murmur3 (so, "data" is unsigned). Then we would have to think of
   a transition plan. One possible one might be to always reject version
   1 bloom filters, which I'm personally OK with, but it may seem too
   heavy a cost to some since in the perhaps typical case where a repo has
   filenames restricted to 0x7f and below, the existing bloom filters are
   still correct.

Of the two, I prefer the latter, but I'm open to suggestions.

[1] https://git.eclipse.org/r/c/jgit/jgit/+/201851/2
[2] Verified in practice and also prescribed by the C standard; see
  e.g. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf section
  6.3.1.3 point 2.
[3] https://lore.kernel.org/git/a5aa3415c05ee9bc67a9471445a20c71a9834673.1586192395.git.gitgitgadget@gmail.com/



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-05-12 21:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan
2023-05-11 22:51 ` Junio C Hamano
2023-05-11 23:10   ` Taylor Blau
2023-05-12 17:33     ` Jonathan Tan
2023-05-12 19:42       ` Junio C Hamano
2023-05-12 20:54         ` Jonathan Tan
2023-05-12 21:27           ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).