Changed path filter hash differs from murmur3 if char is signed

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Changed path filter hash differs from murmur3 if char is signed
@ 2023-05-11 22:40 Jonathan Tan
  2023-05-11 22:51 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Tan @ 2023-05-11 22:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, derrickstolee

This issue arose while I was implementing changed path filters (the
bloom filters in the commit graph file) for JGit [1].

It turns out that if a repository contains paths that have at least one
character greater than 0x7f, Git writes these filters and interprets
them (when read from the commit graph file) differently depending on
whether "char" is signed or unsigned, and when "char" is signed (this
can be controlled by a compiler flag), the implementation of the murmur3
hash differs from the one in Wikipedia and elsewhere. (I looked into
Git's Makefile and didn't see anything that controlled whether "char"
is signed, so I presume that the default of the compiler used applies.)
This can be seen from the murmur3_seeded() function in bloom.c:

> uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
> {
[snip]
> 		uint32_t byte1 = (uint32_t)data[4*i];

Note that data[4*i] is of type "char". When I first read this, I thought
that no sign extension would occur regardless of the signedness of char
(that is, (uint32_t)(signed char)-1 == (uint32_t)(unsigned char)-1)
but it turns out that sign extension does occur if the thing being cast
is signed (that is, (uint32_t)(signed char)-1 != (uint32_t)(unsigned
char)-1) [2]. The implementation of the murmur3 hash in Wikipedia (and
presumably the intention of the author here, since the variable is named
"byte1") expects this value to not exceed 0xff, but this is not the case
if sign extension occurs.

I looked if this was discussed at the time this patch was presented [3]
but couldn't find anything related.

So...how do we proceed? I can see at least 2 ways:

 - Decide that we're going to stick with the details of the existing
   implementation and declare that "data" is always interpreted as signed.
   In that case, I would put "signed" wherever necessary, rename the
   function to something that is not "murmur3", and change the names of
   byte1 etc. to indicate that they are not constrained to be less than or
   equal to 0xff.

 - Bump the version number to 2 and correct the implementation to
   match murmur3 (so, "data" is unsigned). Then we would have to think of
   a transition plan. One possible one might be to always reject version
   1 bloom filters, which I'm personally OK with, but it may seem too
   heavy a cost to some since in the perhaps typical case where a repo has
   filenames restricted to 0x7f and below, the existing bloom filters are
   still correct.

Of the two, I prefer the latter, but I'm open to suggestions.

[1] https://git.eclipse.org/r/c/jgit/jgit/+/201851/2
[2] Verified in practice and also prescribed by the C standard; see
  e.g. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf section
  6.3.1.3 point 2.
[3] https://lore.kernel.org/git/a5aa3415c05ee9bc67a9471445a20c71a9834673.1586192395.git.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan
@ 2023-05-11 22:51 ` Junio C Hamano
  2023-05-11 23:10   ` Taylor Blau
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2023-05-11 22:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, derrickstolee

Jonathan Tan <jonathantanmy@google.com> writes:

> So...how do we proceed? I can see at least 2 ways:
>
>  - Decide that we're going to stick with the details of the existing
>    implementation and declare that "data" is always interpreted as signed.
>    In that case, I would put "signed" wherever necessary, rename the
>    function to something that is not "murmur3", and change the names of
>    byte1 etc. to indicate that they are not constrained to be less than or
>    equal to 0xff.
>
>  - Bump the version number to 2 and correct the implementation to
>    match murmur3 (so, "data" is unsigned). Then we would have to think of
>    a transition plan. One possible one might be to always reject version
>    1 bloom filters, which I'm personally OK with, but it may seem too
>    heavy a cost to some since in the perhaps typical case where a repo has
>    filenames restricted to 0x7f and below, the existing bloom filters are
>    still correct.

If path filter hashing were merely advisory, in the sense that if a
matching data is found, great, the processing goes faster, but if
not, we would get correct results albeit not so quickly, a third
option would be to just update the implementation without updating
the version number.  But we may not be so lucky---you must have seen
a wrong result returned quickly, which is not what we want to see.

But if I recall correctly we made the file format in such a way that
bumping the version number is cheap in that transition can appear
seamless.  An updated implementation can just be told to _ignore_
old and possibly incorrect Bloom filters until it gets told to
recompute, at which time it can write a correct one with a new
version number.  So I would prefer your "Bump the version number and
ignore the old and possibly wrong data".

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-11 22:51 ` Junio C Hamano
@ 2023-05-11 23:10   ` Taylor Blau
  2023-05-12 17:33     ` Jonathan Tan
  0 siblings, 1 reply; 7+ messages in thread
From: Taylor Blau @ 2023-05-11 23:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, derrickstolee

On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > So...how do we proceed? I can see at least 2 ways:
> >
> >  - Decide that we're going to stick with the details of the existing
> >    implementation and declare that "data" is always interpreted as signed.
> >    In that case, I would put "signed" wherever necessary, rename the
> >    function to something that is not "murmur3", and change the names of
> >    byte1 etc. to indicate that they are not constrained to be less than or
> >    equal to 0xff.
> >
> >  - Bump the version number to 2 and correct the implementation to
> >    match murmur3 (so, "data" is unsigned). Then we would have to think of
> >    a transition plan. One possible one might be to always reject version
> >    1 bloom filters, which I'm personally OK with, but it may seem too
> >    heavy a cost to some since in the perhaps typical case where a repo has
> >    filenames restricted to 0x7f and below, the existing bloom filters are
> >    still correct.
>
> If path filter hashing were merely advisory, in the sense that if a
> matching data is found, great, the processing goes faster, but if
> not, we would get correct results albeit not so quickly, a third
> option would be to just update the implementation without updating
> the version number.  But we may not be so lucky---you must have seen
> a wrong result returned quickly, which is not what we want to see.

Right; from my understanding of Jonathan's message, I believe that we
would get the wrong results if the implementation of not-quite-murmur3
were corrected without updating the on-disk Bloom filters.

We already have the bloom_filter_settings struct, which could also
encode whether or not the data was computed with sign extension or not.
If we are consistent in computing the hashes when we write Bloom filters
and when we re-compute hashes to query them, I'd think we would be able
to reuse the existing filters.

That would be nice to do if we could. Unfortunately, I don't think there
is an easy way to determine the signed-ness of an existing set of Bloom
filters, nor a guarantee that they all have the same signed-ness (IOW,
the Bloom filters could have been built up across multiple copies of
Git, each with different compiler settings).

So I'm not sure that that is a productive direction for us to take.

> But if I recall correctly we made the file format in such a way that
> bumping the version number is cheap in that transition can appear
> seamless.  An updated implementation can just be told to _ignore_
> old and possibly incorrect Bloom filters until it gets told to
> recompute, at which time it can write a correct one with a new
> version number.  So I would prefer your "Bump the version number and
> ignore the old and possibly wrong data".

Version numbers are easy to increment, as is the case when adding new
chunks to the chunked file format used by commit-graph,
multi-pack-index, etc.

However, computing Bloom filters en-masse from scratch is fairly
expensive. At GitHub when we were rolling out Bloom filters to all
repositories a few years ago, I wrote 809e0327f5
(builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18)
to avoid spending too much time computing new filters from scratch.

If we do have to re-compute all of the filters from scratch, it may be
worth considering enumerating all of the repository's paths first and
seeing if any of them are >0xff. If none are, we can propagate the
existing filters forward without having to compute new ones.

Better yet, we should be able to reuse existing Bloom filter data for
paths that have all characters <=0xff, and only recompute them where
necessary. That makes much more sense than the previous paragraph.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-11 23:10   ` Taylor Blau
@ 2023-05-12 17:33     ` Jonathan Tan
  2023-05-12 19:42       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Tan @ 2023-05-12 17:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Tan, Junio C Hamano, git, derrickstolee

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote:
> > If path filter hashing were merely advisory, in the sense that if a
> > matching data is found, great, the processing goes faster, but if
> > not, we would get correct results albeit not so quickly, a third
> > option would be to just update the implementation without updating
> > the version number.  But we may not be so lucky---you must have seen
> > a wrong result returned quickly, which is not what we want to see.
> 
> Right; from my understanding of Jonathan's message, I believe that we
> would get the wrong results if the implementation of not-quite-murmur3
> were corrected without updating the on-disk Bloom filters.

Yes - if the bloom filter contained junk data (in our example, created
using a different hash function on filenames that have characters that
exceed 0x7f), the bloom filter would report "no, this commit does not
contain a change in such-and-such path" and then we would skip the
commit, even if the commit did have a change in that path.

> We already have the bloom_filter_settings struct, which could also
> encode whether or not the data was computed with sign extension or not.
> If we are consistent in computing the hashes when we write Bloom filters
> and when we re-compute hashes to query them, I'd think we would be able
> to reuse the existing filters.
> 
> That would be nice to do if we could. Unfortunately, I don't think there
> is an easy way to determine the signed-ness of an existing set of Bloom
> filters, nor a guarantee that they all have the same signed-ness (IOW,
> the Bloom filters could have been built up across multiple copies of
> Git, each with different compiler settings).
> 
> So I'm not sure that that is a productive direction for us to take.

Agreed.

> > But if I recall correctly we made the file format in such a way that
> > bumping the version number is cheap in that transition can appear
> > seamless.  An updated implementation can just be told to _ignore_
> > old and possibly incorrect Bloom filters until it gets told to
> > recompute, at which time it can write a correct one with a new
> > version number.  So I would prefer your "Bump the version number and
> > ignore the old and possibly wrong data".
> 
> Version numbers are easy to increment, as is the case when adding new
> chunks to the chunked file format used by commit-graph,
> multi-pack-index, etc.
> 
> However, computing Bloom filters en-masse from scratch is fairly
> expensive. At GitHub when we were rolling out Bloom filters to all
> repositories a few years ago, I wrote 809e0327f5
> (builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18)
> to avoid spending too much time computing new filters from scratch.

Thanks for this pointer.

> If we do have to re-compute all of the filters from scratch, it may be
> worth considering enumerating all of the repository's paths first and
> seeing if any of them are >0xff. If none are, we can propagate the
> existing filters forward without having to compute new ones.

One way server operators can do this is by:
 - patching their Git to ignore the bloom filter version number
 - in a batch job, enumerate all trees and see their filenames
 - for the repos that have <=0x7f filenames, bump version number
 - update Git to the latest version (that uses the new hash)

Enumerating all trees saves on revision walking, which hopefully will
speed things up.

I don't have statistics on this, but if the majority of repos have
only <=0x7f filenames (which seems reasonable to me), this might save
sufficient work that we can proceed with bumping the version number and
ignoring old data.

> Better yet, we should be able to reuse existing Bloom filter data for
> paths that have all characters <=0xff, and only recompute them where
> necessary. That makes much more sense than the previous paragraph.

I would think that at the point that we know the paths that have been
changed for a commit, the cost of computation of the bloom filter (all
in the CPU) is negligible compared to the I/O it took for us to retrieve
all the paths (so the solution of just ignoring old data seems better
to me). But perhaps at large scales, we do want to save the computation
time anyway.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-12 17:33     ` Jonathan Tan
@ 2023-05-12 19:42       ` Junio C Hamano
  2023-05-12 20:54         ` Jonathan Tan
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2023-05-12 19:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Taylor Blau, git, derrickstolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Yes - if the bloom filter contained junk data (in our example, created
> using a different hash function on filenames that have characters that
> exceed 0x7f), the bloom filter would report "no, this commit does not
> contain a change in such-and-such path" and then we would skip the
> commit, even if the commit did have a change in that path.

Just to help my understanding (read: I am not suggesting this as one
of the holes to exploit to help a smooth transition), does the above
mean that, as long as the path we are asking about does not have a
byte with the high-bit set, we would be OK, even if the Bloom filter
were constructed with a bad function and there were other paths that
had such a byte?

> I don't have statistics on this, but if the majority of repos have
> only <=0x7f filenames (which seems reasonable to me), this might save
> sufficient work that we can proceed with bumping the version number and
> ignoring old data.
>
>> Better yet, we should be able to reuse existing Bloom filter data for
>> paths that have all characters <=0xff, and only recompute them where

"ff" -> "7f" I presume?

>> necessary. That makes much more sense than the previous paragraph.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-12 19:42       ` Junio C Hamano
@ 2023-05-12 20:54         ` Jonathan Tan
  2023-05-12 21:27           ` Taylor Blau
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Tan @ 2023-05-12 20:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Taylor Blau, git, derrickstolee

Junio C Hamano <gitster@pobox.com> writes:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Yes - if the bloom filter contained junk data (in our example, created
> > using a different hash function on filenames that have characters that
> > exceed 0x7f), the bloom filter would report "no, this commit does not
> > contain a change in such-and-such path" and then we would skip the
> > commit, even if the commit did have a change in that path.
> 
> Just to help my understanding (read: I am not suggesting this as one
> of the holes to exploit to help a smooth transition), does the above
> mean that, as long as the path we are asking about does not have a
> byte with the high-bit set, we would be OK, even if the Bloom filter
> were constructed with a bad function and there were other paths that
> had such a byte?

Ah, thanks for asking. Yes, the false negative I describe above only
happens when the path we're querying for contains a character >0x7f (so
if there is no byte with the high-bit set, it is still OK).

> > I don't have statistics on this, but if the majority of repos have
> > only <=0x7f filenames (which seems reasonable to me), this might save
> > sufficient work that we can proceed with bumping the version number and
> > ignoring old data.
> >
> >> Better yet, we should be able to reuse existing Bloom filter data for
> >> paths that have all characters <=0xff, and only recompute them where
> 
> "ff" -> "7f" I presume?

That was my assumption too, but Taylor can clarify.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Changed path filter hash differs from murmur3 if char is signed
  2023-05-12 20:54         ` Jonathan Tan
@ 2023-05-12 21:27           ` Taylor Blau
  0 siblings, 0 replies; 7+ messages in thread
From: Taylor Blau @ 2023-05-12 21:27 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, derrickstolee

On Fri, May 12, 2023 at 01:54:27PM -0700, Jonathan Tan wrote:
> > > I don't have statistics on this, but if the majority of repos have
> > > only <=0x7f filenames (which seems reasonable to me), this might save
> > > sufficient work that we can proceed with bumping the version number and
> > > ignoring old data.
> > >
> > >> Better yet, we should be able to reuse existing Bloom filter data for
> > >> paths that have all characters <=0xff, and only recompute them where
> >
> > "ff" -> "7f" I presume?
>
> That was my assumption too, but Taylor can clarify.

Sorry, yes, I meant "0x7f" here not "0xff". Thanks for catching, both.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-05-12 21:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan
2023-05-11 22:51 ` Junio C Hamano
2023-05-11 23:10   ` Taylor Blau
2023-05-12 17:33     ` Jonathan Tan
2023-05-12 19:42       ` Junio C Hamano
2023-05-12 20:54         ` Jonathan Tan
2023-05-12 21:27           ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).