* Changed path filter hash differs from murmur3 if char is signed @ 2023-05-11 22:40 Jonathan Tan 2023-05-11 22:51 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jonathan Tan @ 2023-05-11 22:40 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, derrickstolee This issue arose while I was implementing changed path filters (the bloom filters in the commit graph file) for JGit [1]. It turns out that if a repository contains paths that have at least one character greater than 0x7f, Git writes these filters and interprets them (when read from the commit graph file) differently depending on whether "char" is signed or unsigned, and when "char" is signed (this can be controlled by a compiler flag), the implementation of the murmur3 hash differs from the one in Wikipedia and elsewhere. (I looked into Git's Makefile and didn't see anything that controlled whether "char" is signed, so I presume that the default of the compiler used applies.) This can be seen from the murmur3_seeded() function in bloom.c: > uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) > { [snip] > uint32_t byte1 = (uint32_t)data[4*i]; Note that data[4*i] is of type "char". When I first read this, I thought that no sign extension would occur regardless of the signedness of char (that is, (uint32_t)(signed char)-1 == (uint32_t)(unsigned char)-1) but it turns out that sign extension does occur if the thing being cast is signed (that is, (uint32_t)(signed char)-1 != (uint32_t)(unsigned char)-1) [2]. The implementation of the murmur3 hash in Wikipedia (and presumably the intention of the author here, since the variable is named "byte1") expects this value to not exceed 0xff, but this is not the case if sign extension occurs. I looked if this was discussed at the time this patch was presented [3] but couldn't find anything related. So...how do we proceed? I can see at least 2 ways: - Decide that we're going to stick with the details of the existing implementation and declare that "data" is always interpreted as signed. In that case, I would put "signed" wherever necessary, rename the function to something that is not "murmur3", and change the names of byte1 etc. to indicate that they are not constrained to be less than or equal to 0xff. - Bump the version number to 2 and correct the implementation to match murmur3 (so, "data" is unsigned). Then we would have to think of a transition plan. One possible one might be to always reject version 1 bloom filters, which I'm personally OK with, but it may seem too heavy a cost to some since in the perhaps typical case where a repo has filenames restricted to 0x7f and below, the existing bloom filters are still correct. Of the two, I prefer the latter, but I'm open to suggestions. [1] https://git.eclipse.org/r/c/jgit/jgit/+/201851/2 [2] Verified in practice and also prescribed by the C standard; see e.g. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf section 6.3.1.3 point 2. [3] https://lore.kernel.org/git/a5aa3415c05ee9bc67a9471445a20c71a9834673.1586192395.git.gitgitgadget@gmail.com/ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan @ 2023-05-11 22:51 ` Junio C Hamano 2023-05-11 23:10 ` Taylor Blau 0 siblings, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2023-05-11 22:51 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, derrickstolee Jonathan Tan <jonathantanmy@google.com> writes: > So...how do we proceed? I can see at least 2 ways: > > - Decide that we're going to stick with the details of the existing > implementation and declare that "data" is always interpreted as signed. > In that case, I would put "signed" wherever necessary, rename the > function to something that is not "murmur3", and change the names of > byte1 etc. to indicate that they are not constrained to be less than or > equal to 0xff. > > - Bump the version number to 2 and correct the implementation to > match murmur3 (so, "data" is unsigned). Then we would have to think of > a transition plan. One possible one might be to always reject version > 1 bloom filters, which I'm personally OK with, but it may seem too > heavy a cost to some since in the perhaps typical case where a repo has > filenames restricted to 0x7f and below, the existing bloom filters are > still correct. If path filter hashing were merely advisory, in the sense that if a matching data is found, great, the processing goes faster, but if not, we would get correct results albeit not so quickly, a third option would be to just update the implementation without updating the version number. But we may not be so lucky---you must have seen a wrong result returned quickly, which is not what we want to see. But if I recall correctly we made the file format in such a way that bumping the version number is cheap in that transition can appear seamless. An updated implementation can just be told to _ignore_ old and possibly incorrect Bloom filters until it gets told to recompute, at which time it can write a correct one with a new version number. So I would prefer your "Bump the version number and ignore the old and possibly wrong data". Thanks. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-11 22:51 ` Junio C Hamano @ 2023-05-11 23:10 ` Taylor Blau 2023-05-12 17:33 ` Jonathan Tan 0 siblings, 1 reply; 7+ messages in thread From: Taylor Blau @ 2023-05-11 23:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Tan, git, derrickstolee On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@google.com> writes: > > > So...how do we proceed? I can see at least 2 ways: > > > > - Decide that we're going to stick with the details of the existing > > implementation and declare that "data" is always interpreted as signed. > > In that case, I would put "signed" wherever necessary, rename the > > function to something that is not "murmur3", and change the names of > > byte1 etc. to indicate that they are not constrained to be less than or > > equal to 0xff. > > > > - Bump the version number to 2 and correct the implementation to > > match murmur3 (so, "data" is unsigned). Then we would have to think of > > a transition plan. One possible one might be to always reject version > > 1 bloom filters, which I'm personally OK with, but it may seem too > > heavy a cost to some since in the perhaps typical case where a repo has > > filenames restricted to 0x7f and below, the existing bloom filters are > > still correct. > > If path filter hashing were merely advisory, in the sense that if a > matching data is found, great, the processing goes faster, but if > not, we would get correct results albeit not so quickly, a third > option would be to just update the implementation without updating > the version number. But we may not be so lucky---you must have seen > a wrong result returned quickly, which is not what we want to see. Right; from my understanding of Jonathan's message, I believe that we would get the wrong results if the implementation of not-quite-murmur3 were corrected without updating the on-disk Bloom filters. We already have the bloom_filter_settings struct, which could also encode whether or not the data was computed with sign extension or not. If we are consistent in computing the hashes when we write Bloom filters and when we re-compute hashes to query them, I'd think we would be able to reuse the existing filters. That would be nice to do if we could. Unfortunately, I don't think there is an easy way to determine the signed-ness of an existing set of Bloom filters, nor a guarantee that they all have the same signed-ness (IOW, the Bloom filters could have been built up across multiple copies of Git, each with different compiler settings). So I'm not sure that that is a productive direction for us to take. > But if I recall correctly we made the file format in such a way that > bumping the version number is cheap in that transition can appear > seamless. An updated implementation can just be told to _ignore_ > old and possibly incorrect Bloom filters until it gets told to > recompute, at which time it can write a correct one with a new > version number. So I would prefer your "Bump the version number and > ignore the old and possibly wrong data". Version numbers are easy to increment, as is the case when adding new chunks to the chunked file format used by commit-graph, multi-pack-index, etc. However, computing Bloom filters en-masse from scratch is fairly expensive. At GitHub when we were rolling out Bloom filters to all repositories a few years ago, I wrote 809e0327f5 (builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18) to avoid spending too much time computing new filters from scratch. If we do have to re-compute all of the filters from scratch, it may be worth considering enumerating all of the repository's paths first and seeing if any of them are >0xff. If none are, we can propagate the existing filters forward without having to compute new ones. Better yet, we should be able to reuse existing Bloom filter data for paths that have all characters <=0xff, and only recompute them where necessary. That makes much more sense than the previous paragraph. Thanks, Taylor ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-11 23:10 ` Taylor Blau @ 2023-05-12 17:33 ` Jonathan Tan 2023-05-12 19:42 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jonathan Tan @ 2023-05-12 17:33 UTC (permalink / raw) To: Taylor Blau; +Cc: Jonathan Tan, Junio C Hamano, git, derrickstolee Taylor Blau <me@ttaylorr.com> writes: > On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote: > > If path filter hashing were merely advisory, in the sense that if a > > matching data is found, great, the processing goes faster, but if > > not, we would get correct results albeit not so quickly, a third > > option would be to just update the implementation without updating > > the version number. But we may not be so lucky---you must have seen > > a wrong result returned quickly, which is not what we want to see. > > Right; from my understanding of Jonathan's message, I believe that we > would get the wrong results if the implementation of not-quite-murmur3 > were corrected without updating the on-disk Bloom filters. Yes - if the bloom filter contained junk data (in our example, created using a different hash function on filenames that have characters that exceed 0x7f), the bloom filter would report "no, this commit does not contain a change in such-and-such path" and then we would skip the commit, even if the commit did have a change in that path. > We already have the bloom_filter_settings struct, which could also > encode whether or not the data was computed with sign extension or not. > If we are consistent in computing the hashes when we write Bloom filters > and when we re-compute hashes to query them, I'd think we would be able > to reuse the existing filters. > > That would be nice to do if we could. Unfortunately, I don't think there > is an easy way to determine the signed-ness of an existing set of Bloom > filters, nor a guarantee that they all have the same signed-ness (IOW, > the Bloom filters could have been built up across multiple copies of > Git, each with different compiler settings). > > So I'm not sure that that is a productive direction for us to take. Agreed. > > But if I recall correctly we made the file format in such a way that > > bumping the version number is cheap in that transition can appear > > seamless. An updated implementation can just be told to _ignore_ > > old and possibly incorrect Bloom filters until it gets told to > > recompute, at which time it can write a correct one with a new > > version number. So I would prefer your "Bump the version number and > > ignore the old and possibly wrong data". > > Version numbers are easy to increment, as is the case when adding new > chunks to the chunked file format used by commit-graph, > multi-pack-index, etc. > > However, computing Bloom filters en-masse from scratch is fairly > expensive. At GitHub when we were rolling out Bloom filters to all > repositories a few years ago, I wrote 809e0327f5 > (builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18) > to avoid spending too much time computing new filters from scratch. Thanks for this pointer. > If we do have to re-compute all of the filters from scratch, it may be > worth considering enumerating all of the repository's paths first and > seeing if any of them are >0xff. If none are, we can propagate the > existing filters forward without having to compute new ones. One way server operators can do this is by: - patching their Git to ignore the bloom filter version number - in a batch job, enumerate all trees and see their filenames - for the repos that have <=0x7f filenames, bump version number - update Git to the latest version (that uses the new hash) Enumerating all trees saves on revision walking, which hopefully will speed things up. I don't have statistics on this, but if the majority of repos have only <=0x7f filenames (which seems reasonable to me), this might save sufficient work that we can proceed with bumping the version number and ignoring old data. > Better yet, we should be able to reuse existing Bloom filter data for > paths that have all characters <=0xff, and only recompute them where > necessary. That makes much more sense than the previous paragraph. I would think that at the point that we know the paths that have been changed for a commit, the cost of computation of the bloom filter (all in the CPU) is negligible compared to the I/O it took for us to retrieve all the paths (so the solution of just ignoring old data seems better to me). But perhaps at large scales, we do want to save the computation time anyway. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-12 17:33 ` Jonathan Tan @ 2023-05-12 19:42 ` Junio C Hamano 2023-05-12 20:54 ` Jonathan Tan 0 siblings, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2023-05-12 19:42 UTC (permalink / raw) To: Jonathan Tan; +Cc: Taylor Blau, git, derrickstolee Jonathan Tan <jonathantanmy@google.com> writes: > Yes - if the bloom filter contained junk data (in our example, created > using a different hash function on filenames that have characters that > exceed 0x7f), the bloom filter would report "no, this commit does not > contain a change in such-and-such path" and then we would skip the > commit, even if the commit did have a change in that path. Just to help my understanding (read: I am not suggesting this as one of the holes to exploit to help a smooth transition), does the above mean that, as long as the path we are asking about does not have a byte with the high-bit set, we would be OK, even if the Bloom filter were constructed with a bad function and there were other paths that had such a byte? > I don't have statistics on this, but if the majority of repos have > only <=0x7f filenames (which seems reasonable to me), this might save > sufficient work that we can proceed with bumping the version number and > ignoring old data. > >> Better yet, we should be able to reuse existing Bloom filter data for >> paths that have all characters <=0xff, and only recompute them where "ff" -> "7f" I presume? >> necessary. That makes much more sense than the previous paragraph. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-12 19:42 ` Junio C Hamano @ 2023-05-12 20:54 ` Jonathan Tan 2023-05-12 21:27 ` Taylor Blau 0 siblings, 1 reply; 7+ messages in thread From: Jonathan Tan @ 2023-05-12 20:54 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Tan, Taylor Blau, git, derrickstolee Junio C Hamano <gitster@pobox.com> writes: > Jonathan Tan <jonathantanmy@google.com> writes: > > > Yes - if the bloom filter contained junk data (in our example, created > > using a different hash function on filenames that have characters that > > exceed 0x7f), the bloom filter would report "no, this commit does not > > contain a change in such-and-such path" and then we would skip the > > commit, even if the commit did have a change in that path. > > Just to help my understanding (read: I am not suggesting this as one > of the holes to exploit to help a smooth transition), does the above > mean that, as long as the path we are asking about does not have a > byte with the high-bit set, we would be OK, even if the Bloom filter > were constructed with a bad function and there were other paths that > had such a byte? Ah, thanks for asking. Yes, the false negative I describe above only happens when the path we're querying for contains a character >0x7f (so if there is no byte with the high-bit set, it is still OK). > > I don't have statistics on this, but if the majority of repos have > > only <=0x7f filenames (which seems reasonable to me), this might save > > sufficient work that we can proceed with bumping the version number and > > ignoring old data. > > > >> Better yet, we should be able to reuse existing Bloom filter data for > >> paths that have all characters <=0xff, and only recompute them where > > "ff" -> "7f" I presume? That was my assumption too, but Taylor can clarify. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Changed path filter hash differs from murmur3 if char is signed 2023-05-12 20:54 ` Jonathan Tan @ 2023-05-12 21:27 ` Taylor Blau 0 siblings, 0 replies; 7+ messages in thread From: Taylor Blau @ 2023-05-12 21:27 UTC (permalink / raw) To: Jonathan Tan; +Cc: Junio C Hamano, git, derrickstolee On Fri, May 12, 2023 at 01:54:27PM -0700, Jonathan Tan wrote: > > > I don't have statistics on this, but if the majority of repos have > > > only <=0x7f filenames (which seems reasonable to me), this might save > > > sufficient work that we can proceed with bumping the version number and > > > ignoring old data. > > > > > >> Better yet, we should be able to reuse existing Bloom filter data for > > >> paths that have all characters <=0xff, and only recompute them where > > > > "ff" -> "7f" I presume? > > That was my assumption too, but Taylor can clarify. Sorry, yes, I meant "0x7f" here not "0xff". Thanks for catching, both. Thanks, Taylor ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-05-12 21:28 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-05-11 22:40 Changed path filter hash differs from murmur3 if char is signed Jonathan Tan 2023-05-11 22:51 ` Junio C Hamano 2023-05-11 23:10 ` Taylor Blau 2023-05-12 17:33 ` Jonathan Tan 2023-05-12 19:42 ` Junio C Hamano 2023-05-12 20:54 ` Jonathan Tan 2023-05-12 21:27 ` Taylor Blau
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).