* Re: Mozilla .git tree
       [not found] ` <9e4733910608291807q9b896e4sdbfaa9e49de58c2b@mail.gmail.com>
@ 2006-08-30  1:51 ` Shawn Pearce
  2006-08-30  2:25   ` Shawn Pearce
  2006-08-30  2:58   ` Jon Smirl
  0 siblings, 2 replies; 52+ messages in thread
From: Shawn Pearce @ 2006-08-30  1:51 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> I suspect the bulk of the file will be the base blobs. A zlib
> dictionary would help more with the trees and the 120K copies of the
> GPL in the files.

Here's what I got by taking the output of verify-pack -v run
against the 430 MiB Mozilla pack and running that through a simple
Perl script:

  COUNT BASE  commit:  197613
  COUNT BASE  tree:    154496
  COUNT BASE  blob:     49860
  COUNT BASE  tag:       1203
  COUNT DELTA commit:    3308
  COUNT DELTA tree:    976712
  COUNT DELTA blob:    579780
  COUNT DELTA tag:        353

Those are just raw numbers of objects of each type, broken out by
base and delta. We gotta alotta objects. :-)

We probably also have around 49,860 copies of the identical license
text (one per base object). I'm just assuming the xdelta algorithm
would recognize the identical run in the dependent object and
copy it from the base rather than use a literal insert command.
Thus I'm assuming the 579,780 deltas don't contain the license text.

  UNCOMP BASE  commit:   55 MiB
  UNCOMP BASE  tree:     30 MiB
  UNCOMP BASE  blob:    597 MiB
  UNCOMP BASE  tag:       0 MiB
  UNCOMP DELTA commit:    0 MiB
  UNCOMP DELTA tree:     44 MiB
  UNCOMP DELTA blob:    190 MiB
  UNCOMP DELTA tag:       0 MiB

These are the sizes of the objects and deltas prior to using zlib
to deflate them (aka the decompression buffer size, stored in the
object header).

  ZIPPED BASE  commit:   38 MiB
  ZIPPED BASE  tree:     26 MiB
  ZIPPED BASE  blob:    164 MiB
  ZIPPED BASE  tag:       0 MiB
  ZIPPED DELTA commit:    0 MiB
  ZIPPED DELTA tree:     73 MiB
  ZIPPED DELTA blob:    126 MiB
  ZIPPED DELTA tag:       0 MiB

These are the sizes of the objects within the pack, determined by
computing the difference in adjacent objects' offsets.
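Shawn's Perl script isn't shown, but the tally is easy to reconstruct. Here is a rough sketch in Python (Perl would do equally well), assuming the verify-pack -v line format of the day: `sha1 type size offset` for base objects, with `depth base-sha1` appended for delta objects.

```python
import collections

def tally(lines):
    """Classify each verify-pack -v object line as BASE or DELTA and
    sum counts and uncompressed sizes per object type."""
    count = collections.Counter()
    size = collections.Counter()
    for line in lines:
        f = line.split()
        # Skip trailer lines (chain-length histogram, pack name, etc.)
        if len(f) < 4 or f[1] not in ("commit", "tree", "blob", "tag"):
            continue
        # Delta lines carry two extra fields: depth and base sha1.
        kind = "DELTA" if len(f) >= 6 else "BASE"
        count[(kind, f[1])] += 1
        size[(kind, f[1])] += int(f[2])
    return count, size
```

Feeding it the output of `git verify-pack -v pack-*.idx` would produce the COUNT and UNCOMP tables above; the ZIPPED table additionally needs the offset differences between adjacent entries.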
55 MiB of commits compressed into 38 MiB (saved 30%). We can
probably do better.

30 MiB of tree bases compressed into 26 MiB (saved 13.3%). With
154,496 tree bases I think we can do better _somehow_. It may just
mean using more deltas so we have fewer bases; we don't have 154k
unique directories. Or it may be that a tree-specific pack
dictionary is enough.

44 MiB of tree deltas compressed into 73 MiB (saved -65.9%). Ouch!
We wasted 29 MiB by trying to compress tree deltas. Way to go zlib!

Blob bases were 597 MiB uncompressed, 164 MiB compressed (saved 72%).
Blob deltas were 190 MiB uncompressed, 126 MiB compressed (saved 33%).
We might be able to do better here, but we're already faring pretty
well.

For comparison, a .tar.gz of the ,v files from CVS is around 550 MiB.
We're already smaller than that in a pack file. But ,v is not the
most compact representation; I hoped we could do even better than
430 MiB.

I ran the same script against my Git pack. There I'm seeing the
same explosion of tree deltas: uncompressed they are 1380174 bytes,
compressed they are 1620439 bytes (-17.4% saved).

We may well have a general problem here with always compressing
tree deltas. It appears to be a minor dent in the space required
for a pack, but it's certainly a non-trivial amount on the larger
Mozilla pack. The wasted space is 2% of the Git pack and 6.7% of
the Mozilla pack.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread
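The quoted percentages can be sanity-checked with a few lines of arithmetic. This is purely illustrative; the MiB figures are the rounded ones from the message above.

```python
def savings(uncompressed_mib, zipped_mib):
    """Space saved by deflation as a percentage of the uncompressed
    size; negative means the 'compressed' form actually grew."""
    return round(100.0 * (uncompressed_mib - zipped_mib) / uncompressed_mib, 1)

# commits: 55 -> 38, tree bases: 30 -> 26, tree deltas: 44 -> 73,
# blob bases: 597 -> 164, blob deltas: 190 -> 126
```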
* Re: Mozilla .git tree
  2006-08-30  1:51 ` Mozilla .git tree Shawn Pearce
@ 2006-08-30  2:25   ` Shawn Pearce
  2006-08-30  2:58   ` Jon Smirl
  1 sibling, 0 replies; 52+ messages in thread
From: Shawn Pearce @ 2006-08-30  2:25 UTC (permalink / raw)
To: Jon Smirl; +Cc: git

Shawn Pearce <spearce@spearce.org> wrote:
> We may well have a general problem here with always compressing
> tree deltas. It appears to be a minor dent in the space required
> for a pack but its certainly a non-trivial amount on the larger
> Mozilla pack. The wasted space is 2% of the Git pack and its 6.7%
> of the Mozilla pack.

Whoops. mugwump (Sam Vilain) just pointed out on #git that I didn't
account for the 20-byte base when comparing the offset differences
(claimed compressed size) and the uncompressed size. Nor did I
account for the variable-sized headers.

I stated that 29 MiB of the Mozilla pack was wasted by 976712 tree
deltas. Of that 29 MiB we know that 18.6 MiB must be the 20-byte
base-SHA1 header. That leaves 10.4 MiB unaccounted for. But we also
have the variable-length header; let's say the average uncompressed
length of a tree delta is 44 MiB/976712, so 47 bytes. That average
length can be encoded in two header bytes, so that's another 1.8 MiB.

Which leaves us with 8.6 MiB of wasted space. Which is clearly not
the 29 MiB I previously stated. But we're still wasting a small
amount of space over not compressing them.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread
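Sam's correction can be rechecked numerically. This sketch just redoes the arithmetic from the message above (the small discrepancies come from rounding the MiB figures in the email):

```python
MIB = 1024.0 * 1024.0
n = 976712                  # tree deltas in the Mozilla pack
apparent_waste = 29.0       # MiB, from the previous message

base_ref = n * 20 / MIB     # 20-byte base-SHA1 per delta entry (~18.6 MiB)
avg_delta = 44 * MIB / n    # average tree delta is ~47 bytes...
obj_hdr = n * 2 / MIB       # ...so ~2 header bytes each (~1.9 MiB)

real_waste = apparent_waste - base_ref - obj_hdr   # ~8.5 MiB
```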
* Re: Mozilla .git tree 2006-08-30 1:51 ` Mozilla .git tree Shawn Pearce 2006-08-30 2:25 ` Shawn Pearce @ 2006-08-30 2:58 ` Jon Smirl 2006-08-30 3:10 ` Shawn Pearce 1 sibling, 1 reply; 52+ messages in thread From: Jon Smirl @ 2006-08-30 2:58 UTC (permalink / raw) To: Shawn Pearce; +Cc: git sha1s are effectively 20 byte pointer addresses into the pack. With 2M objects you can easily get away with 4 byte address and a mapping table. Another idea would be to replace the 20 byte sha1 in tree objects with 32b file offsets - requiring that anything the tree refers to has to already be in the pack before the tree entry can be written. On 8/29/06, Shawn Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > I suspect the bulk of the file will be the base blobs. A zlib > > dictionary would help more with the trees and the 120K copies of the > > GPL in the files. > > Here's what I got by taking the output of verify-pack -v run > against the 430 MiB Mozilla pack and running that through a simple > Perl script: > > COUNT BASE commit: 197613 > COUNT BASE tree: 154496 > COUNT BASE blob: 49860 > COUNT BASE tag: 1203 > COUNT DELTA commit: 3308 > COUNT DELTA tree: 976712 > COUNT DELTA blob: 579780 > COUNT DELTA tag: 353 > > Those are just raw numbers of objects of each type broken out by > base and delta. We gotta alotta objects. :-) > > We probably also have around 49,860 copies of the identical license > text (one per base object). I'm just assuming the xdelta algorithm The Mozilla license has changed at least five times. That makes 250K copies of licenses. > would recognize the identical run in the dependent object and > copy it from the base rather than use a literal insert command. > Thus I'm assuming the 579,780 deltas don't contain the license text. 
> > UNCOMP BASE commit: 55 MiB > UNCOMP BASE tree: 30 MiB > UNCOMP BASE blob: 597 MiB > UNCOMP BASE tag: 0 MiB > UNCOMP DELTA commit: 0 MiB > UNCOMP DELTA tree: 44 MiB > UNCOMP DELTA blob: 190 MiB > UNCOMP DELTA tag: 0 MiB > > These are the sizes of the objects and deltas prior to using zlib > to deflate them (aka the decompression buffer size, stored in the > object header). > > ZIPPED BASE commit: 38 MiB > ZIPPED BASE tree: 26 MiB > ZIPPED BASE blob: 164 MiB > ZIPPED BASE tag: 0 MiB > ZIPPED DELTA commit : 0 MiB > ZIPPED DELTA tree: 73 MiB > ZIPPED DELTA blob: 126 MiB > ZIPPED DELTA tag: 0 MiB > > These are the sizes of the objects within the pack, determined by > computing the difference in adjacent objects' offsets. > > > 55 MiB of commits compressed into 38 MiB (saved 30%). > We can probably do better. > > 30 MiB of tree bases compressed into 26 MiB (saved 13.3%). > With 154,496 tree bases I think we can do better _somehow_. It may > just mean using more deltas so we have less bases. We don't have > 154k unique directories. It may just mean using a tree specific > pack dictionary is enough. I suspect a tree specific zlib dictionary will be a good win. But those trees contain a lot of uncompressible data, the sha1. Those sha1s are in binary not hex, right? > > 44 MiB of tree deltas compressed into 73 MiB (saved -65.9%). > Ouch! We wasted 29 MiB by trying to compress tree deltas. > Way to go zlib! The git tools can be modified to set the compression level to 0 before compressing tree deltas. There is no need to change the decoding code. Even with compression level 0 they still get slightly larger because zlib tacks on a header. > > Blob bases were 597 MiB uncompressed, 164 MiB compressed (saved 72%). > Blob deltas were 190 MiB uncompressed, 126 MiB compressed (saved 33%). > We might be able to do better here, but we're already fairing pretty > well. > > > To compare a .tar.gz of the ,v files from CVS is around 550 MiB. 
> We're already smaller than that in a pack file. But ,v is not the
> most compact representation. I hoped we could do even better than
> 430 MiB.
>
> I ran the same script against my Git pack. There I'm seeing the
> same explosion of tree deltas: uncompressed they are 1380174 bytes,
> compressed they are 1620439 bytes (-17.4% saved).
>
> We may well have a general problem here with always compressing
> tree deltas. It appears to be a minor dent in the space required
> for a pack but its certainly a non-trivial amount on the larger
> Mozilla pack. The wasted space is 2% of the Git pack and its 6.7%
> of the Mozilla pack.

I'm still interested in getting an idea of how much a Clucene-type
dictionary compression would help. It is hard to see how you could
get smaller than that method. Note that you don't want to include
the indexing portion of Clucene in the comparison, just the part
where everything gets tokenized into a big dictionary, arithmetic
encoded based on usage frequency, and then the strings in the
original documents are replaced with the codes. You want to do the
diffs before replacing everything with codes. Encoding this way is a
two-pass process, so it is easiest to work from an existing pack.

The indexing phase then constructs a bit vector for each word
representing all of the documents in the archive and whether they
contain the word or not. The vectors are then compressed using
something similar to zlib. To query, you and/or/not the word vectors
together to identify candidate documents. There are algorithms for
combining the compressed vectors without decompressing them.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 52+ messages in thread
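The scheme Jon describes is essentially classic dictionary coding. A toy sketch of the tokenize-and-replace step (this is not Clucene's actual API; codes here are simple frequency ranks rather than arithmetic codes, and whitespace between tokens is not preserved, which matters for exact reconstruction):

```python
from collections import Counter

def build_dictionary(docs):
    """Rank tokens by frequency so the most frequent get the smallest
    (cheapest) codes -- a stand-in for frequency-based coding."""
    freq = Counter(tok for doc in docs for tok in doc.split())
    return {tok: code for code, (tok, _) in enumerate(freq.most_common())}

def encode(doc, dictionary):
    # Replace each token string with its dictionary code.
    return [dictionary[tok] for tok in doc.split()]

def decode(codes, dictionary):
    rev = {code: tok for tok, code in dictionary.items()}
    # NB: original whitespace is lost -- only single spaces come back.
    return " ".join(rev[c] for c in codes)
```

The bit-vector indexing phase Jon mentions is a separate structure on top of the same dictionary and is omitted here.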
* Re: Mozilla .git tree 2006-08-30 2:58 ` Jon Smirl @ 2006-08-30 3:10 ` Shawn Pearce 2006-08-30 3:27 ` Jon Smirl 2006-08-30 5:53 ` Nicolas Pitre 0 siblings, 2 replies; 52+ messages in thread From: Shawn Pearce @ 2006-08-30 3:10 UTC (permalink / raw) To: Jon Smirl; +Cc: git Jon Smirl <jonsmirl@gmail.com> wrote: > sha1s are effectively 20 byte pointer addresses into the pack. With 2M > objects you can easily get away with 4 byte address and a mapping > table. Another idea would be to replace the 20 byte sha1 in tree > objects with 32b file offsets - requiring that anything the tree > refers to has to already be in the pack before the tree entry can be > written. I've thought of that, but when you transfer a "thin" pack over the wire the base object may not even be in the pack. Thus you can't use an offset to reference it. Otherwise there's probably little reason why the base couldn't be referenced by its 4 byte offset rather than its full 20 byte object ID. Added up over all deltas in the mozilla pack it saves a whopping 23 MiB. > The Mozilla license has changed at least five times. That makes 250K > copies of licenses. Cute. > I suspect a tree specific zlib dictionary will be a good win. But > those trees contain a lot of uncompressible data, the sha1. Those > sha1s are in binary not hex, right? Yup, binary. > The git tools can be modified to set the compression level to 0 before > compressing tree deltas. There is no need to change the decoding code. > Even with compression level 0 they still get slightly larger because > zlib tacks on a header. See my followup email to myself; I think we're talking a zlib overhead of 9.2 bytes on average per tree delta. That's with a compression level of -1 (default, which is 6). > I'm still interested in getting an idea of how much a Clucene type > dictionary compression would help. It is hard to see how you can get > smaller than that method. 
> Note that you don't want to include the
> indexing portion of Clucene in the comparison. Just the part where
> everything gets tokenized into a big dictionary, arithmetic encoded
> based on usage frequency, and then the strings in the orginal
> documents are replaced with the codes. You want to do the diffs before
> replacing everything with codes. Encoding this way is a two pass
> process so it is easiest to work from an existing pack.

From what I was able to gather I don't think Clucene stores the
documents themselves as the tokenized compressed data. Or if it
does, you lose everything between the tokens. There's a number of
things we want to preserve in the original "document", like
whitespace, that would likely be stripped when constructing tokens.

But it shouldn't be that difficult to produce a rough estimate of
what that storage size would be.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread
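Shawn's "whopping 23 MiB" for 4-byte base references checks out against the delta counts from his first message. A quick back-of-the-envelope (not anything git itself computes):

```python
MIB = 1024 * 1024

# Delta object counts from the Mozilla pack statistics.
deltas = {"commit": 3308, "tree": 976712, "blob": 579780, "tag": 353}

# Each delta's 20-byte base SHA-1 shrinks to a 4-byte offset:
# 16 bytes saved per delta.
saved_bytes = sum(deltas.values()) * (20 - 4)
saved_mib = saved_bytes / MIB
```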
* Re: Mozilla .git tree 2006-08-30 3:10 ` Shawn Pearce @ 2006-08-30 3:27 ` Jon Smirl 2006-08-30 5:53 ` Nicolas Pitre 1 sibling, 0 replies; 52+ messages in thread From: Jon Smirl @ 2006-08-30 3:27 UTC (permalink / raw) To: Shawn Pearce; +Cc: git On 8/29/06, Shawn Pearce <spearce@spearce.org> wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > sha1s are effectively 20 byte pointer addresses into the pack. With 2M > > objects you can easily get away with 4 byte address and a mapping > > table. Another idea would be to replace the 20 byte sha1 in tree > > objects with 32b file offsets - requiring that anything the tree > > refers to has to already be in the pack before the tree entry can be > > written. > > I've thought of that, but when you transfer a "thin" pack over the > wire the base object may not even be in the pack. Thus you can't > use an offset to reference it. Otherwise there's probably little > reason why the base couldn't be referenced by its 4 byte offset > rather than its full 20 byte object ID. Added up over all deltas > in the mozilla pack it saves a whopping 23 MiB. Every time an object goes on the wire these 'pack internal' optimizations need to be undone. If you are sending the whole pack everything can be sent as is. These intense compression schemes are meant for archival level data. Everybody should end up with a copy of the entire archive and that will be the end of those objects moving on the wire. > From what I was able to gather I don't think Clucene stores the > documents themselves as the tokenized compressed data. Or if it > does you lose everything between the tokens. There's a number of > things we want to preserve in the original "document" like whitespace > that would be likely stripped when constructing tokens. I can't remember if the Clucene code includes the ability to compress using the dictionary. I had thought that the code was in there but maybe not. Things that aren't in the dictionary use an escape code and are copied intact. 
I have the Lucene book on my desk, I'll flip through it and see what it says. Might be worthwhile to poke around on the net and see if you can come up with an existing dictionary based compressor. There has got to be one out there, this is a 30 year old concept. > But it shouldn't be that difficult to produce a rough estimate of > what that storage size would be. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-08-30 3:10 ` Shawn Pearce 2006-08-30 3:27 ` Jon Smirl @ 2006-08-30 5:53 ` Nicolas Pitre 2006-08-30 11:42 ` Junio C Hamano 1 sibling, 1 reply; 52+ messages in thread From: Nicolas Pitre @ 2006-08-30 5:53 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jon Smirl, git On Tue, 29 Aug 2006, Shawn Pearce wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > The git tools can be modified to set the compression level to 0 before > > compressing tree deltas. There is no need to change the decoding code. > > Even with compression level 0 they still get slightly larger because > > zlib tacks on a header. > > See my followup email to myself; I think we're talking a zlib > overhead of 9.2 bytes on average per tree delta. That's with a > compression level of -1 (default, which is 6). In fact, the bulk of a tree delta is most likely to contain the literal sha1 of one or more directory entries that changed, and this is hardly compressible. There is nothing to gain by forcing zlib level to 0 for tree deltas since it never makes the deflated stream smaller from the tests I've performed in the past. It seems that zlib is smart enough not to attempt any compression when there is no gain. That leaves the zlib header as the only overhead. And the zlib header contains a CRC which we're about to use for validating the data when doing delta data reuse in order to prevent pack corruption propagation like the one recently posted on the list. Without that a pack corruption (from a bad disk sector for example) is likely to go unnoticed when doing a repack. The data could be validated by expanding deltas and verifying the sha1 on the end result but this is a really expensive operation if performed on all deltas which is best left to git-fsck-objects --full. So I think the small overhead relative to total pack size might be worth it for better data integrity. Using an offset instead of a sha1 to reference a delta base object is certainly a good idea though. 
But I'd use the same variable encoding as the object size to avoid
the 32-bit limit issue. When generating a thin pack, the real sha1
of the delta object could be substituted for the offset quite easily
if the base object is not sent as part of the same pack.


Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread
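Nicolas's point about level 0 is easy to demonstrate with zlib directly: a stored (level-0) stream still pays for the 2-byte zlib header, the stored-block framing, and the 4-byte Adler-32 trailer (strictly, the Adler-32 checksum lives in the stream trailer rather than the header). A minimal illustration:

```python
import os
import zlib

payload = os.urandom(512)            # stand-in for an incompressible tree delta
stored = zlib.compress(payload, 0)   # level 0: stored blocks, no compression
default = zlib.compress(payload, 6)  # zlib's default effort level

# For a payload this small, level 0 adds a fixed 11 bytes:
# 2-byte header + 5-byte stored-block framing + 4-byte Adler-32.
overhead = len(stored) - len(payload)
```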
* Re: Mozilla .git tree 2006-08-30 5:53 ` Nicolas Pitre @ 2006-08-30 11:42 ` Junio C Hamano 2006-09-01 7:42 ` Junio C Hamano 2006-09-01 17:45 ` A Large Angry SCM 0 siblings, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2006-08-30 11:42 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git, Linus Torvalds Nicolas Pitre <nico@cam.org> writes: > And the zlib header contains a CRC which we're about to use for > validating the data when doing delta data reuse in order to prevent pack > corruption propagation like the one recently posted on the list. Ah, never thought of using the CRC directly. I was thinking about inflating into void and see if it succeeds, which as you say is perhaps quite expensive. This brings me back to my pet-peeve, though. I do not think zlib stream seeks back and leaves some clue at the beginning to tell me the deflated length, so it is quite hard to find where each deflated stream ends in a packfile cheaply. Loose objects (with new or legacy style header) are easy (st.st_size is available), but I do not think of a way short of building a reverse index of pack .idx file, which means I am already talking about not so cheap way X-<. It might be a reason to define a new .idx format. We could lift 32-bit offset limitation while we are at it. Each entry could have 20-byte hash, 64-bit offset into the corresponding .pack, and 32-bit deflated length (heh, why not make it 64-bit while we are at it). Luckily, .idx _is_ a local matter so we can even have a flag day and tell people to run the updated index-pack on existing packfiles to regenerate .idx. > Using an offset instead of a sha1 to reference a delta base object is > certainly a good idea though. But I'd use the same variable encoding as > the object size to avoid the 32-bit limit issue. When generating a thin > pack the real sha1 of the delta object could be substituted for the > offset quite easily if the base object is not sent a part of the same > pack. 
That sounds quite a reasonable suggestion. I love this kind of moment when I find us very fortunate to have bright people on the list ;-). ^ permalink raw reply [flat|nested] 52+ messages in thread
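The variable-length encoding Nicolas suggests (and Junio endorses) can be sketched as a base-128 varint with a continuation bit, in the spirit of the pack object-size encoding. The encoding git later adopted for offset deltas differs in detail (big-endian byte order with a bias on continuation bytes); this sketch only shows the general idea and how it lifts the 32-bit limit:

```python
def encode_varint(n):
    """Little-endian base-128: 7 payload bits per byte, high bit set
    on every byte except the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data):
    """Return (value, bytes consumed)."""
    n = shift = 0
    for i, b in enumerate(data):
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return n, i + 1
    raise ValueError("truncated varint")
```

Offsets below 128 cost a single byte, and an offset past the 4 GiB mark still fits in five bytes, so nothing like a flag day is needed for large packs on the encoding side.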
* Re: Mozilla .git tree 2006-08-30 11:42 ` Junio C Hamano @ 2006-09-01 7:42 ` Junio C Hamano 2006-09-02 1:19 ` Shawn Pearce 2006-09-02 2:04 ` Shawn Pearce 2006-09-01 17:45 ` A Large Angry SCM 1 sibling, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-01 7:42 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git, Linus Torvalds Junio C Hamano <junkio@cox.net> writes: > Nicolas Pitre <nico@cam.org> writes: > >> And the zlib header contains a CRC which we're about to use for >> validating the data when doing delta data reuse in order to prevent pack >> corruption propagation like the one recently posted on the list. > > Ah, never thought of using the CRC directly. I was thinking > about inflating into void and see if it succeeds, which as you > say is perhaps quite expensive. > > This brings me back to my pet-peeve, though. I do not think > zlib stream seeks back and leaves some clue at the beginning to > tell me the deflated length, so it is quite hard to find where > each deflated stream ends in a packfile cheaply. Loose objects > (with new or legacy style header) are easy (st.st_size is > available), but I do not think of a way short of building a > reverse index of pack .idx file, which means I am already > talking about not so cheap way X-<. > > It might be a reason to define a new .idx format. We could lift > 32-bit offset limitation while we are at it. Each entry could > have 20-byte hash, 64-bit offset into the corresponding .pack, > and 32-bit deflated length (heh, why not make it 64-bit while we > are at it). Luckily, .idx _is_ a local matter so we can even > have a flag day and tell people to run the updated index-pack on > existing packfiles to regenerate .idx. So I whipped up this as the zeroth step to propose a migration plan: Step 0. Prepare conversion and debugging tool. Step 1. Prepare sha1_file (consumer) side to understand the idx file in the new format. 
   I haven't decided if making the runtime understand both formats
   is feasible; I believe it is with what we currently do, but
   finding the end of each pack entry cheaply was one of the side
   effects I wanted to get from this conversion, whose primary
   purpose is to bust the 4GB ceiling.

Step 2. Update pack-objects and index-pack to emit the new format.

Step 3. Work on integrating partial mmap() support with Shawn.

   This is more or less orthogonal to the 4GB ceiling (people would
   hit the mmap() limit even with a 1.5GB pack), but I suspect it
   would be necessary to be able to tell where the end of each pack
   entry is cheaply to implement this.

This patch is step 0.

-- >8 --
Subject: [PATCH] convert-idx: prepare large pack idx support.

This adds git-convert-idx which converts between the traditional
32-bit offset pack .idx file and the new 64-bit offset pack .idx
file, and updates git-show-index to understand both formats.

Don't get too excited, not just yet. The programs pack-objects and
index-pack need to be updated to produce 64-bit pack .idx files, and
the core sha1_file routines also need to be told about the new
format.
Signed-off-by: Junio C Hamano <junkio@cox.net> --- Makefile | 1 builtin-convert-idx.c | 351 +++++++++++++++++++++++++++++++++++++++++++++++++ builtin.h | 1 git.c | 1 show-index.c | 36 ++++- 5 files changed, 384 insertions(+), 6 deletions(-) diff --git a/Makefile b/Makefile index 05bd77f..fccf747 100644 --- a/Makefile +++ b/Makefile @@ -260,6 +260,7 @@ BUILTIN_OBJS = \ builtin-check-ref-format.o \ builtin-commit-tree.o \ builtin-count-objects.o \ + builtin-convert-idx.o \ builtin-diff.o \ builtin-diff-files.o \ builtin-diff-index.o \ diff --git a/builtin-convert-idx.c b/builtin-convert-idx.c new file mode 100644 index 0000000..af111d2 --- /dev/null +++ b/builtin-convert-idx.c @@ -0,0 +1,351 @@ +#include "cache.h" + +/***************************************************************** + +The differences between the original and new style pack-*.idx files +are as follows: + + - A new style .idx file can record 64-bit offsets into corresponding + .pack file (an original .idx has 32-bit offsets hence limiting the + .pack file to 4GB). + + - A new style .idx file allows finding out the end of one entry in + the .pack (you need to find the object with next offset in an + original .idx file). + +Other than these major differences, the overall structure is quite +similar. + +A new style .idx file begins with a signature, "\377tOc", and a +version number as a 4-byte network byte order integer. The version +of this implementation is 2. + +The location of the signature used to hold the number of objects in +the pack whose object name start with byte 0x00 as a 4-byte integer in +the network byte order. The original .idx format allowed only 32-bit +offset into the pack, and data for 0xFF744F63 objects would certainly +exceed 4GB (because an object needs at least 1-byte for type and its +deflated length, and deflated data will be 1 or more bytes), so the +older git would not mistake the new style .idx file as something +valid. 
+ +Following the signature and the version number is a header, which +consists of 256 4-byte network byte order integers. N-th entry of +this table records the number of objects in the corresponding pack, +the first byte of whose object name are smaller than N. This means a +.pack file is still limited to maximum 4 billion objects. + +The header is followed by the main toc table, which is a list of +N-byte entries, one entry per object in the pack, sorted by the object +name. Each entry is: + + . 8-byte network byte order integer, recording where the object is + stored in the packfile as the offset from the beginning. + + . 4-byte network byte order integer, recording the location of the + next object in the main toc table. + + . 20-byte object name. + +The file is concluded with a trailer: + + . A copy of the 20-byte SHA-1 checksum at the end of the + corresponding packfile. + + . 20-byte SHA-1 checksum of all of the above. + +******************************************************************/ + +#define PACK_IDX_SIGNATURE 0xFF744F63 /* "\377tOc" */ +#define PACK_IDX_VERSION 2 + +static unsigned get4(const void *buf, unsigned long offset) +{ + const unsigned char *p = ((const unsigned char *)buf) + offset; + + return (p[3] | (p[2]<<8) | (p[1]<<16) | (p[0]<<24)); +} + +static unsigned long long get8(const void *buf, unsigned long offset) +{ + unsigned long long n, m; + n = get4(buf, offset); + m = get4(buf, offset + 4); + return (n << 32) | m; +} + +static void put4(void *buf, unsigned long offset, unsigned data) +{ + unsigned char *p = ((unsigned char *)buf) + offset; + + p[3] = data; data >>= 8; + p[2] = data; data >>= 8; + p[1] = data; data >>= 8; + p[0] = data; +} + +static void put8(void *buf, unsigned long offset, unsigned long long data) +{ + put4(buf, offset, data >> 32); + put4(buf, offset + 4, data); +} + +static int write_out(const char *path, void *data, unsigned long sz) +{ + char newpath[PATH_MAX]; + int fd; + + if (PATH_MAX < strlen(path) + 10) + 
return error("%s: pathname too long", path); + sprintf(newpath, "%s_XXXXXX", path); + fd = mkstemp(newpath); + if (fd < 0) + return error("%s: unable to create temporary file: %s", + newpath, strerror(errno)); + write_or_die(fd, data, sz); + return rename(newpath, path); +} + +static int downgrade_idx(const char *path, const unsigned char *data, unsigned long sz) +{ + unsigned i, num = 0; + unsigned osz; + unsigned char *odata, *otable; + const unsigned char *table; + unsigned char hash[20]; + SHA_CTX ctx; + + if (get4(data, 0) != PACK_IDX_SIGNATURE) + return error("%s is not a new style index", path); + if (get4(data, 4) != 2) + return error("%s: version %u of new style index not supported", + path, get4(data, 4)); + + /* Validate the fan-out and find out the number of objects */ + for (i = 0; i < 256; i++) { + unsigned n = get4(data + 8, i * 4); + if (n < num) + return error("%s: corrupt old index file", path); + num = n; + } + + /* Make sure the sz is correct */ + if (sz != 8 + 4 * 256 + 32 * num + 40) + return error("%s: new index file with incorrect size", path); + + /* Make sure the hash matches */ + SHA1_Init(&ctx); + SHA1_Update(&ctx, data, sz - 20); + SHA1_Final(hash, &ctx); + if (memcmp(hash, data + sz - 20, 20)) + return error("%s: new index file with incorrect hash", path); + + osz = 4 * 256 + 24 * num + 40; + odata = xcalloc(1, osz); + + /* Convert the fan-out */ + for (i = 0; i < 256; i++) + put4(odata, i * 4, get4(data + 8, i * 4)); + + /* Make them point at the main table */ + table = data + 4 * 256 + 8; + otable = odata + 4 * 256; + + /* Convert entries, while validating that they are sorted */ + for (i = 0; i < num; i++) { + unsigned long long ofs = get8(table, 32 * i); + unsigned const char *hash1, *hash2; + + if (ofs & (0xFFFFFFFFLL << 32)) + return error("%s: the packfile is too big to be " + "indexed with an old index", path); + put4(otable + 24 * i, 0, ofs); + memcpy(otable + 24 * i + 4, table + 32 * i + 12, 20); + if (!i) + continue; + hash1 
= table + 32 * (i-1) + 12; + hash2 = hash1 + 32; + if (0 <= memcmp(hash1, hash2, 20)) + return error("%s: new index file with unsorted table", + path); + if (*hash1 != *hash2) { + /* fan-out of hash1[0] should point here */ + if (get4(data + 8, *hash1 * 4) != i) + return error("%s: old index file with " + "invalid fan-out", path); + } + } + + /* Copy the pack SHA-1 checksum */ + memcpy(odata + osz - 40, data + sz - 40, 20); + + /* Compute the overall SHA-1 checksum */ + SHA1_Init(&ctx); + SHA1_Update(&ctx, odata, osz - 20); + SHA1_Final(odata + osz - 20, &ctx); + + /* Write it out */ + return write_out(path, odata, osz); +} + +static int map_compare(const void *a_, const void *b_) +{ + /* These point at new style table entry whose first element is + * 8-byte network byte order offset. Compare them + */ + const char *a = *((const char **)a_); + const char *b = *((const char **)b_); + unsigned long long aofs, bofs; + + aofs = get8(a, 0); + bofs = get8(b, 0); + if (aofs < bofs) + return -1; + if (aofs == bofs) + return 0; + return 1; +} + +static int upgrade_idx(const char *path, const unsigned char *data, unsigned long sz) +{ + const unsigned char *table; + unsigned char *ndata, *ntable, **map; + unsigned i, num = 0; + unsigned nsz; + unsigned prev_ix; + unsigned char hash[20]; + SHA_CTX ctx; + + if (get4(data, 0) == PACK_IDX_SIGNATURE) + return error("%s is a new style index", path); + + /* Validate the fan-out and find out the number of objects */ + for (i = 0; i < 256; i++) { + unsigned n = get4(data, i * 4); + if (n < num) + return error("%s: corrupt old index file", path); + num = n; + } + + /* Make sure the sz is correct */ + if (sz != 4 * 256 + 24 * num + 40) + return error("%s: old index file with incorrect size", path); + + /* Make sure the hash matches */ + SHA1_Init(&ctx); + SHA1_Update(&ctx, data, sz - 20); + SHA1_Final(hash, &ctx); + if (memcmp(hash, data + sz - 20, 20)) + return error("%s: old index file with incorrect hash", path); + + nsz = 8 + 4 * 
256 + 32 * num + 40; + ndata = xcalloc(1, nsz); + map = xcalloc(sizeof(*map), num); + + put4(ndata, 0, PACK_IDX_SIGNATURE); + put4(ndata, 4, PACK_IDX_VERSION); + + /* Convert the fan-out */ + for (i = 0; i < 256; i++) + put4(ndata, 8 + i * 4, get4(data, i * 4)); + + /* Make them point at the main table */ + table = data + 4 * 256; + ntable = ndata + 8 + 4 * 256; + + /* Convert entries, while validating that they are sorted */ + for (i = 0; i < num; i++) { + unsigned ofs = get4(table, 24 * i); + unsigned const char *hash1, *hash2; + map[i] = ntable + 32 * i; + put8(ntable + 32 * i, 0, ofs); + memcpy(ntable + 32 * i + 12, table + 24 * i + 4, 20); + if (!i) + continue; + hash1 = table + 24 * (i-1) + 4; + hash2 = hash1 + 24; + if (0 <= memcmp(hash1, hash2, 20)) + return error("%s: old index file with unsorted table", + path); + if (*hash1 != *hash2) { + /* fan-out of hash1[0] should point here */ + if (get4(data, *hash1 * 4) != i) + return error("%s: old index file with " + "invalid fan-out", path); + } + } + + /* Using the reverse map, fill in the next object number */ + qsort(map, num, sizeof(*map), map_compare); + prev_ix = (map[0] - ntable) / 32; + for (i = 1; i < num; i++) { + /* map is an array of main table entry index, + * sorted by the offset of the pack data. 
+ */ + unsigned next_ix = (map[i] - ntable) / 32; + put4(ntable + 32 * prev_ix, 8, next_ix); + prev_ix = next_ix; + } + put4(map[num-1], 8, num); /* Mark the last entry with num */ + + /* Copy the pack SHA-1 checksum */ + memcpy(ndata + nsz - 40, data + sz - 40, 20); + + /* Compute the overall SHA-1 checksum */ + SHA1_Init(&ctx); + SHA1_Update(&ctx, ndata, nsz - 20); + SHA1_Final(ndata + nsz - 20, &ctx); + + /* Write it out */ + return write_out(path, ndata, nsz); +} + +static int convert_one(const char *path, int downgrade) +{ + int fd, err; + struct stat st; + void *map; + + fd = open(path, O_RDONLY); + if (fd < 0) + return error("%s: cannot open: %s", path, strerror(errno)); + if (fstat(fd, &st)) { + err = errno; + close(fd); + return error("%s: cannot stat: %s", path, strerror(err)); + } + map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0); + close(fd); + err = downgrade + ? downgrade_idx(path, map, st.st_size) + : upgrade_idx(path, map, st.st_size); + munmap(map, st.st_size); + return err; +} + +static const char convert_idx_usage[] = +"git-convert-idx [-d] <file>"; + +int cmd_convert_idx(int ac, const char **av, const char *prefix) +{ + int i, downgrade = 0, status = 0; + + for (i = 1; i < ac; i++) { + const char *arg = av[i]; + if (arg[0] != '-') + break; + if (!strcmp(arg, "--")) { + i++; + break; + } + if (!strcmp(arg, "-d") || !strcmp(arg, "--downgrade")) { + downgrade = 1; + continue; + } + usage(convert_idx_usage); + } + + for ( ; i < ac; i++) + status |= convert_one(av[i], downgrade); + return status; +} diff --git a/builtin.h b/builtin.h index 25431d7..e282cb8 100644 --- a/builtin.h +++ b/builtin.h @@ -19,6 +19,7 @@ extern int cmd_cat_file(int argc, const extern int cmd_checkout_index(int argc, const char **argv, const char *prefix); extern int cmd_check_ref_format(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, const char *prefix); +extern int cmd_convert_idx(int ac, const char **av, 
const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); extern int cmd_diff_files(int argc, const char **argv, const char *prefix); extern int cmd_diff_index(int argc, const char **argv, const char *prefix); diff --git a/git.c b/git.c index 05871ad..22dbad8 100644 --- a/git.c +++ b/git.c @@ -230,6 +230,7 @@ static void handle_internal_command(int { "checkout-index", cmd_checkout_index, RUN_SETUP }, { "check-ref-format", cmd_check_ref_format }, { "commit-tree", cmd_commit_tree, RUN_SETUP }, + { "convert-idx", cmd_convert_idx }, { "count-objects", cmd_count_objects }, { "diff", cmd_diff, RUN_SETUP }, { "diff-files", cmd_diff_files, RUN_SETUP }, diff --git a/show-index.c b/show-index.c index c21d660..a00b691 100644 --- a/show-index.c +++ b/show-index.c @@ -1,14 +1,28 @@ #include "cache.h" +#define PACK_IDX_SIGNATURE 0xFF744F63 /* "\377tOc" */ +#define PACK_IDX_VERSION 2 + int main(int argc, char **argv) { int i; unsigned nr; - unsigned int entry[6]; + unsigned int entry[8]; static unsigned int top_index[256]; + int new_style = 0; if (fread(top_index, sizeof(top_index), 1, stdin) != 1) - die("unable to read idex"); + die("unable to read index"); + + if (ntohl(top_index[0]) == PACK_IDX_SIGNATURE) { + fprintf(stderr, "new style idx version %u\n", + ntohl(top_index[1])); + memmove(top_index, top_index + 2, 4 * 254); + if (fread(top_index + 254, sizeof(unsigned), 2, stdin) != 2) + die("unable to read index"); + new_style = 1; + } + nr = 0; for (i = 0; i < 256; i++) { unsigned n = ntohl(top_index[i]); @@ -16,13 +30,23 @@ int main(int argc, char **argv) die("corrupt index file"); nr = n; } + for (i = 0; i < nr; i++) { - unsigned offset; + unsigned long long offset; - if (fread(entry, 24, 1, stdin) != 1) + if (fread(entry, + new_style ? 
32 : 24, + 1, stdin) != 1) die("unable to read entry %u/%u", i, nr); - offset = ntohl(entry[0]); - printf("%u %s\n", offset, sha1_to_hex((void *)(entry+1))); + if (new_style) { + offset = ntohl(entry[0]); + offset = (offset << 32) | ntohl(entry[1]); + } + else { + offset = ntohl(entry[0]); + } + printf("%llu %s\n", offset, + sha1_to_hex((void *)(entry + 1 + new_style * 2))); } return 0; } -- 1.4.2.gc9dce ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-01 7:42 ` Junio C Hamano @ 2006-09-02 1:19 ` Shawn Pearce 2006-09-02 4:01 ` Junio C Hamano 2006-09-02 2:04 ` Shawn Pearce 1 sibling, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 1:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git, Linus Torvalds Junio C Hamano <junkio@cox.net> wrote: > Step 3. Work on integrating partial mmap() support with Shawn. > This is more or less orthogonal to 4GB ceiling (people > would hit mmap() limit even with a 1.5GB pack), but I > suspect it would be necessary to be able to tell where > the end of each pack entry is cheaply to implement > this. I was just getting ready to move my partial mmap support over from fast-import, although I did the implementation a little differently in fast-import than what I think I'll do in core Git. In fast-import I store a hashtable in memory of all objects in the pack, but I chose not to store the ending offset (or compressed length) and instead just guess about where the object ends. I did that to save 4 bytes of memory per object. :-) It's necessary to know where the object ends to ensure that your current mapping (or any remapping you are about to do) covers the entire object before you start inflating. Otherwise you might have to remap the pack in the middle of the inflate operation. (Of course you might need to do this anyway if the compressed object is larger than your default mapping unit.) What I did in fast-import was give inflate whatever was left in the current mapping; then if I got a Z_OK or Z_BUF_ERROR back from inflate I move the mapping to the next 128 MiB chunk and reset my z_stream's next_in/avail_in accordingly, then recall inflate. No, I didn't performance test it to see how frequently I'm mapping a pack multiple times to get one object.
But I'm going to stick my neck out and say that most objects probably don't have a compressed length exceeding 128 MiB so we're talking one remap that we would have had to do anyway if the object spanned over the end of the current mapping. If the object's starting offset was completely outside of the current mapping then I rounded the offset down to the page size (from getpagesize) and remapped; therefore we also probably only do one remap on objects needing it. But having the length or ending offset in the index will help with copying the object during a repack as well as prevent us from needing to guess during accesses. So good news indeed that you are adding it to the index. -- Shawn. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
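Shawn's sliding-window loop can be sketched in miniature. This is a hypothetical illustration, not the fast-import code: the "pack" is a plain byte array, `WINDOW_SIZE` stands in for the 128 MiB mapping unit, `remap()` models re-running mmap() on the next chunk (rounding down to the page size, as described above), and the `memcpy()` stands in for an inflate() call that is retried after the window slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_SIZE 8  /* stands in for the 128 MiB mapping unit */

struct window {
	const unsigned char *pack;  /* the whole "file" (a byte array here) */
	size_t base;                /* offset where the current mapping starts */
};

/* Model of re-mmap(): slide the mapping so that `ofs` falls inside it,
 * rounding the offset down to the "page" size. */
static void remap(struct window *w, size_t ofs)
{
	w->base = ofs - (ofs % WINDOW_SIZE);
}

/* Copy `len` bytes starting at pack offset `ofs`.  The memcpy() stands
 * in for feeding inflate(): consume what the current mapping offers,
 * and when it is exhausted (where inflate would report Z_OK or
 * Z_BUF_ERROR with avail_in == 0), remap and continue where we left off. */
static void read_spanning(struct window *w, size_t ofs,
                          unsigned char *out, size_t len)
{
	if (ofs < w->base || ofs >= w->base + WINDOW_SIZE)
		remap(w, ofs);
	while (len) {
		size_t avail = w->base + WINDOW_SIZE - ofs;
		size_t n = len < avail ? len : avail;
		memcpy(out, w->pack + ofs, n);
		out += n;
		ofs += n;
		len -= n;
		if (len)
			remap(w, ofs);  /* slide to the next chunk */
	}
}
```

An object shorter than the mapping unit crosses at most one window boundary, which is the point Shawn makes above: usually a single remap at worst.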
* Re: Mozilla .git tree 2006-09-02 1:19 ` Shawn Pearce @ 2006-09-02 4:01 ` Junio C Hamano 2006-09-02 4:39 ` Shawn Pearce 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 4:01 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > What I did in fast-import was give inflate whatever was left in > the current mapping; then if I got a Z_OK or Z_BUF_ERROR back from > inflate I move the mapping to the next 128 MiB chunk and reset my > z_stream's next_in/avail_in accordingly, then recall inflate. Makes sense. > But having the length or ending offset in the index will help with > copying the object during a repack as well as prevent us from needing > to guess during accesses. Actually pack-objects already builds the reverse index without help from the updated .idx file format, so while having it pre-built in .idx may help, it is not necessary for it. With your "feed until the end of the current window and if we need more map in the next window" logic, we do not even need to know the length of each entry at runtime either. So I am inclined to chuck the previous patch that records the next object number in each entry. We could keep the 64-bit offset, which would make each entry 28 bytes instead of 32. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
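The reverse index Junio mentions can be sketched as follows (hypothetical names, not the actual pack-objects code): sort the .idx entries by pack offset, and the on-disk size of any entry falls out as the distance to the next offset, or to the pack trailer for the last object.

```c
#include <assert.h>
#include <stdlib.h>

struct rev_ent {
	unsigned long ofs;  /* offset of the entry in the pack */
	unsigned nr;        /* its position in the .idx (sha1 order) */
};

static int by_ofs(const void *a, const void *b)
{
	unsigned long ao = ((const struct rev_ent *)a)->ofs;
	unsigned long bo = ((const struct rev_ent *)b)->ofs;
	return ao < bo ? -1 : ao > bo;
}

/* Sort the entries by pack offset; the on-disk size of object `nr` is
 * then the distance to the next offset (or, for the object stored last,
 * to `pack_tail`, where the trailing checksum begins). */
static unsigned long entry_size(struct rev_ent *rev, unsigned num,
                                unsigned long pack_tail, unsigned nr)
{
	unsigned i;
	qsort(rev, num, sizeof(*rev), by_ofs);
	for (i = 0; i < num; i++)
		if (rev[i].nr == nr) {
			unsigned long end =
				(i + 1 < num) ? rev[i + 1].ofs : pack_tail;
			return end - rev[i].ofs;
		}
	return 0;  /* no such object */
}
```

Because this can be rebuilt from the offsets alone, storing a next-object pointer in every .idx entry buys little, which is the argument for dropping it.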
* Re: Mozilla .git tree 2006-09-02 4:01 ` Junio C Hamano @ 2006-09-02 4:39 ` Shawn Pearce 2006-09-02 11:06 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 4:39 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > So I am inclined to chuck the previous patch that records the > next object number in each entry. We could keep the 64-bit > offset, which would make each entry 28 bytes instead of > 32. I can agree with that. :-) Repacking is infrequent compared to searching for an object. Asking the repack code to determine object lengths (especially if it is going to inflate the entry anyway to verify it) isn't that much of a performance hit. As you pointed out, it's already happening today. Using a 28 byte index entry instead of a 32 byte index entry means the Mozilla historical pack index will only be 52.4 MiB rather than the slightly larger 59.9 MiB. 1.9x10^6 objects will do that to you. -- Shawn. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 4:39 ` Shawn Pearce @ 2006-09-02 11:06 ` Junio C Hamano 2006-09-02 14:20 ` Jon Smirl 2006-09-02 17:44 ` Shawn Pearce 0 siblings, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 11:06 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: > Using a 28 byte index entry instead of a 32 byte index entry means > the Mozilla historical pack index will only be 52.4 MiB rather than > the slightly larger 59.9 MiB. Yup, that was one consideration. One small worry is we will be placing u64 at 4-byte alignment by using 28-byte entries but I think that is Ok. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 11:06 ` Junio C Hamano @ 2006-09-02 14:20 ` Jon Smirl 2006-09-02 17:39 ` Shawn Pearce 2006-09-02 17:44 ` Shawn Pearce 1 sibling, 1 reply; 52+ messages in thread From: Jon Smirl @ 2006-09-02 14:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, git On 9/2/06, Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > > Using a 28 byte index entry instead of a 32 byte index entry means > > the Mozilla historical pack index will only be 52.4 MiB rather than > > the slightly larger 59.9 MiB. > > Yup, that was one consideration. One small worry is we will be > placing u64 at 4-byte alignment by using 28-byte entries but I > think that is Ok. If you're going to redo the pack formats, another big win for the Mozilla pack is to convert pack internal sha1 references into file offsets within the pack. Doing that will take around 30MB off the Mozilla pack size. sha1's are not compressible, so this is a direct savings. This might reduce memory usage too. The index is only needed to get the initial object from the pack. Since index use is lighter it could just be open/closed when needed. You could also introduce a zlib dictionary object into the format and just leave it empty for now. -- Jon Smirl jonsmirl@gmail.com -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 14:20 ` Jon Smirl @ 2006-09-02 17:39 ` Shawn Pearce 2006-09-02 18:56 ` Linus Torvalds 0 siblings, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 17:39 UTC (permalink / raw) To: Jon Smirl; +Cc: Junio C Hamano, git Jon Smirl <jonsmirl@gmail.com> wrote: > If you're going to redo the pack formats, another big win for the > Mozilla pack is to convert pack internal sha1 references into file > offsets within the pack. Doing that will take around 30MB off the > Mozilla pack size. sha1's are not compressible, so this is a direct > savings. Right now Junio's working on the index to break the 4 GiB barrier. I think Junio and Nico have already agreed to change the base SHA1 to be an offset instead, though this is an issue for the current way the base gets written out behind the delta, as you need to know exactly how many bytes the delta is going to be so you can correctly compute the offset. > This might reduce memory usage too. The index is only needed to get > the initial object from the pack. Since index use is lighter it could > just be open/closed when needed. True; however when you are walking a series of commits (to produce output for `git log` for example) every time you parse a commit you need to go back to the .idx to relookup the ancestor commit(s). So you don't want to open/close the .idx file on every object; instead put the .idx file into the LRU like the .pack files are (or into their own LRU chain) and maintain some threshold on how many bytes worth of .idx is kept live. > You could also introduce a zlib dictionary object into the format and > just leave it empty for now. No. I'm not sure I'm ready to propose that as a solution for decreasing pack size. Now that my exams are over I've started working on a true dictionary based compression implementation. I want to try to get Git itself repacked under it, then try the Mozilla pack after I get my new amd64 based system built.
If that's as big of a space saver as we're hoping it would be, then the pack format would be radically different and need to change; if it doesn't gain us anything (or is worse!) then we can go back to the drawing board and consider other pack format changes such as a zlib dictionary. But right now its measly 4% gain isn't very much. -- Shawn. -- VGER BF report: U 0.653439 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 17:39 ` Shawn Pearce @ 2006-09-02 18:56 ` Linus Torvalds 2006-09-02 20:53 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Linus Torvalds @ 2006-09-02 18:56 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jon Smirl, Junio C Hamano, git On Sat, 2 Sep 2006, Shawn Pearce wrote: > Jon Smirl <jonsmirl@gmail.com> wrote: > > If you're going to redo the pack formats, another big win for the > > Mozilla pack is to convert pack internal sha1 references into file > > offsets within the pack. Doing that will take around 30MB off the > > Mozilla pack size. sha1's are not compressible, so this is a direct > > savings. > > Right now Junio's working on the index to break the 4 GiB barrier. > I think Junio and Nico have already agreed to change the base SHA1 > to be an offset instead; though this is an issue for the current > way the base gets written out behind the delta as you need to know > exactly how many bytes the delta is going to be so you can correctly > compute the offset. I think moving the delta to after the base object is fine, and solves that problem. However, the "--thin" pack problem is potentially worse, since the object you delta against doesn't even _exist_, and as such you'd then end up having two totally different formats (or perhaps we'd add a new object type for use in thin packs: a "dummy object" that says "my type is <so-and-so> and my SHA1 is <so-and-so>, you'd better have me already"). I'm not actually personally convinced that we need to solve the 4GB pack-file problem right now. Does anybody have that problem yet? The "partial mapping" issue is a much bigger one, I suspect, even if the tree delta stuff seems to have fixed it largely (yes?) for the Mozilla fast-import pack-file. So in many ways, I'd think that the 4GB problem is one where we are perfectly happy to _know_ that a solution exists, rather than have to actually solve it today.
Of course, if we change the pack/index-file format for other reasons, then we should obviously fix the 4G issue at the same time, but were there actually any real other reasons at this point? Linus -- VGER BF report: U 0.500001 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 18:56 ` Linus Torvalds @ 2006-09-02 20:53 ` Junio C Hamano 0 siblings, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 20:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: Shawn Pearce, Jon Smirl, git Linus Torvalds <torvalds@osdl.org> writes: > On Sat, 2 Sep 2006, Shawn Pearce wrote: > >> I think Junio and Nico have already agreed to change the base SHA1 >> to be an offset instead; I am not in favor of that, at least not yet, especially since I haven't thought about the implications of that on a streaming unpacker. I do not have much objection about changing the base-delta order, though. > Of course, if we change the pack/index-file format for other reasons, then > we should obviously fix the 4G issue at the same time, but were there > actually any real other reasons at this point? Not really, from my point of view. Probably 70% of the motivation for the new \377tOc index format originally came from wanting to have "where is the end of this object", which turned out to be unneeded; and by making repack easier to use I think we do not have to solve the 4G issue either. There is only one case that might need a single pack that can be indexed with a 4G offset: initial clone or repack of a huge repository. -- VGER BF report: U 0.788936 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 11:06 ` Junio C Hamano 2006-09-02 14:20 ` Jon Smirl @ 2006-09-02 17:44 ` Shawn Pearce 1 sibling, 0 replies; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 17:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > > Using a 28 byte index entry instead of a 32 byte index entry means > > the Mozilla historical pack index will only be 52.4 MiB rather than > > the slightly larger 59.9 MiB. > > Yup, that was one consideration. One small worry is we will be > placing u64 at 4-byte alignment by using 28-byte entries but I > think that is Ok. On some systems (SPARC?) that's definitely an issue. Other systems (x86) obviously don't have that problem. If you code the decode function to treat that 64 bit value as two 32 bit values in network byte order and combine them after ntohl(), then this shouldn't be a problem. Though big-endian systems which can take a u64 on a 4-byte alignment just fine would pay a small performance penalty that they don't need to pay. -- Shawn. -- VGER BF report: S 0.998672 ^ permalink raw reply [flat|nested] 52+ messages in thread
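The decode function Shawn describes might look like this sketch (`get8_nbo` is a hypothetical name, not an actual Git function): read the 64-bit offset as two network byte order 32-bit halves. Using memcpy() for the 4-byte loads keeps even those accesses legal on strict-alignment CPUs such as SPARC.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical sketch: decode a 64-bit pack offset stored as two
 * 4-byte network byte order words at a possibly 4-byte-aligned
 * address, without ever dereferencing an unaligned uint64_t. */
static uint64_t get8_nbo(const unsigned char *p)
{
	uint32_t hi, lo;
	memcpy(&hi, p, 4);      /* alignment-safe 4-byte loads */
	memcpy(&lo, p + 4, 4);
	return ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
}
```

On big-endian hosts ntohl() is a no-op, so the only cost there is the shift-and-or, which is the small penalty mentioned above.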
* Re: Mozilla .git tree 2006-09-01 7:42 ` Junio C Hamano 2006-09-02 1:19 ` Shawn Pearce @ 2006-09-02 2:04 ` Shawn Pearce 2006-09-02 11:02 ` Junio C Hamano 1 sibling, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 2:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git, Linus Torvalds Junio C Hamano <junkio@cox.net> wrote: [snip] > +A new style .idx file begins with a signature, "\377tOc", and a > +version number as a 4-byte network byte order integer. The version > +of this implementation is 2. Ick. I understand why you did this (and thanks for such a good explanation of it by the way) but what a horrible signature number and way to wedge in a version number. From now on I propose that if we write a file - especially a binary file - to disk that we always include a 4 byte version number in the header of the file. Then we can avoid trying to wedge this into the file after the fact when some worthwhile improvement requires changing the format. I think we probably should have done this when the binary headers were introduced into loose objects. Would 4 bytes of file format version number (an initial version number which wasn't 0x78 in byte 0 and failed the % 31 test) really have hurt that much in a loose object? [snip] > + . 4-byte network byte order integer, recording the location of the > + next object in the main toc table. Why not just the 4 byte object entry length? To load an object we have to go find the next object in the idx file so we can compute the offset difference. On very large packs (e.g. the Mozilla pack) the index is 46 MiB. The jump across the index could be the entire thing from back to front just to compute the size of an object, even though the fan-out table and the binary search process really only poked the tail end of the index when searching for the entry.
So we're demand paging in the front of the index just to compute a length. Sure the scheme you outlined allows a 64 bit difference but uncompressed objects already can't be larger than 2**32-1, and we could just as easily move that limit down to, say, 2**32-16 to leave room for the object header and zlib header. -- Shawn. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
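The fan-out plus binary search behavior Shawn refers to can be sketched as follows. This is a hypothetical miniature, not the Git lookup code: hashes are shortened to 4 bytes, and `fanout[b]` holds the number of objects whose first hash byte is <= `b`, so the search only ever touches the slice of the index for that first byte.

```c
#include <assert.h>
#include <string.h>

#define HASH_SZ 4  /* shortened from 20 bytes to keep the sketch small */

/* Look up `want` in a sorted table of HASH_SZ-byte hashes, narrowing
 * the binary search with the 256-entry fan-out first.  Returns the
 * entry's position, or -1 if the object is not in this pack. */
static int find_object(const unsigned fanout[256],
                       const unsigned char *table,
                       const unsigned char *want)
{
	unsigned lo = want[0] ? fanout[want[0] - 1] : 0;
	unsigned hi = fanout[want[0]];
	while (lo < hi) {
		unsigned mi = lo + (hi - lo) / 2;
		int cmp = memcmp(table + mi * HASH_SZ, want, HASH_SZ);
		if (!cmp)
			return (int)mi;
		if (cmp < 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1;
}
```

Nothing in this lookup ever reads the front of the index for an object whose hash sorts near the end, which is why chasing a next-object pointer (to compute a length) would fault in pages the search itself never needed.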
* Re: Mozilla .git tree 2006-09-02 2:04 ` Shawn Pearce @ 2006-09-02 11:02 ` Junio C Hamano 2006-09-02 17:51 ` Shawn Pearce 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 11:02 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git, Linus Torvalds Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: > [snip] >> +A new style .idx file begins with a signature, "\377tOc", and a >> +version number as a 4-byte network byte order integer. The version >> +of this implementation is 2. > > Ick. I understand why you did this (and thanks for such a good > explanation of it by the way) but what a horrible signature number > and way to wedge in a version number. I do not think it is particularly horrible. I initially was planning to introduce a new file extension .toc (table of contents), not just because that would let us start afresh with a file format that has a version number from the beginning, but also because we use the word "index" to mean "cache" these days, and it would be a good idea to use a different word to mean different things. The latter 3 bytes reflect that. > I think we probably should have done this when the binary headers > were introduced into loose objects. No. That was purely .pack format and did not affect .idx format. Honestly .idx is purely a local matter and not as important to keep stable as the .pack format. > Sure the scheme you outlined allows a 64 bit difference but > uncompressed objects already can't be larger than 2**32-1... Where do we have that limitation? In any case, in the next round I am planning to get rid of this "where the tail is" business. I do not think it buys us much and it inflates the 46MB index you need to deal with even more. I think your "allow zlib to eat the remainder of the current window and slide window when it exhausts the current window" logic is a very good one and makes it unnecessary to know the tail of each object.
-- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 11:02 ` Junio C Hamano @ 2006-09-02 17:51 ` Shawn Pearce 2006-09-02 20:55 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 17:51 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > > I think we probably should have done this when the binary headers > > were introduced into loose objects. > > No. That was purely .pack format and did not affect .idx > format. Honestly .idx is purely a local matter and not as > important to keep stable as the .pack format. I think you missed the point of my comment. I apologize for it not being clear the first time. The loose objects have format #1 (legacy text format compressed as one big zlib stream) and now format #2 (pack object header followed by data-only zlib stream). In looking at this dictionary packing code that I'm working on right now I may be introducing format #3 (dictionary based compression rather than zlib compression). We have no easy way to mark the version number of the loose objects. I'm sure we can shoehorn in something though. > > Sure the scheme you outlined allows a 64 bit difference but > > uncompressed objects already can't be larger than 2**32-1... > > Where do we have that limitation? sha1_file.c uses "unsigned long" a lot to deal with the size of an object, deflated. IIRC unsigned long is only 32 bits even in many 64 bit architectures. Or am I wrong? > I think your "allow zlib to eat the remainder of the current > window and slide window when it exhausts the current window" > logic is a very good one and makes it unnecessary to know the > tail of each object. I'll send patches to sha1_file.c probably later tonight. My last round of offset changes was the start of that; this next round should finish it. It's mostly a copy-n-paste from fast-import.c, so it should go quickly. -- Shawn. -- VGER BF report: U 0.960524 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 17:51 ` Shawn Pearce @ 2006-09-02 20:55 ` Junio C Hamano 2006-09-03 3:54 ` Shawn Pearce 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 20:55 UTC (permalink / raw) To: Shawn Pearce; +Cc: git Shawn Pearce <spearce@spearce.org> writes: >> > Sure the scheme you outlined allows a 64 bit difference but >> > uncompressed objects already can't be larger than 2**32-1... >> >> Where do we have that limitation? > > sha1_file.c uses "unsigned long" a lot to deal with size of an > object, deflated. iirc unsigned long is only 32 bits even in many > 64 bit architectures. Or am I wrong? Of course 4G .idx (\377tOc) patch would update them to u64. What is the problem? -- VGER BF report: U 0.905241 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-02 20:55 ` Junio C Hamano @ 2006-09-03 3:54 ` Shawn Pearce 0 siblings, 0 replies; 52+ messages in thread From: Shawn Pearce @ 2006-09-03 3:54 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > >> > Sure the scheme you outlined allows a 64 bit difference but > >> > uncompressed objects already can't be larger than 2**32-1... > >> > >> Where do we have that limitation? > > > > sha1_file.c uses "unsigned long" a lot to deal with size of an > > object, deflated. iirc unsigned long is only 32 bits even in many > > 64 bit architectures. Or am I wrong? > > Of course 4G .idx (\377tOc) patch would update them to u64. > What is the problem? None then. I must have missed that. Besides, this issue is moot as you are removing it from the .idx. Sorry for the noise. :-) -- Shawn. -- VGER BF report: S 0.999999 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-08-30 11:42 ` Junio C Hamano 2006-09-01 7:42 ` Junio C Hamano @ 2006-09-01 17:45 ` A Large Angry SCM 2006-09-01 18:35 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: A Large Angry SCM @ 2006-09-01 17:45 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git, Linus Torvalds Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > >> And the zlib header contains a CRC which we're about to use for >> validating the data when doing delta data reuse in order to prevent pack >> corruption propagation like the one recently posted on the list. > > Ah, never thought of using the CRC directly. I was thinking > about inflating into void and see if it succeeds, which as you > say is perhaps quite expensive. Unfortunately, the zlib CRC is of the _uncompressed_ data [1], so inflating the stream is still necessary to check for corruption. [1] RFC 1950, "ZLIB Compressed Data Format Specification", May 1996. Pages 4-6. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-01 17:45 ` A Large Angry SCM @ 2006-09-01 18:35 ` Linus Torvalds 2006-09-01 19:56 ` Junio C Hamano 2006-09-01 23:14 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-01 18:35 UTC (permalink / raw) To: A Large Angry SCM; +Cc: Junio C Hamano, Nicolas Pitre, git On Fri, 1 Sep 2006, A Large Angry SCM wrote: > > Unfortunately, the zlib CRC is of the _uncompressed_ data [1], so > inflating the stream is still necessary to check for corruption. I don't think that is unfortunate. We really should inflate the stream anyway, since not only inflating it, but also applying any deltas to the base object is really the only way to verify its correctness for a delta thing. Otherwise, the SHA1 of the base could be totally corrupt. And once you inflate it and apply all deltas, you obviously also get the full SHA1 check, so you're _really_ safe. So let's do the really safe thing first, and see if it actually results in any problems. NOTE NOTE NOTE! We might well choose to do this checking _only_ when we write the index file (ie we have "!pack_to_stdout" set). Why? Because if we pack to stdout, and don't generate an index file, then by _definition_ the other end has to do the index generation, which means that the other end will be doing all the SHA1 re-calculation and sanity checking (and thus inflation and CRC checking). So this means that if we only do this for the "!pack_to_stdout" case, we won't be adding any overhead to the git network protocol server, only to "git repack". Which is exactly what we want to do. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Mozilla .git tree 2006-09-01 18:35 ` Linus Torvalds @ 2006-09-01 19:56 ` Junio C Hamano 2006-09-01 23:14 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano 1 sibling, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-01 19:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 1 Sep 2006, A Large Angry SCM wrote: >> >> Unfortunately, the zlib CRC is of the _uncompressed_ data [1], so >> inflating the stream is still necessary to check for corruption. > > I don't think that is unfortunate. > > We really should inflate the stream anyway, since not only inflating it, > but also applying any deltas to the base object is really the only way to > verify its correctness for a delta thing. Otherwise, the SHA1 of the base > could be totally corrupt. > > And once you inflate it and apply all deltas, you obviously also get the > full SHA1 check, so you're _really_ safe. > > So let's do the really safe thing first, and see if it actually results in > any problems. I think this makes sense. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-01 18:35 ` Linus Torvalds 2006-09-01 19:56 ` Junio C Hamano @ 2006-09-01 23:14 ` Junio C Hamano 2006-09-02 0:23 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-01 23:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: git When reusing data from an existing pack and from new style loose objects, we used to just copy it straight into the resulting pack. Instead make sure they are not corrupt, but do so only when we are not streaming to stdout, in which case the receiving end will do the validation either by unpacking the stream or by constructing the .idx file. Signed-off-by: Junio C Hamano <junkio@cox.net> --- Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 1 Sep 2006, A Large Angry SCM wrote: >> >> Unfortunately, the zlib CRC is of the _uncompressed_ data [1], so >> inflating the stream is still necessary to check for corruption. > > I don't think that is unfortunate. > > We really should inflate the stream anyway, since not only inflating it, > but also applying any deltas to the base object is really the only way to > verify its correctness for a delta thing. Otherwise, the SHA1 of the base > could be totally corrupt. > > And once you inflate it and apply all deltas, you obviously also get the > full SHA1 check, so you're _really_ safe. > > So let's do the really safe thing first, and see if it actually results in > any problems. So here is an attempt to do this.
builtin-pack-objects.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--- 1 files changed, 60 insertions(+), 4 deletions(-) diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 46f524d..10ba866 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -65,6 +65,7 @@ static unsigned char pack_file_sha1[20]; static int progress = 1; static volatile sig_atomic_t progress_update; static int window = 10; +static int pack_to_stdout; /* * The object names in objects array are hashed with this hashtable, @@ -242,6 +243,55 @@ static int encode_header(enum object_typ return n; } +static int revalidate_one(struct object_entry *entry, + void *data, char *type, unsigned long size) +{ + unsigned char hdr[50]; + int hdrlen; + unsigned char sha1[20]; + + if (!data) + return -1; + if (size != entry->size) + return -1; + write_sha1_file_prepare(data, size, type, sha1, hdr, &hdrlen); + free(data); + if (hashcmp(sha1, entry->sha1)) + return -1; + return 0; +} + +/* + * we are going to reuse the existing pack entry data. make + * sure it is not corrupt. + */ +static int revalidate_pack_entry(struct object_entry *entry) +{ + void *data; + char type[20]; + unsigned long size; + struct pack_entry e; + + e.p = entry->in_pack; + e.offset = entry->in_pack_offset; + + /* the caller has already called use_packed_git() for us */ + data = unpack_entry_gently(&e, type, &size); + return revalidate_one(entry, data, type, size); +} + +static int revalidate_loose_object(struct object_entry *entry, + unsigned char *map, + unsigned long mapsize) +{ + /* we already know this is a loose object with new type header. 
*/ + void *data; + char type[20]; + unsigned long size; + data = unpack_sha1_file(map, mapsize, type, &size); + return revalidate_one(entry, data, type, size); +} + static unsigned long write_object(struct sha1file *f, struct object_entry *entry) { @@ -276,6 +326,9 @@ static unsigned long write_object(struct map = map_sha1_file(entry->sha1, &mapsize); if (map && !legacy_loose_object(map)) { /* We can copy straight into the pack file */ + if (revalidate_loose_object(entry, map, mapsize)) + die("corrupt loose object %s", + sha1_to_hex(entry->sha1)); sha1write(f, map, mapsize); munmap(map, mapsize); written++; @@ -286,7 +339,7 @@ static unsigned long write_object(struct munmap(map, mapsize); } - if (! to_reuse) { + if (!to_reuse) { buf = read_sha1_file(entry->sha1, type, &size); if (!buf) die("unable to read %s", sha1_to_hex(entry->sha1)); @@ -319,6 +372,9 @@ static unsigned long write_object(struct datalen = find_packed_object_size(p, entry->in_pack_offset); buf = (char *) p->pack_base + entry->in_pack_offset; + + if (revalidate_pack_entry(entry)) + die("corrupt delta in pack %s", sha1_to_hex(entry->sha1)); sha1write(f, buf, datalen); unuse_packed_git(p); hdrlen = 0; /* not really */ @@ -1163,7 +1219,7 @@ static void prepare_pack(int window, int find_deltas(sorted_by_type, window+1, depth); } -static int reuse_cached_pack(unsigned char *sha1, int pack_to_stdout) +static int reuse_cached_pack(unsigned char *sha1) { static const char cache[] = "pack-cache/pack-%s.%s"; char *cached_pack, *cached_idx; @@ -1247,7 +1303,7 @@ int cmd_pack_objects(int argc, const cha { SHA_CTX ctx; char line[40 + 1 + PATH_MAX + 2]; - int depth = 10, pack_to_stdout = 0; + int depth = 10; struct object_entry **list; int num_preferred_base = 0; int i; @@ -1367,7 +1423,7 @@ int cmd_pack_objects(int argc, const cha if (progress && (nr_objects != nr_result)) fprintf(stderr, "Result has %d objects.\n", nr_result); - if (reuse_cached_pack(object_list_sha1, pack_to_stdout)) + if 
(reuse_cached_pack(object_list_sha1)) ; else { if (nr_result) -- 1.4.2.ga2654 -- VGER BF report: U 0.5 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-01 23:14 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano @ 2006-09-02 0:23 ` Linus Torvalds 2006-09-02 1:39 ` VGER BF report? Johannes Schindelin ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-02 0:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 1 Sep 2006, Junio C Hamano wrote: > > [...] Instead make sure they are not corrupt, but > do so only when we are not streaming to stdout, [...] Hmm. I see you making pack_to_stdout available to those functions, but I don't actually see you using it - looks like you revalidate regardless. Which is safe, of course, but it doesn't match your description ;) Linus -- VGER BF report: H 0.0110285 ^ permalink raw reply [flat|nested] 52+ messages in thread
* VGER BF report? 2006-09-02 0:23 ` Linus Torvalds @ 2006-09-02 1:39 ` Johannes Schindelin 2006-09-02 5:58 ` Sam Ravnborg 2006-09-02 1:52 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano 2006-09-02 3:52 ` Junio C Hamano 2 siblings, 1 reply; 52+ messages in thread From: Johannes Schindelin @ 2006-09-02 1:39 UTC (permalink / raw) To: git > VGER BF report: H 0.0110285 What is this? -- VGER BF report: H 0.00947004 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: VGER BF report? 2006-09-02 1:39 ` VGER BF report? Johannes Schindelin @ 2006-09-02 5:58 ` Sam Ravnborg 0 siblings, 0 replies; 52+ messages in thread From: Sam Ravnborg @ 2006-09-02 5:58 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git On Sat, Sep 02, 2006 at 03:39:30AM +0200, Johannes Schindelin wrote: > > VGER BF report: H 0.0110285 > > What is this? vger.kernel.org is experimenting with bogofilter. Seems to be a ham score. Matti announced the use of bogofilter on linux-kernel yesterday. Sam -- VGER BF report: U 0.497275 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 0:23 ` Linus Torvalds 2006-09-02 1:39 ` VGER BF report? Johannes Schindelin @ 2006-09-02 1:52 ` Junio C Hamano 2006-09-02 3:52 ` Junio C Hamano 2 siblings, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 1:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 1 Sep 2006, Junio C Hamano wrote: >> >> [...] Instead make sure they are not corrupt, but >> do so only when we are not streaming to stdout, [...] > > Hmm. I see you making pack_to_stdout available to those functions, but I > don't actually see you using it - looks like you revalidate regardless. > > Which is safe, of course, but it doesn't match your description ;) Oops, that was a thinko. My previous crappoid had if (pack_to_stdout) return 0; at the beginning of revalidate function. -- VGER BF report: U 0.499951 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 0:23 ` Linus Torvalds 2006-09-02 1:39 ` VGER BF report? Johannes Schindelin 2006-09-02 1:52 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano @ 2006-09-02 3:52 ` Junio C Hamano 2006-09-02 4:52 ` Shawn Pearce 2006-09-02 18:43 ` Linus Torvalds 2 siblings, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 3:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 1 Sep 2006, Junio C Hamano wrote: >> >> [...] Instead make sure they are not corrupt, but >> do so only when we are not streaming to stdout, [...] > > Hmm. I see you making pack_to_stdout available to those functions, but I > don't actually see you using it - looks like you revalidate regardless. > > Which is safe, of course, but it doesn't match your description ;) But "git repack -a -d", which you now consider almost free, in the recent kernel repository counts 300k objects, and reuses 298k objects or so. That means we expand and recompress that many objects, totalling 120MB. It might be worthwhile to disable revalidating reused objects individually and instead scan and checksum the entire .pack file when the number of objects being reused exceeds a certain threshold, relative to the number of objects in the existing pack, perhaps. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 3:52 ` Junio C Hamano @ 2006-09-02 4:52 ` Shawn Pearce 2006-09-02 9:42 ` Junio C Hamano 2006-09-02 10:09 ` Junio C Hamano 2006-09-02 18:43 ` Linus Torvalds 1 sibling, 2 replies; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 4:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano <junkio@cox.net> wrote: > It might be worthwhile to disable revalidating reused objects > individually and instead scan and checksum the entire .pack file > when the number of objects being reused exceeds a certain > threshold, relative to the number of objects in the existing pack, > perhaps. Correct me if I'm wrong, but didn't this revalidate check happen because the SHA1 of the pack was correct but there was a bad bit in the zlib stream? If we are trying to detect such an error before removing the possibly valid pack, how are we supposed to do that if we are bypassing the code on larger packs? I think the better thing to do here is to not repack objects which are already contained in very large packs. Just leave them be. If the pack you are about to copy an object out of is over 25 MiB, you aren't outputting to stdout, and the object isn't needed as a delta base in the new pack, then don't copy it. Introduce a new flag to git-pack-objects such as "--max-source-pack-size=100" which can be used to change this 25 MiB threshold; setting it to 0 would act as "-a" does today. This way users can repack with 'git repack -a -d' as though it were free and much less frequently (such as once a year) combine their medium-sized packs together based on a larger maximum threshold while still ignoring their really large historical packs. Note that you are never bypassing the deflate validation; before copying an object you *always* validate it is correct, even if the source pack SHA1 is correct. 
But this time-consuming validation should not be a big issue, as users shouldn't repack very large packs very frequently with this strategy. E.g. some kernel devs might repack once a year with --max-source-pack-size=512 (512 MiB) but during normal use accept the 25 MiB default and the slightly larger number of small packs that result. -- Shawn. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
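The reuse rule proposed in the message above (leave an object in its large source pack unless we are streaming to stdout or the object is needed as a delta base in the new pack) can be sketched as a small predicate. This is a hypothetical illustration only: neither the function name nor the flag exists in git-pack-objects, and the 25 MiB default just mirrors the figure from the mail.

```c
#include <assert.h>

/* Sketch of the proposed "--max-source-pack-size" decision.  A return
 * value of 1 means "leave the object where it is"; 0 means "copy it
 * into the new pack as usual".  A threshold of 0 disables the rule,
 * matching how "-a" behaves today. */
#define DEFAULT_MAX_SOURCE_PACK (25UL * 1024 * 1024)

static int keep_in_old_pack(unsigned long source_pack_size,
			    int pack_to_stdout,
			    int needed_as_delta_base,
			    unsigned long max_source_pack)
{
	if (!max_source_pack)		/* threshold disabled: repack everything */
		return 0;
	if (pack_to_stdout)		/* the receiving end validates the stream */
		return 0;
	if (needed_as_delta_base)	/* must be present in the new pack */
		return 0;
	return source_pack_size > max_source_pack;
}
```

With the default threshold, only objects in packs over 25 MiB that satisfy the other two conditions would stay behind; --max-source-pack-size=512 would raise the cutoff for the rare "combine everything" repack described above.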
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 4:52 ` Shawn Pearce @ 2006-09-02 9:42 ` Junio C Hamano 2006-09-02 17:43 ` Linus Torvalds 2006-09-02 10:09 ` Junio C Hamano 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 9:42 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: >> It might be worthwhile to disable revalidating reused objects >> individually and instead scan and checksum the entire .pack file >> when the number of objects being reused exceeds a certain >> threshold, relative to the number of objects in the existing pack, >> perhaps. > > Correct me if I'm wrong, but didn't this revalidate check happen > because the SHA1 of the pack was correct but there was a bad bit > in the zlib stream? There are two more-or-less independent problems, and you are correct to point out that summing the entire .pack does not catch one class of problem while it does catch the other. Linus's theory goes like this: (1) A bit in an existing pack was damaged somehow. It might have happened on the mothership machine when it was first created, or after it was read and copied to the notebook via rsync. (2) A 'repack -a -d' was done in a repository that had that damaged pack; it decided to reuse a deflated object from the existing, damaged pack, and copied that deflated representation into a newly created pack without validating. (3) The 'repack -a -d', when finishing the newly created pack, computed a checksum over the whole new pack and wrote the SHA-1 checksum in it. Now, sha1_file() has check_packed_git_idx() that makes sure the .idx file is not corrupted; it runs the checksum over the whole .idx data when it first maps in and makes sure the sum matches what is recorded at the end. So if the corruption in (1) happened to the .idx file, 'repack -a -d' in (2) would have noticed and refused to use the corresponding .pack. 
The problem is that there is, however, no corresponding "checksum over the whole file before use" for the .pack file during normal operation of use_packed_git(). Otherwise we would have caught the corruption in the existing pack and prevented step (2) from propagating the error. The idea proposed by Linus and implemented in the patch in this thread is to mitigate this by revalidating the individual pieces we copy in (2). If we copy out most of what is in the existing pack, however, it may be cheaper to do the "whole file checksum before use" instead. On the other hand, the "alpha particle at the right moment" theory goes like this: (1) write_object() gave the buffer to sha1write_compressed(); (2) sha1write_compressed() asked zlib to deflate the data and received the result in the buffer pointed to by void *out; (3) a bit flipped in that buffer, after zlib finished writing into it with the CRC; (4) sha1write() wrote out the now-corrupt buffer, while computing the correct checksum for the end result (which is now corrupt). This breakage will not be caught unless we revalidate everything we copy out to the new pack. -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
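The "checksum over the whole file before use" that check_packed_git_idx() performs follows a simple pattern: the last bytes of the file record a checksum of everything before them, and the whole mapped file is summed once up front. Below is a minimal sketch of that trailer-verification pattern. Note the assumption: git actually uses SHA-1 (a 20-byte trailer); here a toy 4-byte sum stands in for SHA-1 so the sketch stays self-contained, and none of these names come from the git source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for SHA-1: a 4-byte polynomial sum of the data.
 * Only the trailer-verification pattern is the point here. */
#define TRAILER_LEN 4

static void toy_sum(const unsigned char *data, size_t len,
		    unsigned char out[TRAILER_LEN])
{
	unsigned long sum = 0;
	size_t i;
	for (i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	for (i = 0; i < TRAILER_LEN; i++)
		out[i] = (sum >> (8 * i)) & 0xff;
}

/* Checksum everything before the trailer and compare against the
 * recorded trailer; returns 0 if intact, -1 on corruption or a file
 * too short to hold a trailer. */
static int verify_trailing_checksum(const unsigned char *map, size_t mapsize)
{
	unsigned char expect[TRAILER_LEN];
	if (mapsize < TRAILER_LEN)
		return -1;
	toy_sum(map, mapsize - TRAILER_LEN, expect);
	return memcmp(map + mapsize - TRAILER_LEN, expect, TRAILER_LEN) ? -1 : 0;
}
```

A single bit flip anywhere in the body changes the computed sum, so the mismatch is caught before any object in the file is trusted, which is exactly what the .idx gets today and the .pack does not.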
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 9:42 ` Junio C Hamano @ 2006-09-02 17:43 ` Linus Torvalds 0 siblings, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-02 17:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, git On Sat, 2 Sep 2006, Junio C Hamano wrote: > > Linus's theory goes like this: > > (1) A bit in an existing pack was damaged somehow. It might have > happened on the mothership machine when it was first created, > or after it was read and copied to the notebook via rsync. NOTE! With the new loose object format, this will be true even of individually packed files (if you set "core.legacyheaders" to false). So checking the SHA1 of the pack-files is insufficient - at least for those loose objects. So revalidating the individual objects will catch that case too, while revalidating the SHA1 of the old pack-files won't. Linus -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 4:52 ` Shawn Pearce 2006-09-02 9:42 ` Junio C Hamano @ 2006-09-02 10:09 ` Junio C Hamano 2006-09-02 17:54 ` Shawn Pearce 2006-09-03 0:27 ` Linus Torvalds 1 sibling, 2 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 10:09 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, git Shawn Pearce <spearce@spearce.org> writes: > I think the better thing to do here is to not repack objects which > are already contained in very large packs. Just leave them be. I've been thinking about updating rev-list so that repack can be used to organize packs into zero or more "archive packs" and one "active pack". repack without -a essentially boils down to: rev-list --objects --all --unpacked | pack-objects $new_pack which picks up all live loose objects and creates a new pack. If rev-list had an extension that lets you say rev-list --objects --all --unpacked=$active_pack | pack-objects $new_pack instead, to mean "all live loose objects but pretend as if objects that are in this pack are also unpacked", then the newly created pack would be perfect for updating $active_pack by replacing it. Perhaps something like this. 
-- >8 -- diff --git a/builtin-count-objects.c b/builtin-count-objects.c index 1d3729a..73c5982 100644 --- a/builtin-count-objects.c +++ b/builtin-count-objects.c @@ -62,7 +62,7 @@ static void count_objects(DIR *d, char * hex[40] = 0; if (get_sha1_hex(hex, sha1)) die("internal error"); - if (has_sha1_pack(sha1)) + if (has_sha1_pack(sha1, NULL)) (*packed_loose)++; } } diff --git a/builtin-prune-packed.c b/builtin-prune-packed.c index d3dd94d..960db49 100644 --- a/builtin-prune-packed.c +++ b/builtin-prune-packed.c @@ -19,7 +19,7 @@ static void prune_dir(int i, DIR *dir, c memcpy(hex+2, de->d_name, 38); if (get_sha1_hex(hex, sha1)) continue; - if (!has_sha1_pack(sha1)) + if (!has_sha1_pack(sha1, NULL)) continue; memcpy(pathname + len, de->d_name, 38); if (dryrun) diff --git a/cache.h b/cache.h index 7257c4c..b50a823 100644 --- a/cache.h +++ b/cache.h @@ -259,7 +259,7 @@ extern int write_sha1_from_fd(const unsi extern int write_sha1_to_fd(int fd, const unsigned char *sha1); extern int move_temp_to_file(const char *tmpfile, const char *filename); -extern int has_sha1_pack(const unsigned char *sha1); +extern int has_sha1_pack(const unsigned char *sha1, const char *ignore); extern int has_sha1_file(const unsigned char *sha1); extern void *map_sha1_file(const unsigned char *sha1, unsigned long *); extern int legacy_loose_object(unsigned char *); diff --git a/revision.c b/revision.c index b588f74..fad3b24 100644 --- a/revision.c +++ b/revision.c @@ -416,7 +416,8 @@ static void limit_list(struct rev_info * if (revs->max_age != -1 && (commit->date < revs->max_age)) obj->flags |= UNINTERESTING; - if (revs->unpacked && has_sha1_pack(obj->sha1)) + if (revs->unpacked && + has_sha1_pack(obj->sha1, revs->ignore_packed)) obj->flags |= UNINTERESTING; add_parents_to_list(revs, commit, &list); if (obj->flags & UNINTERESTING) { @@ -737,6 +738,12 @@ int setup_revisions(int argc, const char } if (!strcmp(arg, "--unpacked")) { revs->unpacked = 1; + revs->ignore_packed = NULL; + continue; + 
} + if (!strncmp(arg, "--unpacked=", 11)) { + revs->unpacked = 1; + revs->ignore_packed = arg + 11; continue; } if (!strcmp(arg, "-r")) { @@ -1046,7 +1053,8 @@ struct commit *get_revision(struct rev_i */ if (!revs->limited) { if ((revs->unpacked && - has_sha1_pack(commit->object.sha1)) || + has_sha1_pack(commit->object.sha1, + revs->ignore_packed)) || (revs->max_age != -1 && (commit->date < revs->max_age))) continue; diff --git a/revision.h b/revision.h index d289781..843aeea 100644 --- a/revision.h +++ b/revision.h @@ -57,6 +57,9 @@ struct rev_info { unsigned int shown_one:1, abbrev_commit:1, relative_date:1; + + const char *ignore_packed; + unsigned int abbrev; enum cmit_fmt commit_format; struct log_info *loginfo; diff --git a/sha1_file.c b/sha1_file.c index 76f66e6..7031f65 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1189,12 +1189,15 @@ int find_pack_entry_one(const unsigned c return 0; } -static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e) +static int find_pack_entry(const unsigned char *sha1, + struct pack_entry *e, const char *ignore_packed) { struct packed_git *p; prepare_packed_git(); for (p = packed_git; p; p = p->next) { + if (ignore_packed && !strcmp(p->pack_name, ignore_packed)) + continue; if (find_pack_entry_one(sha1, e, p)) return 1; } @@ -1227,10 +1230,10 @@ int sha1_object_info(const unsigned char if (!map) { struct pack_entry e; - if (find_pack_entry(sha1, &e)) + if (find_pack_entry(sha1, &e, NULL)) return packed_object_info(&e, type, sizep); reprepare_packed_git(); - if (find_pack_entry(sha1, &e)) + if (find_pack_entry(sha1, &e, NULL)) return packed_object_info(&e, type, sizep); return error("unable to find %s", sha1_to_hex(sha1)); } @@ -1253,7 +1256,7 @@ static void *read_packed_sha1(const unsi { struct pack_entry e; - if (!find_pack_entry(sha1, &e)) { + if (!find_pack_entry(sha1, &e, NULL)) { error("cannot read sha1_file for %s", sha1_to_hex(sha1)); return NULL; } @@ -1266,7 +1269,7 @@ void * read_sha1_file(const 
unsigned cha void *map, *buf; struct pack_entry e; - if (find_pack_entry(sha1, &e)) + if (find_pack_entry(sha1, &e, NULL)) return read_packed_sha1(sha1, type, size); map = map_sha1_file(sha1, &mapsize); if (map) { @@ -1275,7 +1278,7 @@ void * read_sha1_file(const unsigned cha return buf; } reprepare_packed_git(); - if (find_pack_entry(sha1, &e)) + if (find_pack_entry(sha1, &e, NULL)) return read_packed_sha1(sha1, type, size); return NULL; } @@ -1707,10 +1710,10 @@ int has_pack_file(const unsigned char *s return 1; } -int has_sha1_pack(const unsigned char *sha1) +int has_sha1_pack(const unsigned char *sha1, const char *ignore_packed) { struct pack_entry e; - return find_pack_entry(sha1, &e); + return find_pack_entry(sha1, &e, ignore_packed); } int has_sha1_file(const unsigned char *sha1) @@ -1718,7 +1721,7 @@ int has_sha1_file(const unsigned char *s struct stat st; struct pack_entry e; - if (find_pack_entry(sha1, &e)) + if (find_pack_entry(sha1, &e, NULL)) return 1; return find_sha1_file(sha1, &st) ? 1 : 0; } -- VGER BF report: U 0.5 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 10:09 ` Junio C Hamano @ 2006-09-02 17:54 ` Shawn Pearce 2006-09-03 21:00 ` Junio C Hamano 2006-09-03 0:27 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-02 17:54 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > > I think the better thing to do here is to not repack objects which > > are already contained in very large packs. Just leave them be. > > I've been thinking about updating rev-list so that repack > can be used to organize packs into zero or more "archive packs" > and one "active pack". > > repack without -a essentially boils down to: > > rev-list --objects --all --unpacked | > pack-objects $new_pack > > which picks up all live loose objects and creates a new pack. > > If rev-list had an extension that lets you say > > rev-list --objects --all --unpacked=$active_pack | > pack-objects $new_pack Hmm. Seems very reasonable actually. :-) How do we pick the "active pack" in git-repack.sh? How about "--include-pack=$active_pack" instead of "--unpacked=$active_pack"? The latter just reads really funny to me. -- Shawn. -- VGER BF report: S 0.9993 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 17:54 ` Shawn Pearce @ 2006-09-03 21:00 ` Junio C Hamano 2006-09-04 4:10 ` Shawn Pearce 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-03 21:00 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: > >> If rev-list had an extension that lets you say >> >> rev-list --objects --all --unpacked=$active_pack | >> pack-objects $new_pack > > Hmm. Seems very reasonable actually. :-) > > How do we pick the "active pack" in git-repack.sh? I was hoping that would be a Porcelain policy issue I do not have to decide ;-). You could introduce repack.active in .git/config that points at the latest active, and make git-repack notice and update it when it repacks. We could also just use .git/objects/pack/pack-active.{pack,idx} files. This needs some surgery to get rid of packed_git.sha1[], sha1_pack_name() and friends and have them require only that .pack and .idx are linked by their basename, as was discussed in a separate thread to make it dumb-transport friendly. -- VGER BF report: U 0.658122 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 21:00 ` Junio C Hamano @ 2006-09-04 4:10 ` Shawn Pearce 2006-09-04 5:50 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-04 4:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > > Junio C Hamano <junkio@cox.net> wrote: > > > >> If rev-list had an extension that lets you say > >> > >> rev-list --objects --all --unpacked=$active_pack | > >> pack-objects $new_pack > > > > Hmm. Seems very reasonable actually. :-) > > > > How do we pick the "active pack" in git-repack.sh? > > I was hoping that would be a Porcelain policy issue I do not have > to decide ;-). But would need to update git-repack.sh. :-) > You could introduce repack.active in .git/config that points at > the latest active, and make git-repack notice and update it when > it repacks. No. I don't think it's a good idea for git-repack to be updating your configuration file during a repack. If something goes badly there you could lose settings you would rather not lose. Better that it update a symref-like thing instead. For example, create a ".git/objects/pack/active" file holding the name ("pack-n{40}.pack") of the current active pack. If this file is missing, choose the pack that is smallest as the active pack. -- Shawn. -- VGER BF report: S 0.999999 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-04 4:10 ` Shawn Pearce @ 2006-09-04 5:50 ` Junio C Hamano 2006-09-04 6:44 ` Shawn Pearce 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-04 5:50 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: >... >> You could introduce repack.active in .git/config that points at >> the latest active, and make git-repack notice and update it when >> it repacks. >... > Better that it update a symref-like thing instead. For example, > create a ".git/objects/pack/active" file holding the name > ("pack-n{40}.pack") of the current active pack. If this file is > missing, choose the pack that is smallest as the active pack. Actually I like the one you did not quote even better the more I think about it. >> We could also just use .git/objects/pack/pack-active.{pack,idx} >> files. This needs some surgery to get rid of packed_git.sha1[], >> sha1_pack_name() and friends and have them require only that .pack >> and .idx are linked by their basename, as was discussed in a >> separate thread to make it dumb-transport friendly. The argument given as the reason (rather, excuse) why the dumb transport routines wanted to rely on packed_git.sha1[] and sha1_pack_name() was that we would need to avoid packname collisions _anyway_, so relying on the convention to have "pack-[0-9a-f]{40}.(pack|idx)" is OK or even desirable. We need to avoid packname collisions, and it is acceptable to assume that SHA-1 is practically collision-free like the rest of the system does. However, if the dumb transport wants to avoid packname collisions, it should not rely on the way the other side names its packs. 
It first downloads the .idx files, so it can compute the _right_ packname using the sorted object names recorded there [*1*], and store the downloaded pack/idx under the right name, without relying on the way the other side names its packs (it still needs to rely on the names ending with .pack and .idx, and on .pack and .idx corresponding with each other by their basenames; but the point is it should not depend on more than that, especially not that the basename is of the form "pack-[0-9a-f]{40}" or that the hex part is the correct packname). Now if we fix dumb transport downloaders, then we could even make a convention that the packs named pack-[0-9a-f]{40}.pack are archive packs. And git-repack can even have a convention that .git/objects/pack/pack-active.(pack|idx) is the active pack. [Footnote] *1* a refresher course on packname generation: it is the SHA-1 over the object names (20-byte binary representation) in the pack, sorted in byte order. See builtin-pack-objects.c for details. -- VGER BF report: S 0.99883 ^ permalink raw reply [flat|nested] 52+ messages in thread
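The footnote's recipe can be shown in miniature: sort the 20-byte binary object names in byte order, then hash the sorted concatenation. The sketch below performs only the canonical sorting step; the final SHA-1 over the sorted names (whose 40 hex digits become the pack name) is left as a comment, since it needs a SHA-1 implementation, and the function names are illustrative rather than taken from the git source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Object names are 20-byte binary SHA-1s; byte order (memcmp order) is
 * the order the .idx records them in. */
static int cmp_object_name(const void *a, const void *b)
{
	return memcmp(a, b, 20);
}

static void sort_object_names(unsigned char (*names)[20], size_t nr)
{
	qsort(names, nr, 20, cmp_object_name);
	/* git would now feed names[0], names[1], ... names[nr-1] to
	 * SHA-1 and use the resulting 40 hex digits as the packname
	 * in pack-<sha1>.pack / pack-<sha1>.idx. */
}
```

Because the input to the hash is the sorted set of object names, any two packs containing the same objects get the same name, regardless of the order the objects were fed to pack-objects, which is exactly the collision-avoidance property the message relies on.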
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-04 5:50 ` Junio C Hamano @ 2006-09-04 6:44 ` Shawn Pearce 2006-09-04 7:39 ` Junio C Hamano 0 siblings, 1 reply; 52+ messages in thread From: Shawn Pearce @ 2006-09-04 6:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano <junkio@cox.net> wrote: > Now if we fix dumb transport downloaders, then we could even > make a convention that the packs named pack-[0-9a-f]{40}.pack > are archive packs. And git-repack can even have a convention > that .git/objects/pack/pack-active.(pack|idx) is the active > pack. Seems reasonable. I take it you are proposing that a dumb transport always downloads pack-active.pack as pack-n{40}.pack, where the dumb protocol downloader computed the correct pack name from its contents. Thus any remote pack downloaded over a dumb transport is automatically treated as a historical pack by the receiving repository. This will cause someone tracking a remote repository over a dumb transport to need to repack only a subset of their historical packs frequently into their own active.pack while leaving other historical packs untouched. But the more that I think about this, neither solution (an active pack symref or pack-active.pack) really solves this. Being limited to just one active pack seems to be a problem with at least the dumb transports. I think that's why I preferred the size threshold idea. The active packs are cheap to repack because they are small. The larger packs aren't cheap to repack because they are large - and probably historical. What we are trying to get is fast repacks for the active objects while still getting full validation anytime we do a repack and (possibly) destroy the source. A size threshold does it. When Jon Smirl and I started kicking around the idea of a historical pack for Mozilla, I was thinking of just storing a list of pack base names in ".git/objects/pack/historical". 
Packs listed there should generally be exempt from repacking. During an initial clone we'd need to deliver the contents of that file to the new repository, since if the source considers a pack historical, it's likely the new repository would want to as well. But now, as I write this email, I'm thinking that it may be just as easy to change the base name of the pack to "hist-n{40}" when we want to consider it historical. [snipped and re-ordered] > It first downloads the .idx files, so it can compute the > _right_ packname using the sorted object names recorded there Why trust the .idx? I've seen you post that the .idx is purely a local matter. 
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-04 6:44 ` Shawn Pearce @ 2006-09-04 7:39 ` Junio C Hamano 0 siblings, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-04 7:39 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: >> Now if we fix dumb transport downloaders, then we could even >> make a convention that the packs named pack-[0-9a-f]{40}.pack >> are archive packs. And git-repack can even have a convention >> that .git/objects/pack/pack-active.(pack|idx) is the active >> pack. > > Seems reasonable. > > I take it you are proposing that a dumb transport always downloads > pack-active.pack as pack-n{40}.pack where the dumb protocol > downloader computed the correct pack name from its contents. Thus > any remote pack downloaded over a dumb transport is automatically > treated as a historical pack by the receiving repository. It is a lot lot lot stronger than that. If the dumb transport (or git in general) uses the [0-9a-f]{40} part as the assurance mechanism to avoid packname collisions, it should be prepared to recompute the correct name for not just pack-active but _all_ remote packs and store them under the right name. Also, if repack uses the convention to treat pack-active.* as the active pack, it might make sense for the dumb transport to use that as a clue and explode what the other side calls pack-active.* upon reception. For the purpose of the repacking decision by the person at the other end, they are subject to frequent repacking, so maybe we should treat them as such. > [snipped and re-ordered] >> It first downloads the .idx files, so it can compute the >> _right_ packname using the sorted object names recorded there > > Why trust the .idx? I've seen you post that the .idx is purely > a local matter. 
The "smart" Git protocol only receives the .pack > from the remote and computes the .idx locally or unpacks it to loose > objects locally; why should a dumb transport trust the remote .idx? The remote .idx is paired with the remote .pack and must correctly index into that .pack, otherwise the remote person could not use the pack locally ;-). Note that we are not talking about a malicious repository. The issue here is not about _trusting_, but about adjusting the pack's name to its contents to match your local convention, which is what the recent pack-objects spits out to its standard output. Older versions of pack-objects named the resulting pack pretty much randomly (they used the SHA-1 checksum of the object names in input order, not in index order, which is sorted, so the same pack _could_ have different names), and their output files should be considered valid packs if they are still around. -- VGER BF report: S 0.999988 ^ permalink raw reply [flat|nested] 52+ messages in thread
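Junio's distinction between input-order and index-order (sorted) naming can be made concrete. Below is a small Python sketch of the idea only; the `pack_name` helper is a made-up stand-in, not git's actual implementation (git hashes the 20-byte object names recorded in the .idx). Hashing the names in sorted order makes the pack name a function of the object *set*, while the older input-order scheme let the very same pack carry different names depending on feed order.

```python
import hashlib

def pack_name(object_ids, sort=True):
    # Illustrative only: derive a pack name by hashing the object
    # names, either in sorted (index) order or in input order.
    ids = sorted(object_ids) if sort else list(object_ids)
    h = hashlib.sha1()
    for oid in ids:
        h.update(bytes.fromhex(oid))
    return h.hexdigest()

oids = ["a" * 40, "b" * 40, "c" * 40]

# Sorted-order naming: the same set of objects always yields the same name,
# so a downloader can recompute and verify it from the index contents.
assert pack_name(oids) == pack_name(reversed(oids))

# Input-order naming (the old behaviour): the name depends on feed order,
# so the same pack could legitimately exist under different names.
assert pack_name(oids, sort=False) != pack_name(list(reversed(oids)), sort=False)
```

This is exactly why a dumb-transport downloader can treat the remote name as advisory and rename the pack to match its own locally recomputed name.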
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 10:09 ` Junio C Hamano 2006-09-02 17:54 ` Shawn Pearce @ 2006-09-03 0:27 ` Linus Torvalds 2006-09-03 0:32 ` Junio C Hamano 2006-09-05 8:12 ` Junio C Hamano 1 sibling, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-03 0:27 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, git On Sat, 2 Sep 2006, Junio C Hamano wrote: > > repack without -a essentially boils down to: > > rev-list --objects --all --unpacked | > pack-objects $new_pack > > which picks up all live loose objects and creates a new pack. > > If rev-list had an extension that lets you say > > rev-list --objects --all --unpacked=$active_pack | > pack-objects $new_pack I have to say that adding another pack-objects-specific flag to rev-list doesn't necessarily strike me as very nice. It might be much better to just teach "git pack-objects" to walk the object list by hand. The whole "git-rev-list | git-pack-objects" approach made more sense back when git-rev-list was the _only_ thing listing objects, but now that we've librarized the whole object listing thing, maybe it's time to avoid the pipeline and just do it inside the packer logic. Of course, having git-pack-objects be able to take input from stdin is still useful, but I'd rather start moving the obviously packing-specific flags out of git-rev-list, and into git-pack-objects instead. [ That would also potentially make packing more efficient - right now the "phase 1" of packing is literally to just figure out the type and the size of the object, in order to sort the object list. So when you do a big repack, first "git-rev-list" ends up doing all this work, and then "git-pack-objects" ends up _redoing_ some of it. Especially for tree objects (which are one of the most common kinds), we already opened the object once when we traversed it, so opening it again just to look at its size is kind of sad..
] We also used to have a bug with repacking, where a git-rev-list that died with a failure would silently cause a "good" repack to happen because it had fed git-pack-objects an incomplete list of objects. Junio fixed that with a hack, but if we did all the object listing inside git-pack-objects, we'd not need that kind of thing at all.. Hmm? Comments? Linus -- VGER BF report: U 0.500053 ^ permalink raw reply [flat|nested] 52+ messages in thread
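The silent-failure mode Linus mentions is generic to shell pipelines: a pipeline's exit status is that of its last command, so a producer that dies after emitting partial output goes unnoticed by the consumer. A small Python/POSIX-sh sketch of the effect (illustrative only, not git's code; `printf`/`wc` stand in for rev-list and pack-objects):

```python
import subprocess

# A producer that emits two items and then dies, piped into a consumer.
# The pipeline as a whole still reports success, and the consumer has
# quietly operated on a truncated list -- the repack hazard described above.
producer = "printf 'obj1\\nobj2\\n'; exit 1"
r = subprocess.run("( %s ) | wc -l" % producer,
                   shell=True, capture_output=True, text=True)

assert r.returncode == 0           # the producer's failure is invisible
assert r.stdout.split() == ["2"]   # consumer saw only the partial output
```

Modern shells offer `set -o pipefail` to surface such producer failures; doing the listing inside the packer, as proposed above, removes the pipe (and the hazard) entirely.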
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 0:27 ` Linus Torvalds @ 2006-09-03 0:32 ` Junio C Hamano 2006-09-05 8:12 ` Junio C Hamano 1 sibling, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-03 0:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > I have to say that adding another pack-objects-specific flag to rev-list > doesn't necessarily strike me as very nice. > > It might be much better to just teach "git pack-objects" to walk the > object list by hand. The whole "git-rev-list | git-pack-objects" approach > made more sense back when git-rev-list was the _only_ thing listing > objects, but now that we've librarized the whole object listing thing, > maybe it's time to avoid the pipeline and just do it inside the packer > logic. >... > Hmm? Comments? Yes, it all makes perfect sense, though I am a bit too tired to assess how much work it would involve right now. -- VGER BF report: S 0.993177 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 0:27 ` Linus Torvalds 2006-09-03 0:32 ` Junio C Hamano @ 2006-09-05 8:12 ` Junio C Hamano 1 sibling, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-05 8:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Sat, 2 Sep 2006, Junio C Hamano wrote: >> >> repack without -a essentially boils down to: >> >> rev-list --objects --all --unpacked | >> pack-objects $new_pack >> >> which picks up all live loose objects and creates a new pack. >> >> If rev-list had an extension that lets you say >> >> rev-list --objects --all --unpacked=$active_pack | >> pack-objects $new_pack > > I have to say that adding another pack-objects-specific flag to rev-list > doesn't necessarily strike me as very nice. > > It might be much better to just teach "git pack-objects" to walk the > object list by hand. The whole "git-rev-list | git-pack-objects" approach > made more sense back when git-rev-list was the _only_ thing listing > objects, but now that we've librarized the whole object listing thing, > maybe it's time to avoid the pipeline and just do it inside the packer > logic. > > Of course, having git-pack-objects be able to take input from stdin is > still useful, but I'd rather start moving the obviously packing-specific > flags out of git-rev-list, and into git-pack-objects instead. > > [ That would also potentially make packing more efficient - right now the > "phase 1" of packing is literally to just figure out the type and the > size of the object, in order to sort the object list. > > So when you do a big repack, first "git-rev-list" ends up doing all this > work, and then "git-pack-objects" ends up _redoing_ some of it. > > Especially for tree objects (which are one of the most common kinds), we > already opened the object once when we traversed it, so opening it again > just to look at its size is kind of sad..
] Well, I tried (see the two patches from tonight; the result is sitting in "pu"), but it turns out that the information pack-objects wants to get in check_object() is somewhat different from what the "struct object" family is willing to give back easily during the traversal done by the get_revision() interface. Since "struct object" traditionally has not had provision for recording size (the created_object() interface just takes a sha1 without a size, for example), and it was slimmed down to contain the absolute minimum information, I am reluctant to add fields to record the size there. Also, having the in-pack size available would be rather nice, but the functions involved in the call chain (including process_tree and process_blob) did not even care where the objects are stored, so it would involve rather nasty surgery. So I stopped short of that. I am not quite sure how useful tonight's patches are in practice. They get rid of the pipe, but move the revision walker memory pressure from rev-list to pack-objects itself, so... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 3:52 ` Junio C Hamano 2006-09-02 4:52 ` Shawn Pearce @ 2006-09-02 18:43 ` Linus Torvalds 2006-09-02 20:56 ` Junio C Hamano 2006-09-03 21:48 ` Junio C Hamano 1 sibling, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-02 18:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 1 Sep 2006, Junio C Hamano wrote: > > But "git repack -a -d", which you now consider almost being > free, in the recent kernel repository counts 300k objects, and > reuses 298k objects or so. That means we expand and recompress > that many objects, totalling 120MB. Sure. Do we have data for how expensive that is (ie did you apply the patch and time it)? I'd rather be really safe by default, and then if somebody knows to trust their archive, maybe add a "--fast" flag (or even a "core.reliablepack" config option) to disable it for people who have backups and think their machines are infallible - or have slow CPUs.. For me, performance has always been one of the primary goals, but being able to trust the end result has been even _more_ primary. A lot of the design has centered around not doing things that are unsafe (eg the whole "never ever re-write an object" thing was obviously a big part of the design, and a lot of it is about being able to do things quickly _without_ having to do slow things like fsync). Linus -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 18:43 ` Linus Torvalds @ 2006-09-02 20:56 ` Junio C Hamano 2006-09-03 21:48 ` Junio C Hamano 1 sibling, 0 replies; 52+ messages in thread From: Junio C Hamano @ 2006-09-02 20:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > For me, performance has always been one of the primary goals, but being > able to trust the end result has been even _more_ primary. Agreed, violently ;-). -- VGER BF report: U 0.735423 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-02 18:43 ` Linus Torvalds 2006-09-02 20:56 ` Junio C Hamano @ 2006-09-03 21:48 ` Junio C Hamano 2006-09-03 22:00 ` Linus Torvalds 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-03 21:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 1 Sep 2006, Junio C Hamano wrote: >> >> But "git repack -a -d", which you now consider almost being >> free, in the recent kernel repository counts 300k objects, and >> reuses 298k objects or so. That means we expand and recompress >> that many objects, totalling 120MB. > > Sure. Do we have data for how expensive that is (ie did you apply the > patch and time it)? Quite bad. For the kernel archive of today (I usually am nearly fully packed): $ /usr/bin/time ~/git-master/bin/git-pack-objects p1 </var/tmp/1 Generating pack... Done counting 301361 objects. Deltifying 301361 objects. 100% (301361/301361) done Writing 301361 objects. 100% (301361/301361) done a13dc6646622537d29af92b4cfc6d49b82e77e49 Total 301361, written 301361 (delta 238935), reused 300995 (delta 238663) 3.58user 0.84system 0:04.44elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+62727minor)pagefaults 0swaps $ /usr/bin/time ../git.junio/git-pack-objects p2 </var/tmp/1 Generating pack... Done counting 301361 objects. Deltifying 301361 objects. 100% (301361/301361) done Writing 301361 objects. 100% (301361/301361) done a13dc6646622537d29af92b4cfc6d49b82e77e49 Total 301361, written 301361 (delta 238935), reused 300995 (delta 238663) 57.84user 3.39system 1:01.36elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1022768minor)pagefaults 0swaps By the way, the one in "next" has a thinko I just noticed. 
-- >8 -- [PATCH] pack-objects: fix thinko in revalidate code When revalidating an entry from an existing pack entry->size and entry->type are not necessarily the size of the final object when the entry is deltified, but for base objects they must match. Signed-off-by: Junio C Hamano <junkio@cox.net> --- builtin-pack-objects.c | 13 +++++++------ 1 files changed, 7 insertions(+), 6 deletions(-) diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 11cc3c8..5e42387 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -247,12 +247,13 @@ static int revalidate_one(struct object_ void *data, char *type, unsigned long size) { int err; - if (!data) - return -1; - if (size != entry->size) - return -1; - err = check_sha1_signature(entry->sha1, data, size, - type_names[entry->type]); + if ((!data) || + ((entry->type != OBJ_DELTA) && + ( (size != entry->size) || + strcmp(type_names[entry->type], type)))) + err = -1; + else + err = check_sha1_signature(entry->sha1, data, size, type); free(data); return err; } -- 1.4.2.g99d7d -- VGER BF report: U 0.528006 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 21:48 ` Junio C Hamano @ 2006-09-03 22:00 ` Linus Torvalds 2006-09-03 22:16 ` Linus Torvalds 2006-09-03 22:34 ` Junio C Hamano 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-03 22:00 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, 3 Sep 2006, Junio C Hamano wrote: > > Quite bad. For the kernel archive of today (I usually am nearly > fully packed): Ok. Is it less painful if it just checks the zlib CRC (and that the SHA1 _exists_ for a delta - although I guess we check that indirectly by just accepting the delta in the first place)? That combination should still be a fairly strong check, of course. Then we could have something like [repack] check=[none|weak|default|strong] where the "none" check would be to just copy the data as-is, the "weak" (aka "default") would check just the CRC, and the strong one would unpack the whole object and check the SHA1.. Linus -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 22:00 ` Linus Torvalds @ 2006-09-03 22:16 ` Linus Torvalds 2006-09-03 22:34 ` Junio C Hamano 1 sibling, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-03 22:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, 3 Sep 2006, Linus Torvalds wrote: > > Ok. Is it less painful if it just checks the zlib CRC (and that the SHA1 > _exists_ for a delta - although I guess we check that indirectly by just > accepting the delta in the first place)? That combination should still be > a fairly strong check, of course. Thinking some more about it, doing the stupid "apply full delta chain and check the final SHA1" is horribly inefficient, because if you have a repo that packs well, you'd expect to have a lot of things with a 10-deep delta chain. And doing it the silly way means that you'll do each object independently, ie for a 10-deep chain you'd unpack the base object ten times, and apply the first delta 9 times, the second one 8 times etc etc. And each time you'd deflate everything, since we don't keep a cache of actual object contents. So I'd expect that with full SHA1 checking, you'd end up doing ~45 deflates for the ten-object chain, instead of doing just 10. So it should hopefully be _much_ cheaper to just check the zlib CRC, not because the "apply delta" and "calculate sha1" are necessarily all that expensive, but because the unoptimized chain-unpacking is doing so much unnecessary work. Linus -- VGER BF report: U 0.5 ^ permalink raw reply [flat|nested] 52+ messages in thread
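Linus's back-of-the-envelope arithmetic checks out: if reconstructing the k-th object in a delta chain means inflating the base and every delta up to k, then validating each of n chained objects independently costs 1 + 2 + ... + n = n(n+1)/2 inflates, versus n with a cache of reconstructed objects. For n = 10 that is 55 inflates, 45 of them redundant, which matches the "~45" in the message. A quick Python check of the arithmetic (illustrative only):

```python
def naive_inflates(chain_len):
    # Without a cache, validating object k re-inflates the base and
    # every delta below it: k inflates for the k-th object in the chain.
    return sum(range(1, chain_len + 1))

def cached_inflates(chain_len):
    # With cached reconstructions, every object is inflated exactly once.
    return chain_len

n = 10
assert naive_inflates(n) == 55
assert cached_inflates(n) == 10
assert naive_inflates(n) - cached_inflates(n) == 45  # the redundant work
```

The quadratic blow-up, not the cost of any single delta application, is what makes the "apply the full chain and check the SHA-1" approach so expensive.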
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 22:00 ` Linus Torvalds 2006-09-03 22:16 ` Linus Torvalds @ 2006-09-03 22:34 ` Junio C Hamano 2006-09-04 4:06 ` Junio C Hamano 1 sibling, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-03 22:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 3 Sep 2006, Junio C Hamano wrote: >> >> Quite bad. For the kernel archive of today (I usually am nearly >> fully packed): > > Ok. Is it less painful if it just checks the zlib CRC... I haven't checked myself, but somebody said that the zlib CRC is of the preimage, so we would need to incur the inflate cost anyway if that is the case. But I think it may be a reasonable compromise to assume that an existing delta that inflates properly would apply to its base object, and if we can assume that, we do not have to check the inflated xdelta data. Oops, not really: there is no check other than the pack's overall SHA-1 checksum that protects the 20-byte base object name recorded in the pack. > ... and that the SHA1 > _exists_ for a delta - although I guess we check that indirectly by just > accepting the delta in the first place)? That combination should still be > a fairly strong check, of course. Another thing the current check is _not_ doing is guarding against this pathological case: - the .idx table says the pack entry is N bytes - unpack_entry_gently() used in the revalidate code uses the usual codepath that says "here is the start of the pack entry; inflate using as much data as you need"; the .idx is somehow wrong and it needed N+M bytes where 0 < M. - we copy out N bytes because we believe the .idx. -- VGER BF report: U 0.626721 ^ permalink raw reply [flat|nested] 52+ messages in thread
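Junio's recollection is right in spirit: the checksum at the end of a zlib stream (strictly an Adler-32 rather than a CRC; gzip streams use CRC-32) is computed over the *uncompressed* data, so verifying it inherently costs an inflate. A Python sketch, not git code, showing both that the trailer covers the preimage and that inflating a corrupted stream fails:

```python
import zlib

payload = b"tree 0\x00" * 64
blob = zlib.compress(payload)

# RFC 1950: the last four bytes of a zlib stream are the Adler-32 of the
# uncompressed data, stored big-endian -- a checksum of the preimage.
adler_trailer = int.from_bytes(blob[-4:], "big")
assert adler_trailer == zlib.adler32(payload)

# Corrupt one byte of the compressed body: inflation raises, either as an
# invalid deflate stream or as an "incorrect data check" on the trailer.
bad = bytearray(blob)
bad[10] ^= 0xFF
try:
    zlib.decompress(bytes(bad))
    raise AssertionError("corruption went undetected")
except zlib.error:
    pass
```

So a "check the zlib checksum" level of validation already implies paying the inflate cost, though it still avoids the far larger cost of applying whole delta chains and hashing the results.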
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-03 22:34 ` Junio C Hamano @ 2006-09-04 4:06 ` Junio C Hamano 2006-09-04 15:19 ` Linus Torvalds 0 siblings, 1 reply; 52+ messages in thread From: Junio C Hamano @ 2006-09-04 4:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Junio C Hamano <junkio@cox.net> writes: > Linus Torvalds <torvalds@osdl.org> writes: > >> Ok. Is it less painful if it just checks the zlib CRC... > > I haven't checked myself, but somebody said that the zlib CRC is of > the preimage, so we would need to incur the inflate cost anyway if that > is the case. But I think it may be a reasonable compromise to > assume that an existing delta that inflates properly would apply > to its base object, and if we can assume that, we do not have to > check the inflated xdelta data. > ... > Another thing the current check is _not_ doing is guarding against this > pathological case: > > - the .idx table says the pack entry is N bytes > > - unpack_entry_gently() used in the revalidate code uses the > usual codepath that says "here is the start of the pack > entry; inflate using as much data as you need"; the .idx is > somehow wrong and it needed N+M bytes where 0 < M. > > - we copy out N bytes because we believe the .idx. So here is a patch against "master" which contains none of the earlier patches in this thread. When copying from an existing pack, and when copying from a loose object with a new-style header, the code makes sure that the piece we are going to copy out inflates well and that inflate() consumes the data in full while doing so. The check to see if the xdelta really applies is quite expensive, as you described, because you would need to have the image of the base object, which can itself be represented as a delta against something else. The same repack takes this much with this code.
Total 301361, written 301361 (delta 238935), reused 300995 (delta 238663) 6.66user 0.67system 0:07.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+62728minor)pagefaults 0swaps --- diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 46f524d..149fa28 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -65,6 +65,7 @@ static unsigned char pack_file_sha1[20]; static int progress = 1; static volatile sig_atomic_t progress_update; static int window = 10; +static int pack_to_stdout; /* * The object names in objects array are hashed with this hashtable, @@ -242,6 +243,82 @@ static int encode_header(enum object_typ return n; } +static int check_inflate(unsigned char *data, unsigned long len, unsigned long expect) +{ + z_stream stream; + unsigned char fakebuf[4096]; + int st; + + memset(&stream, 0, sizeof(stream)); + stream.next_in = data; + stream.avail_in = len; + stream.next_out = fakebuf; + stream.avail_out = sizeof(fakebuf); + inflateInit(&stream); + + while (1) { + st = inflate(&stream, Z_FINISH); + if (st == Z_STREAM_END || st == Z_OK) { + st = (stream.total_out == expect && + stream.total_in == len) ? 0 : -1; + break; + } + if (st != Z_BUF_ERROR) { + st = -1; + break; + } + stream.next_out = fakebuf; + stream.avail_out = sizeof(fakebuf); + } + inflateEnd(&stream); + return st; +} + +/* + * we are going to reuse the existing pack entry data. make + * sure it is not corrupt. + */ +static int revalidate_pack_entry(struct object_entry *entry, unsigned char *data, unsigned long len) +{ + enum object_type type; + unsigned long size, used; + + if (pack_to_stdout) + return 0; + + /* the caller has already called use_packed_git() for us, + * so it is safe to access the pack data from mmapped location. + * make sure the entry inflates correctly. 
+ */ + used = unpack_object_header_gently(data, len, &type, &size); + if (!used) + return -1; + if (type == OBJ_DELTA) + used += 20; /* skip base object name */ + data += used; + len -= used; + return check_inflate(data, len, entry->size); +} + +static int revalidate_loose_object(struct object_entry *entry, + unsigned char *map, + unsigned long mapsize) +{ + /* we already know this is a loose object with new type header. */ + enum object_type type; + unsigned long size, used; + + if (pack_to_stdout) + return 0; + + used = unpack_object_header_gently(map, mapsize, &type, &size); + if (!used) + return -1; + map += used; + mapsize -= used; + return check_inflate(map, mapsize, size); +} + static unsigned long write_object(struct sha1file *f, struct object_entry *entry) { @@ -276,6 +353,9 @@ static unsigned long write_object(struct map = map_sha1_file(entry->sha1, &mapsize); if (map && !legacy_loose_object(map)) { /* We can copy straight into the pack file */ + if (revalidate_loose_object(entry, map, mapsize)) + die("corrupt loose object %s", + sha1_to_hex(entry->sha1)); sha1write(f, map, mapsize); munmap(map, mapsize); written++; @@ -286,7 +366,7 @@ static unsigned long write_object(struct munmap(map, mapsize); } - if (! to_reuse) { + if (!to_reuse) { buf = read_sha1_file(entry->sha1, type, &size); if (!buf) die("unable to read %s", sha1_to_hex(entry->sha1)); @@ -319,6 +399,9 @@ static unsigned long write_object(struct datalen = find_packed_object_size(p, entry->in_pack_offset); buf = (char *) p->pack_base + entry->in_pack_offset; + + if (revalidate_pack_entry(entry, buf, datalen)) + die("corrupt delta in pack %s", sha1_to_hex(entry->sha1)); sha1write(f, buf, datalen); unuse_packed_git(p); hdrlen = 0; /* not really */ @@ -1163,7 +1246,7 @@ static void prepare_pack(int window, int find_deltas(sorted_by_type, window+1, depth); } -static int reuse_cached_pack(unsigned char *sha1, int pack_to_stdout) +static int reuse_cached_pack(unsigned char *sha1) { static const char cache[] = "pack-cache/pack-%s.%s"; char *cached_pack, *cached_idx; @@ -1247,7 +1330,7 @@ int cmd_pack_objects(int argc, const cha { SHA_CTX ctx; char line[40 + 1 + PATH_MAX + 2]; - int depth = 10, pack_to_stdout = 0; + int depth = 10; struct object_entry **list; int num_preferred_base = 0; int i; @@ -1367,7 +1450,7 @@ int cmd_pack_objects(int argc, const cha if (progress && (nr_objects != nr_result)) fprintf(stderr, "Result has %d objects.\n", nr_result); - if (reuse_cached_pack(object_list_sha1, pack_to_stdout)) + if (reuse_cached_pack(object_list_sha1)) ; else { if (nr_result) -- VGER BF report: U 0.976258 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] pack-objects: re-validate data we copy from elsewhere. 2006-09-04 4:06 ` Junio C Hamano @ 2006-09-04 15:19 ` Linus Torvalds 0 siblings, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2006-09-04 15:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, 3 Sep 2006, Junio C Hamano wrote: > > So here is a patch against "master" which contains none of the > earlier patches in this thread. Ok, this looks good. It only adds two extra seconds for me (after fixing the patch: "unpack_object_header_gently()" was static to sha1_file.c so your patch wouldn't actually compile on its own), and it should be a lot less prone to problems. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2006-09-05 8:12 UTC | newest] Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <9e4733910608290943g6aa79855q62b98caf4f19510@mail.gmail.com> [not found] ` <20060829165811.GB21729@spearce.org> [not found] ` <9e4733910608291037k2d9fb791v18abc19bdddf5e89@mail.gmail.com> [not found] ` <20060829175819.GE21729@spearce.org> [not found] ` <9e4733910608291155g782953bbv5df1b74878f4fcf1@mail.gmail.com> [not found] ` <20060829190548.GK21729@spearce.org> [not found] ` <9e4733910608291252q130fc723r945e6ab906ca6969@mail.gmail.com> [not found] ` <20060829232007.GC22935@spearce.org> [not found] ` <9e4733910608291807q9b896e4sdbfaa9e49de58c2b@mail.gmail.com> 2006-08-30 1:51 ` Mozilla .git tree Shawn Pearce 2006-08-30 2:25 ` Shawn Pearce 2006-08-30 2:58 ` Jon Smirl 2006-08-30 3:10 ` Shawn Pearce 2006-08-30 3:27 ` Jon Smirl 2006-08-30 5:53 ` Nicolas Pitre 2006-08-30 11:42 ` Junio C Hamano 2006-09-01 7:42 ` Junio C Hamano 2006-09-02 1:19 ` Shawn Pearce 2006-09-02 4:01 ` Junio C Hamano 2006-09-02 4:39 ` Shawn Pearce 2006-09-02 11:06 ` Junio C Hamano 2006-09-02 14:20 ` Jon Smirl 2006-09-02 17:39 ` Shawn Pearce 2006-09-02 18:56 ` Linus Torvalds 2006-09-02 20:53 ` Junio C Hamano 2006-09-02 17:44 ` Shawn Pearce 2006-09-02 2:04 ` Shawn Pearce 2006-09-02 11:02 ` Junio C Hamano 2006-09-02 17:51 ` Shawn Pearce 2006-09-02 20:55 ` Junio C Hamano 2006-09-03 3:54 ` Shawn Pearce 2006-09-01 17:45 ` A Large Angry SCM 2006-09-01 18:35 ` Linus Torvalds 2006-09-01 19:56 ` Junio C Hamano 2006-09-01 23:14 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano 2006-09-02 0:23 ` Linus Torvalds 2006-09-02 1:39 ` VGER BF report? 
Johannes Schindelin 2006-09-02 5:58 ` Sam Ravnborg 2006-09-02 1:52 ` [PATCH] pack-objects: re-validate data we copy from elsewhere Junio C Hamano 2006-09-02 3:52 ` Junio C Hamano 2006-09-02 4:52 ` Shawn Pearce 2006-09-02 9:42 ` Junio C Hamano 2006-09-02 17:43 ` Linus Torvalds 2006-09-02 10:09 ` Junio C Hamano 2006-09-02 17:54 ` Shawn Pearce 2006-09-03 21:00 ` Junio C Hamano 2006-09-04 4:10 ` Shawn Pearce 2006-09-04 5:50 ` Junio C Hamano 2006-09-04 6:44 ` Shawn Pearce 2006-09-04 7:39 ` Junio C Hamano 2006-09-03 0:27 ` Linus Torvalds 2006-09-03 0:32 ` Junio C Hamano 2006-09-05 8:12 ` Junio C Hamano 2006-09-02 18:43 ` Linus Torvalds 2006-09-02 20:56 ` Junio C Hamano 2006-09-03 21:48 ` Junio C Hamano 2006-09-03 22:00 ` Linus Torvalds 2006-09-03 22:16 ` Linus Torvalds 2006-09-03 22:34 ` Junio C Hamano 2006-09-04 4:06 ` Junio C Hamano 2006-09-04 15:19 ` Linus Torvalds