* A look at some alternative PACK file encodings
@ 2006-09-06 21:47 A Large Angry SCM
  2006-09-06 23:23 ` Jon Smirl
       [not found] ` <9e4733910609061617m6783d6c4xaca2f9575e12d455@mail.gmail.com>
  0 siblings, 2 replies; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-06 21:47 UTC (permalink / raw)
  To: git

Since I'm currently between jobs and have extra time on my hands until I
find a new one, I've been experimenting with some alternative encodings
for PACK file contents.

The tables below show packing statistics for a number of actual pack
files using different pack file encodings. All of the encodings use
object names to find the base object of a DELTA object.

Encodings:
	Pack_2 - Pack version 2 (Git current), 32 bit
	Pack_3 - Pack version 3 (Git future), 32 bit
	Gitzilla_1 - Experimental >32 bit, described below
	Gitzilla_2 - Experimental >32 bit, described below

Gitzilla_1: Per object header consisting of 2 fields: the Git object
type of the object, with a flag indicating whether it's delta encoded,
and the big endian base 128 encoding of the object length. If the object
is not delta encoded, the object header is followed by the deflated
object contents. If the object is delta encoded, the object header is
followed by the 20 byte ID of the base object and the deflated edit list
(see below). This encoding allows for the possibility of using some of
the bits in the object type field to select one of up to 15
dictionaries.

Gitzilla_2: Per object header consisting of a single field combining the
object type and length, similar to that used in current packs but big
endian encoded. The type could be encoded either as the object type with
a delta flag (no room for additional object types), or the same as in
version 2 packs (room for additional object types, but the delta chain
must be walked to determine the type of delta-encoded objects). If the
object is delta encoded, the object header is followed by the 20 byte ID
of the base object and the deflated edit list (see below).

Notice that the length encoded in the per object header for
delta-encoded objects is that of the decoded object, not the length of
the inflated edit list.
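The big endian base 128 length encoding used in these headers, and why
it is cheaper to decode than a little endian variant, can be sketched as
follows (function names are hypothetical, not from the experiment):

```python
def encode_be_base128(n):
    """Big endian base 128: most significant 7-bit group first; the
    high bit of every byte except the last means 'more bytes follow'."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

def decode_be_base128(buf):
    """Decoding is a plain shift-and-or loop; the little endian form
    would also have to track a growing per-byte shift amount."""
    n = 0
    for i, b in enumerate(buf):
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):
            return n, i + 1          # (value, bytes consumed)
    raise ValueError("truncated base 128 value")
```

For example, 300 encodes to the two bytes 0x82 0x2c.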

The edit lists used in the Gitzilla encodings consist of a sequence of
one or more hunks. Each hunk specifies either a part of the base object
to copy to the current position of the result object, or a length of
(immediate) data to copy to the current position of the result object.

Each hunk is encoded as 2 fields: the first is the big endian base 128
encoded length of data to copy, with bit 6 of the first byte indicating
where to copy from. The second field is the big endian base 128 encoded
offset in the base object of the copy source, or the (immediate) data to
copy.
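One plausible reading of that hunk header layout, assuming bit 7 of each
byte is the usual continuation flag so the first byte carries only 6
value bits plus the copy flag (the post does not spell this out):

```python
def encode_hunk_header(length, copy_from_base):
    """Big endian base 128 length with bit 6 (0x40) of the first byte
    flagging 'copy from base object' rather than immediate data."""
    tail = []
    n = length
    while n > 0x3F:                  # leave at most 6 value bits up front
        tail.append(n & 0x7F)
        n >>= 7
    tail.reverse()
    first = n | (0x40 if copy_from_base else 0) | (0x80 if tail else 0)
    out = [first]
    for i, g in enumerate(tail):
        out.append(g | (0x80 if i < len(tail) - 1 else 0))
    return bytes(out)

def decode_hunk_header(buf):
    b = buf[0]
    copy_from_base = bool(b & 0x40)
    n = b & 0x3F
    i = 1
    while b & 0x80:                  # continuation bit set
        b = buf[i]
        n = (n << 7) | (b & 0x7F)
        i += 1
    return n, copy_from_base, i      # (length, flag, bytes consumed)
```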



Observations:

Most of the savings are from not including the base and result object
lengths in the deflated edit lists.

Other minor savings come from joining adjacent edit hunks of the same
type, where possible.

The base 128 encoding is big endian because it requires less work to
_decode_ than little endian.

TREE objects do not delta or deflate well.

Dictionaries will need to be tailored to different object types, and
possibly to other object subsets.

The ability to get the type and length of a delta'd object without
needing to walk the delta chain is a good thing.

The use of _relative_ offsets to specify base objects would save a lot
of space.



The data:

All of the PACK files except bd5109... were rsync'd from the Git project
on kernel.org, at some point. PACK bd5109... is a mash-up of Linus'
kernel and sparse projects, Thomas' kernel history, the stable 2.6.x.y
kernel projects, Git, and gitk; all from kernel.org.

The columns are:
        encoding name
        pack file size
        size of pack file non-object overhead
        sum of packed object sizes
        byte difference in sum of packed object sizes and that of
                version 2 packs
        average byte difference per object
        average byte difference per DELTA'd object



pack-1abe018fa97f8a257a4f3a072000abf99eaa925e.pack

BLOB  :         237
COMMIT:         504
DELTA :        1108
TAG   :           6
TREE  :          64
======   ==========
Total :        1919

Gitzilla_1:     1202917        32     1202885       -6797   -3.5419   -6.1345
Gitzilla_2:     1202011        32     1201979       -7703   -4.0141   -6.9522
Pack_2    :     1209714        32     1209682           0    0.0000    0.0000
Pack_3    :     1209686        32     1209654         -28   -0.0146   -0.0253


pack-3a080dc749b452cc847d19d91981ddfcff629750.pack

BLOB  :        1020
COMMIT:        5300
DELTA :       15013
TAG   :          80
TREE  :         564
======   ==========
Total :       21977

Gitzilla_1:     6748060        32     6748028      -95768   -4.3576   -6.3790
Gitzilla_2:     6737992        32     6737960     -105836   -4.8158   -7.0496
Pack_2    :     6843828        32     6843796           0    0.0000    0.0000
Pack_3    :     6843747        32     6843715         -81   -0.0037   -0.0054


pack-3e6247e4edf3c006a037041eef28bbca8051b006.pack

BLOB  :        1062
COMMIT:        5545
DELTA :       15565
TAG   :          81
TREE  :         703
======   ==========
Total :       22956

Gitzilla_1:     6724940        32     6724908      -97643   -4.2535   -6.2732
Gitzilla_2:     6714303        32     6714271     -108280   -4.7168   -6.9566
Pack_2    :     6822583        32     6822551           0    0.0000    0.0000
Pack_3    :     6822458        32     6822426        -125   -0.0054   -0.0080


pack-6bdf691abdae3590f6a807d15ebe8ed523ef1e12.pack

BLOB  :          33
COMMIT:          36
DELTA :          64
TAG   :           0
TREE  :           4
======   ==========
Total :         137

Gitzilla_1:      136041        32      136009        -273   -1.9927   -4.2656
Gitzilla_2:      135991        32      135959        -323   -2.3577   -5.0469
Pack_2    :      136314        32      136282           0    0.0000    0.0000
Pack_3    :      136314        32      136282           0    0.0000    0.0000


pack-740c99c0be6734adbd130737dec2608dc4682761.pack

BLOB  :         181
COMMIT:         173
DELTA :         383
TAG   :           5
TREE  :          24
======   ==========
Total :         766

Gitzilla_1:      586580        32      586548       -1799   -2.3486   -4.6971
Gitzilla_2:      586264        32      586232       -2115   -2.7611   -5.5222
Pack_2    :      588379        32      588347           0    0.0000    0.0000
Pack_3    :      588370        32      588338          -9   -0.0117   -0.0235


pack-793a9e93286d6c656941977d2e5b49e28566edcd.pack

BLOB  :        1158
COMMIT:        6512
DELTA :       18715
TAG   :          90
TREE  :         820
======   ==========
Total :       27295

Gitzilla_1:     7957552        32     7957520     -120209   -4.4041   -6.4231
Gitzilla_2:     7944784        32     7944752     -132977   -4.8718   -7.1054
Pack_2    :     8077761        32     8077729           0    0.0000    0.0000
Pack_3    :     8077566        32     8077534        -195   -0.0071   -0.0104


pack-92b2a535e63e4a97eba2a13f60c3da922372f03a.pack

BLOB  :        1149
COMMIT:        5969
DELTA :       16930
TAG   :          84
TREE  :         750
======   ==========
Total :       24882

Gitzilla_1:     7243503        32     7243471     -108475   -4.3596   -6.4073
Gitzilla_2:     7231901        32     7231869     -120077   -4.8259   -7.0926
Pack_2    :     7351978        32     7351946           0    0.0000    0.0000
Pack_3    :     7351841        32     7351809        -137   -0.0055   -0.0081


pack-9f1f94e6f50261b5d99c04349bfa3d315cfb4118.pack

BLOB  :        1190
COMMIT:        6244
DELTA :       17881
TAG   :          87
TREE  :         810
======   ==========
Total :       26212

Gitzilla_1:     7920476        32     7920444     -115214   -4.3955   -6.4434
Gitzilla_2:     7908256        32     7908224     -127434   -4.8617   -7.1268
Pack_2    :     8035690        32     8035658           0    0.0000    0.0000
Pack_3    :     8035529        32     8035497        -161   -0.0061   -0.0090


pack-a875e0c11c739866b000dc9b34d1b0daead84a00.pack

BLOB  :          20
COMMIT:           9
DELTA :          15
TAG   :           1
TREE  :           3
======   ==========
Total :          48

Gitzilla_1:       71748        32       71716         -54   -1.1250   -3.6000
Gitzilla_2:       71728        32       71696         -74   -1.5417   -4.9333
Pack_2    :       71802        32       71770           0    0.0000    0.0000
Pack_3    :       71802        32       71770           0    0.0000    0.0000


pack-abbe9cdea12803663015a7e23b7ae61c64e56499.pack

BLOB  :         259
COMMIT:         342
DELTA :         705
TAG   :          13
TREE  :          47
======   ==========
Total :        1366

Gitzilla_1:     1086337        32     1086305       -3878   -2.8389   -5.5007
Gitzilla_2:     1085773        32     1085741       -4442   -3.2518   -6.3007
Pack_2    :     1090215        32     1090183           0    0.0000    0.0000
Pack_3    :     1090215        32     1090183           0    0.0000    0.0000


pack-bd5109ab74dd67d5b0a42f26688a4fa0a579e9fa.pack

BLOB  :       29212
COMMIT:      103241
DELTA :      702986
TAG   :         182
TREE  :       37102
======   ==========
Total :      872723

Gitzilla_1:   282786611        32   282786579    -3647429   -4.1794   -5.1885
Gitzilla_2:   282143551        32   282143519    -4290489   -4.9162   -6.1032
Pack_2    :   286434040        32   286434008           0    0.0000    0.0000
Pack_3    :   286404302        32   286404270      -29738   -0.0341   -0.0423


pack-d18e6cccf9e472ba05e108254c827cc7ac0fe2a9.pack

BLOB  :        1564
COMMIT:        3118
DELTA :        8288
TAG   :          51
TREE  :         365
======   ==========
Total :       13386

Gitzilla_1:     5638236        32     5638204      -44212   -3.3029   -5.3345
Gitzilla_2:     5632126        32     5632094      -50322   -3.7593   -6.0717
Pack_2    :     5682448        32     5682416           0    0.0000    0.0000
Pack_3    :     5682423        32     5682391         -25   -0.0019   -0.0030


pack-d25faec39a55935d787ae66e4f4a69d3add16417.pack

BLOB  :        1180
COMMIT:        6090
DELTA :       17458
TAG   :          85
TREE  :         761
======   ==========
Total :       25574

Gitzilla_1:     7646842        32     7646810     -112816   -4.4114   -6.4621
Gitzilla_2:     7634964        32     7634932     -124694   -4.8758   -7.1425
Pack_2    :     7759658        32     7759626           0    0.0000    0.0000
Pack_3    :     7759521        32     7759489        -137   -0.0054   -0.0078


pack-d453bb50fef8757a0d7f9fcc88429e9ce6f74077.pack

BLOB  :          28
COMMIT:          35
DELTA :          46
TAG   :           2
TREE  :           6
======   ==========
Total :         117

Gitzilla_1:      177544        32      177512        -266   -2.2735   -5.7826
Gitzilla_2:      177478        32      177446        -332   -2.8376   -7.2174
Pack_2    :      177810        32      177778           0    0.0000    0.0000
Pack_3    :      177806        32      177774          -4   -0.0342   -0.0870


pack-d83305b8608cb685ecd025fb141bf9d1588ae9c5.pack

BLOB  :          91
COMMIT:         108
DELTA :         202
TAG   :           2
TREE  :          26
======   ==========
Total :         429

Gitzilla_1:      473503        32      473471       -1112   -2.5921   -5.5050
Gitzilla_2:      473269        32      473237       -1346   -3.1375   -6.6634
Pack_2    :      474615        32      474583           0    0.0000    0.0000
Pack_3    :      474603        32      474571         -12   -0.0280   -0.0594


pack-ee22eefd6bad2b54358d52ad6221ecabdc79d6f5.pack

BLOB  :         175
COMMIT:         210
DELTA :         541
TAG   :           2
TREE  :          38
======   ==========
Total :         966

Gitzilla_1:      662650        32      662618       -3230   -3.3437   -5.9704
Gitzilla_2:      662150        32      662118       -3730   -3.8613   -6.8946
Pack_2    :      665880        32      665848           0    0.0000    0.0000
Pack_3    :      665865        32      665833         -15   -0.0155   -0.0277


pack-f0c530478b9dd6e96a69482066ec66eb5f6edbbf.pack

BLOB  :        1154
COMMIT:        5797
DELTA :       16300
TAG   :          83
TREE  :         739
======   ==========
Total :       24073

Gitzilla_1:     7212953        32     7212921     -102252   -4.2476   -6.2731
Gitzilla_2:     7201730        32     7201698     -113475   -4.7138   -6.9617
Pack_2    :     7315205        32     7315173           0    0.0000    0.0000
Pack_3    :     7315059        32     7315027        -146   -0.0061   -0.0090

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-06 21:47 A look at some alternative PACK file encodings A Large Angry SCM
@ 2006-09-06 23:23 ` Jon Smirl
  2006-09-06 23:39   ` A Large Angry SCM
                     ` (2 more replies)
       [not found] ` <9e4733910609061617m6783d6c4xaca2f9575e12d455@mail.gmail.com>
  1 sibling, 3 replies; 37+ messages in thread
From: Jon Smirl @ 2006-09-06 23:23 UTC (permalink / raw)
  To: gitzilla; +Cc: git

On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> TREE objects do not delta or deflate well.

I can understand why they don't deflate, the path names are pretty
much unique and the sha1s are incompressible. But why don't they delta
well? Does sorting them by size mess up the delta process?

Shawn is doing some prototype work on true dictionary based
compression. I don't know how far along he is but it has potential for
taking 30% off the Mozilla pack.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:23 ` Jon Smirl
@ 2006-09-06 23:39   ` A Large Angry SCM
  2006-09-06 23:56     ` Linus Torvalds
                       ` (2 more replies)
  2006-09-07  0:40   ` Nicolas Pitre
  2006-09-07  5:21   ` Shawn Pearce
  2 siblings, 3 replies; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-06 23:39 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Jon Smirl wrote:
> On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
>> TREE objects do not delta or deflate well.
> 
> I can understand why they don't deflate, the path names are pretty
> much unique and the sha1s are incompressible. But why don't they delta
> well? Does sorting them by size mess up the delta process?

My guess would be the TREEs would only delta well against other TREE
versions for the same path.

> Shawn is doing some prototype work on true dictionary based
> compression. I don't know how far along he is but it has potential for
> taking 30% off the Mozilla pack.

Just looking at the structures in non-BLOBs, I see a lot of potential
for the use of a set of dictionaries when deflating TREEs and another
set of dictionaries when deflating COMMITs and TAGs. The low hanging
fruit is to create dictionaries of the most referenced IDs across all
TREE or COMMIT/TAG objects.
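That low hanging fruit can be sketched with zlib preset dictionaries
(the IDs and helper name below are invented for illustration): rank the
referenced IDs by frequency, concatenate them into a dictionary, and
prime deflate with it.

```python
import hashlib
import zlib
from collections import Counter

def build_id_dictionary(tree_refs, max_size=32768):
    """Concatenate the most frequently referenced object IDs into a
    zlib preset dictionary.  zlib favors matches near the *end* of the
    dictionary, so the most frequent IDs go last."""
    counts = Counter(ref for refs in tree_refs for ref in refs)
    ranked = [ref for ref, _ in counts.most_common()]
    ranked.reverse()
    return b"".join(ranked)[-max_size:]

# hypothetical 20-byte IDs standing in for the sha1s inside TREE objects
ids = [hashlib.sha1(bytes([i])).digest() for i in range(4)]
trees = [[ids[0], ids[1]], [ids[0], ids[2]], [ids[0], ids[1], ids[3]]]
zdict = build_id_dictionary(trees)

payload = b"".join(trees[2])          # raw contents of one "TREE"
plain = zlib.compress(payload)
co = zlib.compressobj(zdict=zdict)
primed = co.compress(payload) + co.flush()

# deflate can now back-reference the shared IDs instead of emitting
# them as incompressible literals
assert len(primed) < len(plain)
```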


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:39   ` A Large Angry SCM
@ 2006-09-06 23:56     ` Linus Torvalds
  2006-09-07  0:10       ` Jon Smirl
                         ` (2 more replies)
  2006-09-07  0:04     ` Jon Smirl
  2006-09-07  5:34     ` Shawn Pearce
  2 siblings, 3 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-09-06 23:56 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: Jon Smirl, git



On Wed, 6 Sep 2006, A Large Angry SCM wrote:

> Jon Smirl wrote:
> > On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> >> TREE objects do not delta or deflate well.
> > 
> > I can understand why they don't deflate, the path names are pretty
> > much unique and the sha1s are incompressible. But why don't they delta
> > well? Does sorting them by size mess up the delta process?
> 
> My guess would be the TREEs would only delta well against other TREE
> versions for the same path.

That's what you'd normally have in a real project, though. I wonder if 
your "pack mashup" lost the normal behaviour: we very much sort trees 
together normally, thanks to the "sort-by-filename, then by size" 
behaviour that git-pack-objects should have (for trees, the size normally 
shouldn't change, so the sorting should basically boil down to "sort the 
same directory together, keeping the ordering it had from git-rev-list").

Btw, that "keeping the ordering it had" part I'm not convinced we actually 
enforce. That would depend on the sort algorithm used by "qsort()", I 
think. So there might be room for improvement there in order to keep 
things in recency order.

> Just looking at the structures in non-BLOBs, I see a lot of potential
> for the use of a set of dictionaries when deflating TREEs and another
> set of dictionaries when deflating COMMITs and TAGs. The low hanging
> fruit is to create dictionaries of the most referenced IDs across all
> TREE or COMMIT/TAG objects.

Is there any way to get zlib to just generate a suggested dictionary from 
a given set of input?

		Linus


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:39   ` A Large Angry SCM
  2006-09-06 23:56     ` Linus Torvalds
@ 2006-09-07  0:04     ` Jon Smirl
  2006-09-07  5:41       ` Shawn Pearce
  2006-09-07  5:34     ` Shawn Pearce
  2 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-09-07  0:04 UTC (permalink / raw)
  To: gitzilla; +Cc: git

On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Jon Smirl wrote:
> > On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
> >> TREE objects do not delta or deflate well.
> >
> > I can understand why they don't deflate, the path names are pretty
> > much unique and the sha1s are incompressible. But why don't they delta
> > well? Does sorting them by size mess up the delta process?
>
> My guess would be the TREEs would only delta well against other TREE
> versions for the same path.

I would have thought that the TREE delta code would have already taken
this into account. Is there enough info around to match up all of the
TREEs for a specific path into a delta chain?

The repack code could build a model of the tree as it is repacking;
that is what fast-import does. If you have a model of the tree, then
when you change a TREE node you track the last sha1 that corresponded
to that directory path. Now you know what to diff against.
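A minimal sketch of that bookkeeping (names hypothetical): the repacker
keeps a map from directory path to the sha1 of the last TREE seen there,
which becomes the candidate delta base for the next version.

```python
# while repacking, remember the previous TREE object seen at each
# directory path; that previous version is the natural delta base
last_tree = {}                       # path -> sha1 of last TREE there

def tree_delta_base(path, sha1):
    """Return the sha1 to delta against (None for the first version
    of this path), and record this version as the next base."""
    base = last_tree.get(path)
    last_tree[path] = sha1
    return base
```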

> > Shawn is doing some prototype work on true dictionary based
> > compression. I don't know how far along he is but it has potential for
> > taking 30% off the Mozilla pack.
>
> Just looking at the structures in non-BLOBs, I see a lot of potential
> for the use of a set of dictionaries when deflating TREEs and another
> set of dictionaries when deflating COMMITs and TAGs. The low hanging
> fruit is to create dictionaries of the most referenced IDs across all
> TREE or COMMIT/TAG objects.

Doing this will get 4-7% according to our small tests; to get more you
need a different type of compression algorithm. For true dictionary
based compression you parse all of the input documents into words. The
words are put into a single giant huffman table, and the huffman codes
for the words are concatenated together to form the compressed document.
This is one of the highest possible compression methods since it takes
into account that you are compressing something human readable. gzip is
sort of like this, but the dictionary is limited to 32KB.
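A toy version of the scheme described above, with whitespace
tokenization standing in for a real parser:

```python
import heapq
from collections import Counter

def build_codes(docs):
    """One Huffman code over the combined word vocabulary of all input
    documents; frequent words get short codes."""
    freq = Counter(w for d in docs for w in d.split())
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    heap = [(n, i, w) for i, (w, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    seq = len(heap)                  # tie-break so tuples never compare
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, seq, (a, b)))
        seq += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a word
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def compress(doc, codes):
    """The 'compressed document' is just the concatenated codes."""
    return "".join(codes[w] for w in doc.split())
```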

Search engines use true dictionary compression since it is easy to add
search indices on top of the compressed data. For a search index you
make a bit vector corresponding to each word in the dictionary and set a
bit if that document number contains the word. The vectors and
dictionary are stored compressed. There are algorithms for doing and/or
on the compressed vectors. The and/or stage lets you identify likely
hits; you retrieve those and then do an exact match on the search terms
to be sure.
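The bit-vector index just described, in miniature (the corpus is made
up, and the vectors are kept uncompressed for clarity):

```python
# one bit vector per dictionary word, bit i set iff document i
# contains that word
docs = ["git pack delta", "delta compression dictionary", "git dictionary"]
vocab = {w for d in docs for w in d.split()}
vectors = {w: sum(1 << i for i, d in enumerate(docs) if w in d.split())
           for w in vocab}

def candidates(words):
    """AND the per-word vectors to find *likely* hits; callers still
    re-check the actual documents for exact matches."""
    v = ~0
    for w in words:
        v &= vectors.get(w, 0)
    return [i for i in range(len(docs)) if v >> i & 1]

likely = candidates(["git", "dictionary"])
# exact-match verification pass over the likely hits
hits = [i for i in likely
        if all(w in docs[i].split() for w in ["git", "dictionary"])]
```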

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
  2006-09-07  0:10       ` Jon Smirl
@ 2006-09-07  0:06         ` David Lang
  0 siblings, 0 replies; 37+ messages in thread
From: David Lang @ 2006-09-07  0:06 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Linus Torvalds, A Large Angry SCM, git

On Wed, 6 Sep 2006, Jon Smirl wrote:

> On 9/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> Is there any way to get zlib to just generate a suggested dictionary from
>> a given set of input?
>
> No, I asked the author. Apparently it is a hard problem, there have
> been research papers written about it.
>
> Shawn has a Perl script that makes a guess at a dictionary. That
> script gets 4-7% improvement. The number one thing that ended up in
> the Mozilla dictionary was the five different license versions that
> had each been copied into 50,000 files over time.

For the mozilla project it may make sense to feed all these license
files from all over as one string to git, as an exception to your normal
process of going file by file. If you can do this, then the delta
functionality should reduce these files to practically nothing.

David Lang


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:56     ` Linus Torvalds
@ 2006-09-07  0:10       ` Jon Smirl
  2006-09-07  0:06         ` David Lang
  2006-09-07  0:19       ` A Large Angry SCM
  2006-09-07  0:37       ` Nicolas Pitre
  2 siblings, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-09-07  0:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: A Large Angry SCM, git

On 9/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Is there any way to get zlib to just generate a suggested dictionary from
> a given set of input?

No, I asked the author. Apparently it is a hard problem, there have
been research papers written about it.

Shawn has a Perl script that makes a guess at a dictionary. That
script gets 4-7% improvement. The number one thing that ended up in
the Mozilla dictionary was the five different license versions that
had each been copied into 50,000 files over time.

I think what happens in practice is that when people get to the point
of trying to optimize zlib dictionaries they switch compression
methods.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:56     ` Linus Torvalds
  2006-09-07  0:10       ` Jon Smirl
@ 2006-09-07  0:19       ` A Large Angry SCM
  2006-09-07  0:45         ` Linus Torvalds
  2006-09-07  0:37       ` Nicolas Pitre
  2 siblings, 1 reply; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-07  0:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git

Linus Torvalds wrote:
> 
> On Wed, 6 Sep 2006, A Large Angry SCM wrote:
> 
>> Jon Smirl wrote:
>>> On 9/6/06, A Large Angry SCM <gitzilla@gmail.com> wrote:
>>>> TREE objects do not delta or deflate well.
>>> I can understand why they don't deflate, the path names are pretty
>>> much unique and the sha1s are incompressible. But why don't they delta
>>> well? Does sorting them by size mess up the delta process?
>> My guess would be the TREEs would only delta well against other TREE
>> versions for the same path.
> 
> That's what you'd normally have in a real project, though. I wonder if 
> your "pack mashup" lost the normal behaviour: we very much sort trees 
> together normally, thanks to the "sort-by-filename, then by size" 
> behaviour that git-pack-objects should have (for trees, the size normally 
> shouldn't change, so the sorting should basically boil down to "sort the 
> same directory together, keeping the ordering it had from git-rev-list").

The mashup is just all the projects in a single repository with a bushy
refs tree so I can view the updates in a single gitk window.

The sorting by name, then by path may be breaking the object version
relationship for wide graphs.

> Btw, that "keeping the ordering it had" part I'm not convinced we actually 
> enforce. That would depend on the sort algorithm used by "qsort()", I 
> think. So there might be room for improvement there in order to keep 
> things in recency order.

qsort() is not stable.

>> Just looking at the structures in non-BLOBs, I see a lot of potential
>> for the use of a set of dictionaries when deflating TREEs and another
>> set of dictionaries when deflating COMMITs and TAGs. The low hanging
>> fruit is to create dictionaries of the most referenced IDs across all
>> TREE or COMMIT/TAG objects.
>
> Is there any way to get zlib to just generate a suggested dictionary from 
> a given set of input?

The docs suggest "no".


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:56     ` Linus Torvalds
  2006-09-07  0:10       ` Jon Smirl
  2006-09-07  0:19       ` A Large Angry SCM
@ 2006-09-07  0:37       ` Nicolas Pitre
  2 siblings, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07  0:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: A Large Angry SCM, Jon Smirl, git

On Wed, 6 Sep 2006, Linus Torvalds wrote:


* Re: A look at some alternative PACK file encodings
  2006-09-06 23:23 ` Jon Smirl
  2006-09-06 23:39   ` A Large Angry SCM
@ 2006-09-07  0:40   ` Nicolas Pitre
  2006-09-07  0:59     ` Jon Smirl
                       ` (2 more replies)
  2006-09-07  5:21   ` Shawn Pearce
  2 siblings, 3 replies; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07  0:40 UTC (permalink / raw)
  To: Jon Smirl; +Cc: gitzilla, git

On Wed, 6 Sep 2006, Jon Smirl wrote:

> Shawn is doing some prototype work on true dictionary based
> compression. I don't know how far along he is but it has potential for
> taking 30% off the Mozilla pack.

BTW I'm half-way done with support for deltas whose base object is
referenced with an offset in the pack instead of a hash.  It is quite
messy though since it touches pretty core code all over the place when
it comes to fetching objects out of a pack.


Nicolas


* Re: A look at some alternative PACK file encodings
  2006-09-07  0:19       ` A Large Angry SCM
@ 2006-09-07  0:45         ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2006-09-07  0:45 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: Jon Smirl, git



On Wed, 6 Sep 2006, A Large Angry SCM wrote:
> 
> > Btw, that "keeping the ordering it had" part I'm not convinced we actually 
> > enforce. That would depend on the sort algorithm used by "qsort()", I 
> > think. So there might be room for improvement there in order to keep 
> > things in recency order.
> 
> qsort() is not stable.

Well, quicksort isn't, but a lot of systems don't actually use quicksort 
to implement qsort() - despite the name.

I think, for example, that glibc uses a merge-sort when there are lots of 
entries (to avoid any potential bad behaviour with quicksort), and that 
should be stable. But I didn't actually check..

In fact, I didn't even think of this issue originally, but it might 
actually be worthwhile doing our own sort, exactly because we'll otherwise 
have different systems generating different pack-files due to issues like 
this.

So even if we don't actually _care_ whether the sorting is stable or not, 
we might well care that we always get the same results, regardless of what 
C library we have.
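One way to get identical results on every C library is to make the
comparison itself deterministic: append the input (recency) position as
a final tie-break, so equal keys always come out in input order and the
stability of the underlying sort stops mattering. A sketch of the idea,
not git's actual code:

```python
def deterministic_order(objects, key):
    """Sort with the input (recency) position as the final tie-break.
    Equal keys then always come out in input order, whatever sort the
    platform's C library would have used."""
    decorated = sorted((key(o), i) for i, o in enumerate(objects))
    return [objects[i] for _, i in decorated]

# hypothetical (name, recency) pairs standing in for pack entries
entries = [("a.c", "newest"), ("b.c", "only"), ("a.c", "older")]
ordered = deterministic_order(entries, key=lambda e: e[0])
```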

> > Is there any way to get zlib to just generate a suggested dictionary from 
> > a given set of input?
> 
> The docs suggest "no".

Oh, well.

			Linus


* Re: A look at some alternative PACK file encodings
  2006-09-07  0:40   ` Nicolas Pitre
@ 2006-09-07  0:59     ` Jon Smirl
  2006-09-07  2:30       ` Nicolas Pitre
  2006-09-07  2:33       ` A Large Angry SCM
  2006-09-07  1:11     ` Junio C Hamano
  2006-09-07  4:33     ` Shawn Pearce
  2 siblings, 2 replies; 37+ messages in thread
From: Jon Smirl @ 2006-09-07  0:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: gitzilla, git

On 9/6/06, Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 6 Sep 2006, Jon Smirl wrote:
>
> > Shawn is doing some prototype work on true dictionary based
> > compression. I don't know how far along he is but it has potential for
> > taking 30% off the Mozilla pack.
>
> BTW I'm half-way done with support for deltas whose base object is
> referenced with an offset in the pack instead of a hash.  It is quite
> messy though since it touches pretty core code all over the place when
> it comes to fetching objects out of a pack.

Would it help to change all of the references in the pack from sha1 to
encoded relative offsets? Then collect all of the object fetch code into
a single subroutine which would change its algorithm depending on which
type of pack it is operating on. Now the pack wouldn't mix reference
types; they would all be encoded relative or sha1.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
  2006-09-07  0:40   ` Nicolas Pitre
  2006-09-07  0:59     ` Jon Smirl
@ 2006-09-07  1:11     ` Junio C Hamano
  2006-09-07  2:47       ` Nicolas Pitre
  2006-09-07  4:33     ` Shawn Pearce
  2 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-09-07  1:11 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 6 Sep 2006, Jon Smirl wrote:
>
>> Shawn is doing some prototype work on true dictionary based
>> compression. I don't know how far along he is but it has potential for
>> taking 30% off the Mozilla pack.
>
> BTW I'm half-way done with support for deltas whose base object is
> referenced with an offset in the pack instead of a hash.  It is quite
> messy though since it touches pretty core code all over the place when
> it comes to fetching objects out of a pack.

I wonder how you made unpack-objects to work with it.  Do you
store an extra bit in the packed stream to say "this object will
be used as a delta base for later objects so remember its offset
and resulting SHA1", or do you just remember every entry as you
unpack?


* Re: A look at some alternative PACK file encodings
  2006-09-07  0:59     ` Jon Smirl
@ 2006-09-07  2:30       ` Nicolas Pitre
  2006-09-07  2:33       ` A Large Angry SCM
  1 sibling, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07  2:30 UTC (permalink / raw)
  To: Jon Smirl; +Cc: gitzilla, git

On Wed, 6 Sep 2006, Jon Smirl wrote:

> On 9/6/06, Nicolas Pitre <nico@cam.org> wrote:
> > On Wed, 6 Sep 2006, Jon Smirl wrote:
> >
> > > Shawn is doing some prototype work on true dictionary based
> > > compression. I don't know how far along he is but it has potential for
> > > taking 30% off the Mozilla pack.
> >
> > BTW I'm half-way done with support for deltas whose base object is
> > referenced with an offset in the pack instead of a hash.  It is quite
> > messy though since it touches pretty core code all over the place when
> > it comes to fetching objects out of a pack.
> 
> Would it help to change all of the references in the pack from sha1 to
> encoded relative offsets? Then collect all of the object fetch code into
> a single subroutine which would change its algorithm depending on which
> type of pack it is operating on. Now the pack wouldn't mix reference
> types; they would all be encoded relative or sha1.

No.  The problem is more about sha1_file.c using a mix of sha1 refs
and/or pack,offset tuples to reference deltas.  I converted most of them
to pack,offset tuples.  This also has a nice side effect of less stack
usage when recursing down a delta chain.

When a delta with sha1 reference is encountered the code turns that 
reference into the appropriate offset with a binary search in the pack 
index.  When a delta with offset reference is encountered then no search 
in the index is needed.
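The two cases can be sketched roughly as follows. This is a hedged illustration, not git's actual sha1_file.c code; `idx_entry` and `find_pack_offset` are made-up names:

```c
#include <stdint.h>
#include <string.h>

/*
 * A delta that names its base by sha1 needs a binary search through
 * the sorted pack index to find the base's offset; a delta that names
 * its base by offset can seek there directly, skipping this entirely.
 */
struct idx_entry {
	uint32_t offset;          /* offset of the object in the .pack */
	unsigned char sha1[20];   /* object name; entries sorted by sha1 */
};

static long find_pack_offset(const struct idx_entry *idx, long nr,
			     const unsigned char *sha1)
{
	long lo = 0, hi = nr;

	while (lo < hi) {
		long mi = lo + (hi - lo) / 2;
		int cmp = memcmp(idx[mi].sha1, sha1, 20);

		if (!cmp)
			return idx[mi].offset;  /* base found: seek here */
		if (cmp < 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1;  /* not in this pack (e.g. a thin pack's external base) */
}
```

An offset-referenced delta never calls the search at all, which is where the index-lookup saving comes from.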

This is not really complicated, just spread all over, so the 
diffstat is a bit scary.


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  0:59     ` Jon Smirl
  2006-09-07  2:30       ` Nicolas Pitre
@ 2006-09-07  2:33       ` A Large Angry SCM
  1 sibling, 0 replies; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-07  2:33 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Nicolas Pitre, git

Jon Smirl wrote:
> On 9/6/06, Nicolas Pitre <nico@cam.org> wrote:
>> On Wed, 6 Sep 2006, Jon Smirl wrote:
>>
>> > Shawn is doing some prototype work on true dictionary based
>> > compression. I don't know how far along he is but it has potential for
>> > taking 30% off the Mozilla pack.
>>
>> BTW I'm half-way done with support for deltas whose base object is
>> referenced by an offset in the pack instead of a hash.  It is quite
>> messy though since it touches pretty core code all over the place when
>> it comes to fetching objects out of a pack.
> 
> Would it help to change all of the references in the pack from sha1 to
> encoded relative offsets? Then collect all of the object fetch code
> into a single subroutine which would change its algorithm depending on
> which type of pack it is operating on. Now the pack wouldn't mix
> reference types; they would all be encoded relative or sha1.
> 

Support for 'thin' packs would pretty much require mixing IDs and
(relative) offsets in the same pack file.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  1:11     ` Junio C Hamano
@ 2006-09-07  2:47       ` Nicolas Pitre
  0 siblings, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07  2:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 6 Sep 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > BTW I'm half-way done with support for deltas whose base object is
> > referenced by an offset in the pack instead of a hash.  It is quite
> > messy though since it touches pretty core code all over the place when 
> > it comes to fetching objects out of a pack.
> 
> I wonder how you made unpack-objects to work with it.  Do you
> store an extra bit in the packed stream to say "this object will
> be used as a delta base for later objects so remember its offset
> and resulting SHA1", or do you just remember every entry as you
> unpack?

Every entry.  Much simpler that way.

And the client must ask explicitly whether or not a fetched pack may 
have base objects referenced by offset as well as by sha1.  So if 
you're really, really tight on memory (e.g. if you cannot afford the 
extra 50MB of RAM or so for the Mozilla repo) then this could be 
turned off.

But if we decide to not explode packs upon a fetch but keep the 
downloaded pack as is (after validation) then index generation will come 
for free.


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  0:40   ` Nicolas Pitre
  2006-09-07  0:59     ` Jon Smirl
  2006-09-07  1:11     ` Junio C Hamano
@ 2006-09-07  4:33     ` Shawn Pearce
  2006-09-07  5:27       ` Junio C Hamano
  2 siblings, 1 reply; 37+ messages in thread
From: Shawn Pearce @ 2006-09-07  4:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jon Smirl, gitzilla, git

Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 6 Sep 2006, Jon Smirl wrote:
> 
> > Shawn is doing some prototype work on true dictionary based
> > compression. I don't know how far along he is but it has potential for
> > taking 30% off the Mozilla pack.
> 
> BTW I'm half-way done with support for deltas whose base object is
> referenced by an offset in the pack instead of a hash.  It is quite
> messy though since it touches pretty core code all over the place when 
> it comes to fetching objects out of a pack.

And I'm half-way done with the 64 bit mmap sliding window.  You,
Junio and I are all hacking on the same tiny bit of core code for
different reasons.  :-)

Right now I'm struggling with merging my 32 bit sliding window onto
Junio's 64 bit index in pu.  The merge is almost done, but so is
my dinner.  So dinner first.  I don't think I'm going to get the
sliding window code out tonight.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-06 23:23 ` Jon Smirl
  2006-09-06 23:39   ` A Large Angry SCM
  2006-09-07  0:40   ` Nicolas Pitre
@ 2006-09-07  5:21   ` Shawn Pearce
  2 siblings, 0 replies; 37+ messages in thread
From: Shawn Pearce @ 2006-09-07  5:21 UTC (permalink / raw)
  To: Jon Smirl; +Cc: gitzilla, git

Jon Smirl <jonsmirl@gmail.com> wrote:
> Shawn is doing some prototype work on true dictionary based
> compression. I don't know how far along he is but it has potential for
> taking 30% off the Mozilla pack.

Not very far.  I've put some work into it but got distracted by the
sliding mmap window bit.  So I don't have anything worth sharing yet.
Which is why I've been quiet on the subject.  :-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  4:33     ` Shawn Pearce
@ 2006-09-07  5:27       ` Junio C Hamano
  2006-09-07  5:46         ` Shawn Pearce
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2006-09-07  5:27 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> And I'm half-way done with the 64 bit mmap sliding window.  You,
> Junio and I are all hacking on the same tiny bit of core code for
> different reasons.  :-)

Which makes me quite nervous, actually.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-06 23:39   ` A Large Angry SCM
  2006-09-06 23:56     ` Linus Torvalds
  2006-09-07  0:04     ` Jon Smirl
@ 2006-09-07  5:34     ` Shawn Pearce
  2 siblings, 0 replies; 37+ messages in thread
From: Shawn Pearce @ 2006-09-07  5:34 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: Jon Smirl, git

A Large Angry SCM <gitzilla@gmail.com> wrote:
> Just looking at the structures in non-BLOBs, I see a lot of potential
> for the use of a set of dictionaries when deflating TREEs and another
> set of dictionaries when deflating COMMITs and TAGs. The low hanging fruit
> is to create dictionaries of the most referenced IDs across all TREE or
> COMMIT/TAG objects.

The most referenced IDs should be getting reused through deltas.
That is, IDs which are highly referenced are probably referenced
in the same tree over many versions of that tree.  Since the data
isn't changing, it should be getting copied by a delta copy command
rather than appearing as a literal.

The Mozilla pack appears to have the bulk of its storage taken up
by blobs (both bases and deltas).  I suspect this is because the
bases compress to approx. 50% of their original size but share a
lot of common tokens.  Those common tokens are being repeated in
every private zlib dictionary.  The same thing happens in a delta,
except here we are probably copying a lot from the base so the
average size is greatly reduced but we are still repeating tokens
in the zlib dictionary for anything that is a literal in the delta
(as it didn't appear in the base).
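For reference, the copy-vs-literal distinction above looks roughly like this when applying a delta. This is a simplified sketch of git's delta command format; the leading base-size and result-size varints of a real delta are omitted, and the function name is illustrative:

```c
#include <stdint.h>
#include <string.h>

/*
 * A command byte with the msb set copies (offset, size) bytes from the
 * base object; with the msb clear it inserts the next 1..127 literal
 * bytes taken from the delta data itself.
 */
static size_t apply_delta(const unsigned char *delta, size_t dlen,
			  const unsigned char *base, unsigned char *out)
{
	const unsigned char *p = delta, *end = delta + dlen;
	unsigned char *o = out;

	while (p < end) {
		unsigned char cmd = *p++;

		if (cmd & 0x80) {
			/* copy command: bits 0-3 select offset bytes,
			   bits 4-6 select size bytes (little-endian) */
			uint64_t off = 0, size = 0;
			int i;

			for (i = 0; i < 4; i++)
				if (cmd & (1 << i))
					off |= (uint64_t)*p++ << (8 * i);
			for (i = 0; i < 3; i++)
				if (cmd & (0x10 << i))
					size |= (uint64_t)*p++ << (8 * i);
			if (!size)
				size = 0x10000;  /* size 0 means 64 KiB */
			memcpy(o, base + off, size);
			o += size;
		} else {
			/* insert command: cmd literal bytes follow */
			memcpy(o, p, cmd);
			p += cmd;
			o += cmd;
		}
	}
	return o - out;  /* length of the reconstructed object */
}
```

The literal inserts are exactly the bytes Shawn describes as repeating tokens in each delta's private zlib dictionary.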

A large dictionary containing all tokens for the project should
greatly reduce the size of each blob, base and delta alike.  It also
lends itself to creating an efficient full text index.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
       [not found] ` <9e4733910609061617m6783d6c4xaca2f9575e12d455@mail.gmail.com>
@ 2006-09-07  5:39   ` A Large Angry SCM
  0 siblings, 0 replies; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-07  5:39 UTC (permalink / raw)
  To: Jon Smirl, git

Jon Smirl wrote:
> If you need more data to play with you can download the Mozilla pack
> file from here.

Results are about the same as for the other pack files.


pack-a80bec5339b6c50e9f3c8b0dce3e2175cd89b12f.pack

BLOB  :       49860
COMMIT:      197613
DELTA :     1560153
TAG   :        1203
TREE  :      154496
======   ==========
Total :     1963325

Gitzilla_1:   443428349        32   443428317    -7775014   -3.9601   -4.9835
Gitzilla_2:   442177053        32   442177021    -9026310   -4.5975   -5.7855
Pack_2    :   451203363        32   451203331           0    0.0000    0.0000
Pack_3    :   451010859        32   451010827     -192504   -0.0980   -0.1234

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  0:04     ` Jon Smirl
@ 2006-09-07  5:41       ` Shawn Pearce
  0 siblings, 0 replies; 37+ messages in thread
From: Shawn Pearce @ 2006-09-07  5:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: gitzilla, git

Jon Smirl <jonsmirl@gmail.com> wrote:
> The repack code could build a model of the tree as it is repacking,
> that is what fast-import does. If you have a model of the tree then
> when you change a TREE node you track the last sha1 that corresponded
> to that directory path. Now you know what to diff to.

Right.  But that's horribly expensive.  For the most part
pack-objects does a good job estimating this by hashing the trees
by their name and size into the same hash bucket.  And it's fast.

Besides, it's better to store reverse deltas (like what CVS does),
as faster access to the more recent stuff is more interesting than
faster access to the older stuff.  pack-objects tries to do this
based on the order given by rev-list.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  5:27       ` Junio C Hamano
@ 2006-09-07  5:46         ` Shawn Pearce
  2006-09-07 18:50           ` Junio C Hamano
  0 siblings, 1 reply; 37+ messages in thread
From: Shawn Pearce @ 2006-09-07  5:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <junkio@cox.net> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
> 
> > And I'm half-way done with the 64 bit mmap sliding window.  You,
> > Junio and I are all hacking on the same tiny bit of core code for
> > different reasons.  :-)
> 
> Which makes me quite nervous, actually.

What order do you want the patches in then?

I'm willing to go before or after Nico's offset changes, and before
or after your 64 bit index changes.

Either way, two of us are going to have to redo our work on top of
the other's.  I'm finding that I'm basically rewriting the sliding
window code onto your 64 bit version - there's no easy merge here.
And Nico's got the same problem; he's in unpack_delta_entry too.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  5:46         ` Shawn Pearce
@ 2006-09-07 18:50           ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2006-09-07 18:50 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> Junio C Hamano <junkio@cox.net> wrote:
>> Shawn Pearce <spearce@spearce.org> writes:
>> 
>> > And I'm half-way done with the 64 bit mmap sliding window.  You,
>> > Junio and I are all hacking on the same tiny bit of core code for
>> > different reasons.  :-)
>> 
>> Which makes me quite nervous, actually.
>
> What order do you want the patches in then?
>
> I'm willing to go before or after Nico's offset changes, and before
> or after your 64 bit index changes.
>
> Either way two of us are going to have to redo our work on top of
> the others.  I'm finding that I'm basically rewriting the sliding
> window code onto your 64 bit version - there's no easy merge here.
> And Nico's got the same problem, he's in unpack_delta_entry too.

While the 64-bit offset exercise was a good way for me to see where
the <pack_base+offset> assumption lies in the current code, and
while it would have been the right thing to do in an ideal world,
I realize that it may not be the best way to do things in the real
world, for a few reasons:

 - Always doing u64 arithmetic would be expensive, especially on
   32-bit architectures, and doing so conditionally adds quite a
   bit of extra complexity.

 - Not everybody necessarily has large file offsets, so asking to
   mmap 8MB at a file offset of 6GB may not be possible.  Even on
   systems with large file offsets, a truly huge single file is
   more cumbersome to manage than a handful of files each under the
   32-bit offset limit (e.g. using a DVD-RW to back up your packs).

 - Mozilla may not be the largest project in the world, but it is
   certainly on the larger side compared to the majority of our
   target audience, and even its entire history, 450MB, comfortably
   fits in 32-bit space.

So while I merged the 64-bit offset change into the "next" branch,
I am quite doubtful that it was a good change.  As long as we
devise an easy way to keep each packfile under manageable size,
the current 32-bit arrangement would be more beneficial and
practical than doing everything internally in 64-bit.  Larger
projects, if needed, can use multiple packs, and even a project
that _could_ pack everything under 450MB would definitely want
to use multiple packs anyway (i.e. one or more huge historical
packs and one active pack) to avoid repacking and incremental
update cost.  Your "mmap parts into windows" update is of far
more practical value than the idx with 64-bit offsets.  So I am
very inclined to revert my 64-bit offset change from "next".

Before declaring it a failure, in order to keep the window open
for the future, we would need to make sure that we can change
things as needed to do everything we want with 32-bit offset
packs.  The changes we would need I can think of are in the
following areas:

 - "git repack" needs to be taught to limit the resulting packs
   under 32-bit offset.  I think this is a given, and we would
   need some updates to the command because it needs to be
   taught not to touch "huge historical packs" anyway.

 - "git send-pack" and "git fetch-pack" are Ok -- the receiving
   end explodes the pack that comes over the wire using
   unpack-objects into loose objects, and there is no offset
   involved during the process.  Note that pack-objects may need
   to keep the offset in the stream in u64 if we were to do the
   offset encoding of the base object, but that is pretty much
   an independent issue.

 - "git fetch-pack -k" has a problem.  The daemon or sending
   side runs rev-list piped to pack-objects and there is no
   limit in the size of the pack generated in the pipeline.
   Worse yet, the protocol does not allow more than one pack to
   be sent, so we would need a protocol update to say "the data
   you asked for would result in a pack that is more than 4GB, so
   I'll send more than one pack -- here is one, here is
   another, that's all folks".  The receiving end needs to be
   taught to handle this.

Note: I am not proposing to do any of the above right now, since
no real project needs it yet.  We just need to be sure that when
the need arises there is a way out.

The only case that can be problematic is when a single object is
larger than 4GB deflated, and at that point we would hit a very
hard wall of 32-bit offset.  But handling such a huge object
would have other problems [*1*] anyway, so I think it is Ok to
declare that there is a hard limit on deflated object size, at
least for now.

[*1*]

For example, we tend to do everything in core, but doing diffs
or anything else on such a huge blob should probably be done by
swapping in only parts of it at a time.

Also we give the whole deflated size to zlib as avail_in, which
is "uInt", so on architectures where int is shorter than 64-bit
we lose (my 64-bit index patch does not deal with this).  This
however is something your "mmap parts into windows" update would
solve if it were updated to deal with the internal 64-bit
offsets.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 17:32 ` Nicolas Pitre
@ 2006-09-07 19:22   ` linux
  0 siblings, 0 replies; 37+ messages in thread
From: linux @ 2006-09-07 19:22 UTC (permalink / raw)
  To: linux, nico; +Cc: git, gitzilla

> Well... I'm using object type 6 for deltas with an offset reference 
> (instead of object type 7, which is a delta with a sha1 reference).  So 
> there is no extra bit needed and no backward compatibility breakage.

Ah, my apologies!  I didn't realize that you weren't constructing a
whole new pack format, or using a global flag.  In this case,
the overhead is already bought and paid for, so there's no point
in not taking advantage of it.

And thin packs should Just Work, once the builder is taught to
use type 7 objects when necessary.

> There could of course be a new object type whose payload is only a sha1, 
> which type 6 deltas reference.  But that can be introduced at a later date 
> if it turns out to be worthwhile (i.e. if the actual saving in real use 
> scenarios is worth the (small but still) added complexity).

In this case, it's only of potential benefit if there is more than
one reference to an object, and figuring it out is almost certainly
more trouble than it's worth.  Deltas usually form long skinny
chains, not high branching factor bushes.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 17:20 ` Nicolas Pitre
@ 2006-09-07 19:16   ` linux
  0 siblings, 0 replies; 37+ messages in thread
From: linux @ 2006-09-07 19:16 UTC (permalink / raw)
  To: linux, nico; +Cc: git, gitzilla

> So I don't think we would gain that much using that encoding 
> unless/until the pack format is made completely incompatible due to 
> other changes, and that's something we should try to avoid as much as 
> possible anyway.

Ah, I'm sorry; I didn't realize that it was already used elsewhere in
the pack format and this was just code re-use.  That is indeed a good
reason to stick with what's already there.

I agree the savings are minimal, but if you're starting from scratch,
they're free, so why not use them?  I tried to show that the code
complexity is truly negligible.

I'm sorry I didn't realize the details of pack structure before.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 14:39     ` Richard Curnow
@ 2006-09-07 17:40       ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2006-09-07 17:40 UTC (permalink / raw)
  To: Richard Curnow; +Cc: git, linux

Richard Curnow <rc@rc0.org.uk> writes:

>  <linux <at> horizon.com> writes:
>
>> For regular packs, such objects wouldn't even be present, because
>> all base objects are in the pack itself.
>
> It would actually be useful if this restriction were lifted.

The primary reason we require the base object to be present in
the same pack is that our assumption has always been that we mmap
a pack as a whole, and that we can make sure the address space of
a process is enough to mmap at least one pack (otherwise the user
will be given a way to keep the size of each pack down to an
mmapable piece), but not necessarily more than one.  In other
words, while unpacking one deltified object, we wanted to make
sure everything that is necessary is available in memory.
Otherwise, when multiple packs are needed to reconstitute a
single object, because the object is a delta in one pack and its
base is in the other pack, we might not be able to mmap both of
them into the address space at the same time.

The "mmap only parts of a huge packfile into map windows" update
Shawn Pearce is working on would lift that worry, so it would be
less of a reason.

Having said that, people must realize that keeping thin packs on
disk, indexed, is a major undertaking.  It can be made to work,
but it will be painful.

People can grab a thin pack via rsync or http without realizing
it is thin, try to use it and then later (possibly much later,
when a delta that is against a missing object happens to be
needed) discover some of the pack data is unusable.  To avoid
this, the dumb transports need to walk the ancestry chains of
commits and their associated objects while fetching as they do
right now, but also need to walk the delta chains to make sure
their base objects are available.

I like the proposal by Linus to have stub objects that record
what the missing base object is in the thin pack stream.  It
simplifies things for the offset representation of bases.  But
if we were to do so, I suspect indexing such a thing would need
more surgery to the .idx format.  The question is whether we
should record the stub object in the .idx file.

Right now we record the beginning of data for each object
(either base or delta) without storing anything to say about
where it ends -- the pack is assumed to be dense, so the point
where next object begins should be where the current one ends.
This means that we need to build a reverse index to list the
objects in the pack in offset order, when we want to find the
end of each object efficiently.
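The reverse index described here amounts to re-sorting the index entries by offset; a minimal sketch, with illustrative struct and function names:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sort the index entries by pack offset so that each object's on-disk
 * data ends where the next object begins -- the pack is dense, so no
 * per-object length needs to be stored anywhere.
 */
struct rev_entry {
	uint64_t offset;   /* where the object starts in the .pack */
	uint32_t idx_nr;   /* position of this object in the .idx */
};

static int by_offset(const void *a, const void *b)
{
	uint64_t oa = ((const struct rev_entry *)a)->offset;
	uint64_t ob = ((const struct rev_entry *)b)->offset;

	return oa < ob ? -1 : oa > ob;
}

static void build_reverse_index(struct rev_entry *rev, size_t nr)
{
	qsort(rev, nr, sizeof(*rev), by_offset);
	/* now rev[i+1].offset - rev[i].offset is the on-disk size of
	   object i (for the last object, use the end of the pack data) */
}
```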

We work around this when using a pack by not doing the full
verification.  We start decoding the header from the beginning,
and in either a base or a delta the deflated stream sits at the end.
We inflate that deflated stream until zlib says it got enough,
and we do not check whether that "enough" point stepped into the
location that is supposed to be inside the "next" object (if it
did, then it is an error in the packfile that we failed to detect).
If things inflate well and delta applies cleanly, we are happy.

But pack-objects, when reusing the data from an existing pack,
wants to be careful, so it _does_ build a reverse index itself
and makes sure that inflate() ends at the location we expect it
to.  For that process to remain the correct way to determine
where each object ends, the .idx file must record the presence
of stub objects.

However, if we write the names of stub objects in the .idx file,
we have another problem.  Currently, the existence of an object
name in the .idx means that such an object exists in the
corresponding .pack; this will not hold true anymore.

So if we were to do this, we need to give one extra bit to each
of the .idx entries to mark which ones are stub and which ones
are not.  This would allow the reverse index to properly notice
where each object ends, and allow the dumb transports to walk
the .idx for missing base objects efficiently.  The code that
looks at the .idx at runtime to determine which pack has the
needed object needs to notice this bit.

Again, all of this can be made to work, but it will be painful.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  9:07 linux
  2006-09-07 12:57 ` Jon Smirl
@ 2006-09-07 17:32 ` Nicolas Pitre
  2006-09-07 19:22   ` linux
  1 sibling, 1 reply; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07 17:32 UTC (permalink / raw)
  To: linux; +Cc: gitzilla, git

On Thu, 7 Sep 2006, linux@horizon.com wrote:

> > Support for 'thin' packs would pretty much require mixing IDs and
> > (relative) offsets in the same pack file.
> 
> An alternative would be to create a small "placeholder" object that
> just gives an ID, then refer to it by offset.
> 
> That would avoid the need for an id/offset bit with every offset,
> and possibly save more space if the same object was referenced
> multiple times.

Well... I'm using object type 6 for deltas with an offset reference 
(instead of object type 7, which is a delta with a sha1 reference).  So 
there is no extra bit needed and no backward compatibility breakage.

There could of course be a new object type whose payload is only a sha1, 
which type 6 deltas reference.  But that can be introduced at a later date 
if it turns out to be worthwhile (i.e. if the actual saving in real use 
scenarios is worth the (small but still) added complexity).

> And it just seems simpler.

Well, I would not say so.  Nothing is simpler than just prefixing a 
delta with the sha1 of its base object when streaming out a pack.


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 12:57 ` Jon Smirl
  2006-09-07 13:34   ` linux
@ 2006-09-07 17:22   ` A Large Angry SCM
  1 sibling, 0 replies; 37+ messages in thread
From: A Large Angry SCM @ 2006-09-07 17:22 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux, git

Jon Smirl wrote:
> On 7 Sep 2006 05:07:56 -0400, linux@horizon.com <linux@horizon.com> wrote:
>> > Support for 'thin' packs would pretty much require mixing IDs and
>> > (relative) offsets in the same pack file.
>>
>> An alternative would be to create a small "placeholder" object that
>> just gives an ID, then refer to it by offset.
>>
>> That would avoid the need for an id/offset bit with every offset,
>> and possibly save more space if the same object was referenced
>> multiple times.
>>
>> And it just seems simpler.
> 
> There are 2 million objects in the Mozilla pack. This table would take:
> 2M * (20B (sha1) + 10B (object index/overhead)) = 60MB
> This 60MB is pretty much incompressible and increases download time.
> 
> Much better if storage of the sha1s can be totally eliminated and
> replaced by something smaller. Alternatively this map could be
> stripped for transmission and rebuilt locally.
> 

You've lost me. What are you attempting to do again?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07  8:41 linux
@ 2006-09-07 17:20 ` Nicolas Pitre
  2006-09-07 19:16   ` linux
  0 siblings, 1 reply; 37+ messages in thread
From: Nicolas Pitre @ 2006-09-07 17:20 UTC (permalink / raw)
  To: linux; +Cc: gitzilla, git

On Thu, 7 Sep 2006, linux@horizon.com wrote:

> A few notes:
> 
> 
> Re: base-128 encodings, it's a pet peeve of mine that many people, even
> while trying to save space, waste it by allowing redundant encodings.
> The optimal way, assuming msbit=1 means "more", is
> 
> 	0x00 -> 0		0x01 -> 1
> 	0x7f -> 127		0x80 0x00 -> 128
> 	0x80 0x7f -> 255	0x81 0x00 -> 256
> 	0xfe 0x7f -> 16383	0xff 0x00 -> 16384
> 	0xff 0x7f -> 16511	0x80 0x80 0x00 -> 16512

Indeed.  But...

Since we already use 3 bits for the object type, and most objects are 
larger than 15 bytes, with 2 bytes we have 11 size bits, or up to 
2047.  With your encoding that would be 2175.  So a byte would be 
saved only for objects whose size is between 2048 and 2175.  I don't 
know what proportion of objects falls into that range in the average 
pack, but even for those objects that means a reduction of less than 
0.1%, given an average deflate rate of 50%.

And we can forget about cases where the size would require a fourth byte 
or more since saving a byte in those cases is even less significant.
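The 11-bits-in-two-bytes arithmetic comes from the existing pack v2 object header layout; a minimal decoder sketch (the function name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * The first header byte holds 3 type bits and the low 4 size bits;
 * while the msb is set, each following byte contributes 7 more size
 * bits (little-endian).  Two bytes thus give 4 + 7 = 11 size bits,
 * i.e. sizes up to 2047.
 */
static uint64_t decode_obj_header(const unsigned char *p,
				  unsigned *type, size_t *used)
{
	const unsigned char *start = p;
	unsigned char c = *p++;
	uint64_t size = c & 15;
	int shift = 4;

	*type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *p++;
		size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	*used = p - start;
	return size;
}
```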

So I don't think we would gain that much using that encoding 
unless/until the pack format is made completely incompatible due to 
other changes, and that's something we should try to avoid as much as 
possible anyway.
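For completeness, the non-redundant big-endian scheme from the quoted table can be sketched as follows; every continuation byte implies an extra +1, so no value has two encodings (function names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode v most-significant-group first, msbit=1 meaning "more". */
static size_t encode_varint(uint64_t v, unsigned char *buf)
{
	unsigned char tmp[10];
	int i = 0;
	size_t n = 0;

	tmp[i] = v & 0x7f;
	while (v >>= 7) {
		v--;                      /* the implicit +1 per extra byte */
		tmp[++i] = 0x80 | (v & 0x7f);
	}
	while (i >= 0)
		buf[n++] = tmp[i--];      /* emit most significant first */
	return n;
}

static uint64_t decode_varint(const unsigned char *buf, size_t *used)
{
	const unsigned char *p = buf;
	unsigned char c = *p++;
	uint64_t v = c & 0x7f;

	while (c & 0x80) {
		c = *p++;
		v = ((v + 1) << 7) | (c & 0x7f);
	}
	*used = p - buf;
	return v;
}
```

This reproduces the quoted table exactly (e.g. 0x80 0x00 decodes to 128, 0xff 0x7f to 16511).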


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 14:19     ` Jon Smirl
@ 2006-09-07 15:01       ` linux
  0 siblings, 0 replies; 37+ messages in thread
From: linux @ 2006-09-07 15:01 UTC (permalink / raw)
  To: jonsmirl, linux; +Cc: git, gitzilla

> Yes I missed the thin pack context. Has anyone tried building a thin
> pack? Having thin packs would address major concerns with the download
> time of the initial Mozilla checkout. Right now you have to download
> 450MB of data that 99% of the people are never going to look at.

Unfortunately, no, it wouldn't.  There's a second misunderstanding here:
thin packs are used by the git on-the-wire protocol to send a delta
relative to an object *that you know the recipient already has*.

This can greatly speed up incremental updates.  It does NOT have
any effect on the initial checkout, because you know the recipient
has nothing, so you have to send everything.


What you seem to be talking about is what people call a "shallow clone".
Unfortunately, making a shallow clone is difficult for reasons that
are deeply embedded in git's incremental update protocol.

When you fetch from a remote repository, it needs to figure out what
you need and what you already have.  The fundamental assumption which
the protocol uses to send this efficiently is that if you have a commit
object, you have all its ancestors, and their trees and blobs.

So you can say "I have 60d4684068ff1eec78f55b5888d0bd2d4cca1520"
(which is the commit behind tags/v2.6.18-rc5 from the Linux kernel)
and the server you're pulling from knows a lot about what you have.

The first hard part about a shallow repository is describing *efficiently*
to git-upload-pack what you don't have.

The second is figuring out what you should get.

Suppose that you clone from a repository that has a linear chain
of development, and you cut off everything before version X.
Simple enough so far.

Then you try to pull from someone whose work branched off at version
X-1.  Do you want to fetch everything back to the common ancestor?
If you want to cut off their branch, where do we cut it off?  A simple
timestamp limit?  What if you get a bogus timestamp on a commit?

This has been considered by git developers quite carefully and is NOT
an easy problem.  A solution is welcome, but if you come up with an
easy one, please spend a few minutes considering the possibility that
the reason it's so easy is that your solution is flawed.

There are also problems with giving sensible error messages if you
try to refer to a missing object, and with making your clone deeper
if desired, but those are secondary to the big problems.

> I'm not sure I would mix these external references in with other
> objects. Instead I would make a pair of packs, one only contains
> external reference stubs and is tiny. The second is the full version
> of the same data. That makes getting the old data easy, it just
> replaces the stub pack. The stub pack doesn't need to contain stubs
> for all the objects, only the ones referenced upstream.

Huh??  This is flatly impossible, and completely defeats the purpose.
This is one of those "Pray, Mr. Babbage, if you put into the machine
wrong figures, will the right answers come out?" questions.
I think there is a *serious* miscommunication here.

I have a delta-compressed object.  It is stored in the pack as a series
of changes to some base object.  To be of any use, it needs to identify
that base object.  If the base object identification takes the form of
an intra-pack offset (the new proposal) as opposed to a global 20-byte
object ID, then the base object needs to be stored *intra-pack*.
Within the same pack.

The whole idea is "hey, we don't allow delta-ing relative to objects
not in the pack anyway, so why waste space storing such large pointers?"

This would make the currently forbidden case of a delta relative to
an object not in the same pack completely impossible.  It's like
asking for the square root of 2 in roman numerals.

> I haven't tried doing this, does the current git code support having
> multiple pack files covering difference pieces of the project? Can we
> have an archive pack and a local current pack that get merged
> together?

Yes, it supports it perfectly.  A pack is just a bag of objects.  There is
zero required higher-level structure.  In particular, you can have a
"100% blobs" pack with no tree or commit objects at all.

Git uses the object relationships in its compression heuristics, but
it has no effect on correctness.

The only thing is that the original pack was a *self-contained* bag of
objects; every object in the pack could be reconstructed given only the
pack itself.  In particular, cross-pack deltas were forbidden.

However, then someone realized that making an exception for git protocol
updates would be easy and a Really Good Thing.  A pack that is missing
one or more base objects is called a "thin pack".

So the rule I mentioned above has already been broken.  Supporting
that is what I'm talking about.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: A look at some alternative PACK file encodings
  2006-09-07 13:34   ` linux
  2006-09-07 14:19     ` Jon Smirl
@ 2006-09-07 14:39     ` Richard Curnow
  2006-09-07 17:40       ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Richard Curnow @ 2006-09-07 14:39 UTC (permalink / raw)
  To: git

 <linux@horizon.com> writes:

> 
> For regular packs, such objects wouldn't even be present, because
> all base objects are in the pack itself.
> 

It would actually be useful if this restriction were lifted.

Granted, for working repositories there's not much point, because 'git repack -a
-d' can be run regularly these days.

But for repositories served by the dumb http:// transport it makes some sense. 
You don't want to run 'repack -a -d' on those, because anybody tracking them
ends up having to download the entire history again every time they pull.  But
if you pack too often, you incur the cost of the packs containing lots of base
objects - it's as though the delta chain lengths can only ever be about 1 or 2.

If the packs were thin, regular pullers would already have the base objects, and
occasional pullers would just have to download more of the intervening (thin)
packs to get the missing deltas in the middle of the chains.  (I think everyone
has to download all the .idx files in any case.)


* Re: A look at some alternative PACK file encodings
  2006-09-07 13:34   ` linux
@ 2006-09-07 14:19     ` Jon Smirl
  2006-09-07 15:01       ` linux
  2006-09-07 14:39     ` Richard Curnow
  1 sibling, 1 reply; 37+ messages in thread
From: Jon Smirl @ 2006-09-07 14:19 UTC (permalink / raw)
  To: linux; +Cc: git, gitzilla

On 7 Sep 2006 09:34:56 -0400, linux@horizon.com <linux@horizon.com> wrote:
> >> An alternative would be to create a small "placeholder" object that
> >> just gives an ID, then refer to it by offset.
> >>
> >> That would avoid the need for an id/offset bit with every offset,
> >> and possibly save more space if the same object was referenced
> >> multiple times.
> >>
> >> And it just seems simpler.
>
> > There are 2 million objects in the Mozilla pack. This table would take:
> 2M * (20 bytes (SHA-1) + 10 bytes (object index/overhead)) = 60MB
> > This 60MB is pretty much incompressible and increases download time.
> >
> > Much better if storage of the sha1s can be totally eliminated and
> > replaced by something smaller. Alternatively this map could be
> > stripped for transmission and rebuilt locally.
>
> Um, I think I wasn't clear.  Objects in a "thin" pack (for network
> updating of a different pack) that are referred to but not included
> would have stand-ins containing just the object ID.  Objects that *are*
> present would simply be present and referred to by offset as usual.

Yes, I missed the thin pack context. Has anyone tried building a thin
pack? Having thin packs would address major concerns with the download
time of the initial Mozilla checkout. Right now you have to download
450MB of data that 99% of the people are never going to look at.

Thin packs would also set an 'archival' bit as discussed earlier. The
'archival' bit tells the local tools to leave these packs alone. You
don't want to accidentally trigger a 450MB download just because you did a
git-repack. Of course git would need to develop error messages
indicating that you asked for something in the archive that isn't
present.

I'm not sure I would mix these external references in with other
objects. Instead I would make a pair of packs, one only contains
external reference stubs and is tiny. The second is the full version
of the same data. That makes getting the old data easy, it just
replaces the stub pack. The stub pack doesn't need to contain stubs
for all the objects, only the ones referenced upstream.

I haven't tried doing this, does the current git code support having
multiple pack files covering different pieces of the project? Can we
have an archive pack and a local current pack that get merged
together?

> Imagine you have a "thin" pack containing a delta to an object that the
> recipient has, so isn't in the pack.  The delta has to specify the
> base object somehow.  If the base object is in the pack, you can
> specify it by offset.  If it's not, you can either:
>
> - Generalize the base object pointer to allow an object ID option, or
> - Provide a pointer to a magic kind of "external reference" pointer
>   object.
>
> I was proposing the latter.
>
> For regular packs, such objects wouldn't even be present, because
> all base objects are in the pack itself.
>
> And, of course, you'd only create such objects if you needed to,
> if there was at least one pointer to them.
>
> Compared to putting the object ID directly in the pointer, it has
> Cost:   An extra offset pointer and object header.
>         Extra time following the indirection resolving the pointer.
> Benefit: Non-indirect object pointers are a bit smaller.
>         The code is simpler.
>         Second and later references to the same external object are
>         another offset, not another 20 bytes.
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
  2006-09-07 12:57 ` Jon Smirl
@ 2006-09-07 13:34   ` linux
  2006-09-07 14:19     ` Jon Smirl
  2006-09-07 14:39     ` Richard Curnow
  2006-09-07 17:22   ` A Large Angry SCM
  1 sibling, 2 replies; 37+ messages in thread
From: linux @ 2006-09-07 13:34 UTC (permalink / raw)
  To: jonsmirl, linux; +Cc: git, gitzilla

>> An alternative would be to create a small "placeholder" object that
>> just gives an ID, then refer to it by offset.
>>
>> That would avoid the need for an id/offset bit with every offset,
>> and possibly save more space if the same object was referenced
>> multiple times.
>>
>> And it just seems simpler.

> There are 2 million objects in the Mozilla pack. This table would take:
> 2M * (20 bytes (SHA-1) + 10 bytes (object index/overhead)) = 60MB
> This 60MB is pretty much incompressible and increases download time.
> 
> Much better if storage of the sha1s can be totally eliminated and
> replaced by something smaller. Alternatively this map could be
> stripped for transmission and rebuilt locally.

Um, I think I wasn't clear.  Objects in a "thin" pack (for network
updating of a different pack) that are referred to but not included
would have stand-ins containing just the object ID.  Objects that *are*
present would simply be present and referred to by offset as usual.

Imagine you have a "thin" pack containing a delta to an object that the
recipient has, so isn't in the pack.  The delta has to specify the
base object somehow.  If the base object is in the pack, you can
specify it by offset.  If it's not, you can either:

- Generalize the base object pointer to allow an object ID option, or
- Provide a pointer to a magic kind of "external reference" pointer
  object.

I was proposing the latter.

For regular packs, such objects wouldn't even be present, because
all base objects are in the pack itself.

And, of course, you'd only create such objects if you needed to,
if there was at least one pointer to them.

Compared to putting the object ID directly in the pointer, it has
Cost:	An extra offset pointer and object header.
	Extra time following the indirection resolving the pointer.
Benefit: Non-indirect object pointers are a bit smaller.
	The code is simpler.
	Second and later references to the same external object are
	another offset, not another 20 bytes.


* Re: A look at some alternative PACK file encodings
  2006-09-07  9:07 linux
@ 2006-09-07 12:57 ` Jon Smirl
  2006-09-07 13:34   ` linux
  2006-09-07 17:22   ` A Large Angry SCM
  2006-09-07 17:32 ` Nicolas Pitre
  1 sibling, 2 replies; 37+ messages in thread
From: Jon Smirl @ 2006-09-07 12:57 UTC (permalink / raw)
  To: linux; +Cc: gitzilla, git

On 7 Sep 2006 05:07:56 -0400, linux@horizon.com <linux@horizon.com> wrote:
> > Support for 'thin' packs would pretty much require mixing IDs and
> > (relative) offsets in the same pack file.
>
> An alternative would be to create a small "placeholder" object that
> just gives an ID, then refer to it by offset.
>
> That would avoid the need for an id/offset bit with every offset,
> and possibly save more space if the same object was referenced
> multiple times.
>
> And it just seems simpler.

There are 2 million objects in the Mozilla pack. This table would take:
2M * (20 bytes (SHA-1) + 10 bytes (object index/overhead)) = 60MB
This 60MB is pretty much incompressible and increases download time.

Much better if storage of the sha1s can be totally eliminated and
replaced by something smaller. Alternatively this map could be
stripped for transmission and rebuilt locally.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: A look at some alternative PACK file encodings
@ 2006-09-07  9:07 linux
  2006-09-07 12:57 ` Jon Smirl
  2006-09-07 17:32 ` Nicolas Pitre
  0 siblings, 2 replies; 37+ messages in thread
From: linux @ 2006-09-07  9:07 UTC (permalink / raw)
  To: gitzilla; +Cc: git, linux

> Support for 'thin' packs would pretty much require mixing IDs and
> (relative) offsets in the same pack file.

An alternative would be to create a small "placeholder" object that
just gives an ID, then refer to it by offset.

That would avoid the need for an id/offset bit with every offset,
and possibly save more space if the same object was referenced
multiple times.

And it just seems simpler.


* Re: A look at some alternative PACK file encodings
@ 2006-09-07  8:41 linux
  2006-09-07 17:20 ` Nicolas Pitre
  0 siblings, 1 reply; 37+ messages in thread
From: linux @ 2006-09-07  8:41 UTC (permalink / raw)
  To: gitzilla; +Cc: git, linux

A few notes:


Re: base-128 encodings, it's a pet peeve of mine that many people, even
while trying to save space, waste it by allowing redundant encodings.
The optimal way, assuming msbit=1 means "more", is

	0x00 -> 0		0x01 -> 1
	0x7f -> 127		0x80 0x00 -> 128
	0x80 0x7f -> 255	0x81 0x00 -> 256
	0xfe 0x7f -> 16383	0xff 0x00 -> 16384
	0xff 0x7f -> 16511	0x80 0x80 0x00 -> 16512

The decoding code can be written several ways, but try:

	unsigned char *p;	/* input position */
	uintmax_t x;		/* decoded value */
	int c;

	c = *p++;
	x = c & 127;
	while (c & 128) {
		c = *p++;
		x = ((x + 1) << 7) + (c & 127);
	}

encoding is most easily done in reverse:

	unsigned char buf[10];	/* 10 bytes cover a full 64-bit value */
	unsigned char *p = buf + 9;

	*p = x & 127;
	while (x >>= 7)
		*--p = 0x80 | (--x & 127);

	write_out(p, buf + 10 - p);


If you want to do signed offsets, the easiest way to write the code
is to convert to an unsigned with the sign bit as lsbit:

	u = (s << 1) ^ -(s < 0);

And then feed the resulting unsigned number to the functions above.
Handling the sign bit specially in the encoding is messier.


And finally, regarding qsort(), stability is not guaranteed, but glibc
actually uses a stable merge sort if it can allocate the memory.  See
http://sourceware.org/cgi-bin/cvsweb.cgi/libc/stdlib/msort.c?rev=1.21&cvsroot=glibc

This leads to the slightly annoying question of whether it's worth
writing a stable sort just to make git work a little bit better
on non-glibc platforms, when it doesn't affect most current users
personally and it still works correctly with an unstable qsort.

It might be simplest to lift the glibc mergesort implementation,
possibly inlining the compares for efficiency.  If you touch the code,
please (my eyes!  it hurts us!) consider fixing the brace style.


end of thread, other threads:[~2006-09-07 19:22 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-06 21:47 A look at some alternative PACK file encodings A Large Angry SCM
2006-09-06 23:23 ` Jon Smirl
2006-09-06 23:39   ` A Large Angry SCM
2006-09-06 23:56     ` Linus Torvalds
2006-09-07  0:10       ` Jon Smirl
2006-09-07  0:06         ` David Lang
2006-09-07  0:19       ` A Large Angry SCM
2006-09-07  0:45         ` Linus Torvalds
2006-09-07  0:37       ` Nicolas Pitre
2006-09-07  0:04     ` Jon Smirl
2006-09-07  5:41       ` Shawn Pearce
2006-09-07  5:34     ` Shawn Pearce
2006-09-07  0:40   ` Nicolas Pitre
2006-09-07  0:59     ` Jon Smirl
2006-09-07  2:30       ` Nicolas Pitre
2006-09-07  2:33       ` A Large Angry SCM
2006-09-07  1:11     ` Junio C Hamano
2006-09-07  2:47       ` Nicolas Pitre
2006-09-07  4:33     ` Shawn Pearce
2006-09-07  5:27       ` Junio C Hamano
2006-09-07  5:46         ` Shawn Pearce
2006-09-07 18:50           ` Junio C Hamano
2006-09-07  5:21   ` Shawn Pearce
     [not found] ` <9e4733910609061617m6783d6c4xaca2f9575e12d455@mail.gmail.com>
2006-09-07  5:39   ` A Large Angry SCM
2006-09-07  8:41 linux
2006-09-07 17:20 ` Nicolas Pitre
2006-09-07 19:16   ` linux
2006-09-07  9:07 linux
2006-09-07 12:57 ` Jon Smirl
2006-09-07 13:34   ` linux
2006-09-07 14:19     ` Jon Smirl
2006-09-07 15:01       ` linux
2006-09-07 14:39     ` Richard Curnow
2006-09-07 17:40       ` Junio C Hamano
2006-09-07 17:22   ` A Large Angry SCM
2006-09-07 17:32 ` Nicolas Pitre
2006-09-07 19:22   ` linux
