All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: pack v4 status
@ 2007-02-28 10:04 linux
  0 siblings, 0 replies; 10+ messages in thread
From: linux @ 2007-02-28 10:04 UTC (permalink / raw)
  To: spearce; +Cc: git, linux

Just a minor note:

I you want to have an arbitrary-sized string table without special cases,
the standard binary Huffman tree construction algorithm can be extended
quite simply to a radix-256 tree, so every code is an integer number
of bytes long.

Take the input strings, and sort them in order of frequency of appearance.
Pad them at the end with dummy strings (frequency = 0) until the number
equals 1 mod 255.

Then take the last (least frequent) 256 strings off the list, and
replace them with a single prefix whose frequency is the sum of all of
the components.  Each of the selected strings will be encoded as the
prefix (whose code has not yet been decided) plus 256 possible byte values.
Insert the new prefix into the list in sorted order.

Repeat until the list has only one entry, which is represented by a
zero-length prefix.  (Since each step replaces 256 entries by 1 entry,
the number of entries never changes mod 255, so the algorithm always will
converge to one entry.)


This is generally encoded more compactly as a "canonical Huffman tree".
For each string, forget the actual encoding decided on here, and just
remember the number of bytes assigned to encode it; this is the only
number you need to record to reconstruct the tree for decoding.
Then count the number of strings to be encoded with 1, 2, 3... bytes.

To decode, read one byte.  If the byte is less than the number of 1-byte
strings, its value is the string's position in the 1-byte string table.

If the byte is >= the number of 1-byte strings, subtract the number of 1-byte
strings, multiply by 256, and add a following byte.  This is the position in
the 2-byte string table.

If the value is >= the number of 2-byte strings, subtract, multiply by
256, and add another trailing byte.  This is the position in the 3-byte
string table.  Repeat as necessary.


There are elaborations on this to bound the maximum number of bytes
required; see the zlib source for details.

If you have a significant number of unique (frequency == 1) strings,
you can encode them in-line after a common "unique string" prefix
(which can have a frequency greater than 1).


I'm not sure if you can follow all that (it helps to understand Huffmann
trees already), but I can explain at greater length if necessary.
The important thing is that it's quite straightforward and provides
a general solution.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 15:50 Shawn O. Pearce
  2007-02-27 21:51 ` Linus Torvalds
@ 2007-02-28  4:13 ` Shawn O. Pearce
  1 sibling, 0 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2007-02-28  4:13 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> Obviously this series has a heavy hand on sha1_file.c,
> builtin-pack-objects.c, builtin-unpack-objects.c, index-pack.c.
> But it will also start to hit less obvious places like commit.c
> and tree-walk.c as we start to support walking the encoded objects
> directly.
> 
> Given the huge size of the series, and the amount of effort we are
> tossing into it, and the fact that I'm trying to make it pu-ready by
> early next week, we would appreciate it if folks could keep changes
> to the above mentioned files limited to critical bug fixes only.  :)

After rereading this message[*1*] several hours later, and also
just learning that Nico will be taking a probably much deserved
vacation soon, my concept of getting this series ready for Junio's
pu-branch by next week is just far too aggressive to be practical.

Nico and I have had a lot of off-list discussion about possible
things we can do with the various encodings and what types of
runtime optimizations they could actually support.  These have
created a lot of cases we really need to explore and confirm as
useful or verify as not-useful before submitting a series to Junio.

Improving the packfile format is a relatively non-trivial change;
we would hate to see people actually start to use a pack v4 file
only to have the encoding change *again* in the near future due to
a better optimization being selected and employed.

That said, I would like to keep the development openly visible
(hence its on repo.or.cz) and would like to encourage people to send
ideas, suggestions, etc.  I think it is very helpful when everyone
knows what is actively being worked on, so they can contribute to
projects they may be interested in, and avoid causing conflicts
with those projects when possible.  ;-)


*1*: Yea, I really do read my own email when vger finally gets
     around to bouncing it back to me.  ;-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 22:36     ` Junio C Hamano
@ 2007-02-28  3:45       ` Shawn O. Pearce
  0 siblings, 0 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2007-02-28  3:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, git

Junio C Hamano <junkio@cox.net> wrote:
> Nicolas Pitre <nico@cam.org> writes:
> > The idea is to deal with only tree objects containing the 64K most 
> > frequently used base names and fall back to the current tree object 
> > encoding for objects that couldn't be represented that way.
> 
> Ah, I was wondering the same thing as Linus after seeing shawn
> talked about the 2-byte prefix on #git. Falling back to an
> alternate encoding for rarer cases makes sense.

Right.  Git is already fast, and already compresses the object data
very well.  But I think we can make things faster without violating
the basic assumptions of "whole project history", and it just turns
out that those encodings are also making the data smaller for the
common case of human maintained source code.  Which of course is
one of the primary uses for Git, but is obviously not the only use.

In the worst case scenario we'll be doing exactly what we are
doing today with regards to encoding. That performance and disk
space usage is already known and considered "very, very fast" and
"very small".  ;-)

In the best case scenario (human managed source like linux.git,
git.git) we'll scream with pack v4.  The rev-list stats I posted
from just the tree encoding switch not only saved 3 MiB of disk
space but improved total running time by 12.5%.  Nico and I know
we can still do better.

With 15k basenames in linux.git we're filling only 23.6% of the
available namespace within a single packfile.  I think that by the
time we have enough basenames to break 64K we'll be several years
out and be talking about historical packs vs. active packs.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 22:32   ` Nicolas Pitre
  2007-02-27 22:36     ` Junio C Hamano
@ 2007-02-28  1:19     ` Nicolas Pitre
  1 sibling, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2007-02-28  1:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git

On Tue, 27 Feb 2007, Nicolas Pitre wrote:

> For reference the GIT tree itself has 585 unique names.
> 
> The Linux kernel has 12263 of them.

OK I lied.  In fact those are the frequency of the most used name 
amongst all tree objects, not the number of unique names.

The real number of unique path components in the Linux repo is 16073.

For the GIT tree it is 1444.

Just for fun here's the 10 most used path components in the Linux kernel 
with their number of hits:

12263:  Makefile
5541:   Kconfig
4050:   kernel
3960:   net
3818:   lib
3785:   mm
3665:   crypto
3337:   sound
3185:   include
3181:   README


Nicolas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 22:32   ` Nicolas Pitre
@ 2007-02-27 22:36     ` Junio C Hamano
  2007-02-28  3:45       ` Shawn O. Pearce
  2007-02-28  1:19     ` Nicolas Pitre
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2007-02-27 22:36 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Shawn O. Pearce, git

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 27 Feb 2007, Linus Torvalds wrote:
>
>> On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
>> >  ...
>> > All trees are then converted to use a 22 byte record format:
>> > 
>> >   - 2 byte network byte order index into the string pool
>> >   - 20 byte SHA-1
>> 
>> Umm. Am I missing something, or is this totally braindamaged?
>> 
>> Are you really expecting there to never be more than 64k basenames? Trust 
>> me, that's a totally broken assumption. Anything that tracks generated 
>> stuff will _easily_ have several tens of thousands of random filenames 
>> even in a single tree, much less over the whole history of the repository.
>
> The idea is to deal with only tree objects containing the 64K most 
> frequently used base names and fall back to the current tree object 
> encoding for objects that couldn't be represented that way.

Ah, I was wondering the same thing as Linus after seeing shawn
talked about the 2-byte prefix on #git. Falling back to an
alternate encoding for rarer cases makes sense.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 22:15   ` Johannes Schindelin
@ 2007-02-27 22:33     ` Nicolas Pitre
  0 siblings, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2007-02-27 22:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, Shawn O. Pearce, git

On Tue, 27 Feb 2007, Johannes Schindelin wrote:

> Hi,
> 
> On Tue, 27 Feb 2007, Linus Torvalds wrote:
> 
> > On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
> > > 
> > > We have thus far reformatted OBJ_TREEs with a new dictionary based
> > > compression scheme.  In this scheme we pool the filenames and modes
> > > that appear within trees into a single table within the packfile.
> > > All trees are then converted to use a 22 byte record format:
> > > 
> > >   - 2 byte network byte order index into the string pool
> > >   - 20 byte SHA-1
> > 
> > Umm. Am I missing something, or is this totally braindamaged?
> > 
> > Are you really expecting there to never be more than 64k basenames? 
> > Trust me, that's a totally broken assumption. Anything that tracks 
> > generated stuff will _easily_ have several tens of thousands of random 
> > filenames even in a single tree, much less over the whole history of the 
> > repository.
> 
> The sane thing, of course, is to use some sort of prefix coding, together 
> with an escape code.

No.  There is a fundamental reason for having a fixed size tree record.


Nicolas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 21:51 ` Linus Torvalds
  2007-02-27 22:15   ` Johannes Schindelin
@ 2007-02-27 22:32   ` Nicolas Pitre
  2007-02-27 22:36     ` Junio C Hamano
  2007-02-28  1:19     ` Nicolas Pitre
  1 sibling, 2 replies; 10+ messages in thread
From: Nicolas Pitre @ 2007-02-27 22:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git

On Tue, 27 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
> > 
> > We have thus far reformatted OBJ_TREEs with a new dictionary based
> > compression scheme.  In this scheme we pool the filenames and modes
> > that appear within trees into a single table within the packfile.
> > All trees are then converted to use a 22 byte record format:
> > 
> >   - 2 byte network byte order index into the string pool
> >   - 20 byte SHA-1
> 
> Umm. Am I missing something, or is this totally braindamaged?
> 
> Are you really expecting there to never be more than 64k basenames? Trust 
> me, that's a totally broken assumption. Anything that tracks generated 
> stuff will _easily_ have several tens of thousands of random filenames 
> even in a single tree, much less over the whole history of the repository.

The idea is to deal with only tree objects containing the 64K most 
frequently used base names and fall back to the current tree object 
encoding for objects that couldn't be represented that way.

For reference the GIT tree itself has 585 unique names.

The Linux kernel has 12263 of them.

If we eventually find it is common and performance critical to have more 
bits to represent those indices because the number of unique path 
components far exceeds that limit with an even distribution then we 
might just add another tree encoding with a 3-byte index for those.

In the end everything translates back to the same object.


Nicolas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 21:51 ` Linus Torvalds
@ 2007-02-27 22:15   ` Johannes Schindelin
  2007-02-27 22:33     ` Nicolas Pitre
  2007-02-27 22:32   ` Nicolas Pitre
  1 sibling, 1 reply; 10+ messages in thread
From: Johannes Schindelin @ 2007-02-27 22:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git, Nicolas Pitre

Hi,

On Tue, 27 Feb 2007, Linus Torvalds wrote:

> On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
> > 
> > We have thus far reformatted OBJ_TREEs with a new dictionary based
> > compression scheme.  In this scheme we pool the filenames and modes
> > that appear within trees into a single table within the packfile.
> > All trees are then converted to use a 22 byte record format:
> > 
> >   - 2 byte network byte order index into the string pool
> >   - 20 byte SHA-1
> 
> Umm. Am I missing something, or is this totally braindamaged?
> 
> Are you really expecting there to never be more than 64k basenames? 
> Trust me, that's a totally broken assumption. Anything that tracks 
> generated stuff will _easily_ have several tens of thousands of random 
> filenames even in a single tree, much less over the whole history of the 
> repository.

The sane thing, of course, is to use some sort of prefix coding, together 
with an escape code.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: pack v4 status
  2007-02-27 15:50 Shawn O. Pearce
@ 2007-02-27 21:51 ` Linus Torvalds
  2007-02-27 22:15   ` Johannes Schindelin
  2007-02-27 22:32   ` Nicolas Pitre
  2007-02-28  4:13 ` Shawn O. Pearce
  1 sibling, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2007-02-27 21:51 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, Nicolas Pitre



On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
> 
> We have thus far reformatted OBJ_TREEs with a new dictionary based
> compression scheme.  In this scheme we pool the filenames and modes
> that appear within trees into a single table within the packfile.
> All trees are then converted to use a 22 byte record format:
> 
>   - 2 byte network byte order index into the string pool
>   - 20 byte SHA-1

Umm. Am I missing something, or is this totally braindamaged?

Are you really expecting there to never be more than 64k basenames? Trust 
me, that's a totally broken assumption. Anything that tracks generated 
stuff will _easily_ have several tens of thousands of random filenames 
even in a single tree, much less over the whole history of the repository.

			Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* pack v4 status
@ 2007-02-27 15:50 Shawn O. Pearce
  2007-02-27 21:51 ` Linus Torvalds
  2007-02-28  4:13 ` Shawn O. Pearce
  0 siblings, 2 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2007-02-27 15:50 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre

Nico's and my packv4 topic is available from my fastimport.git fork
on repo.or.cz:

  gitweb: http://repo.or.cz/w/git/fastimport.git
  git:    git://repo.or.cz/git/fastimport.git
  branch: sp/pack4

We have thus far reformatted OBJ_TREEs with a new dictionary based
compression scheme.  In this scheme we pool the filenames and modes
that appear within trees into a single table within the packfile.
All trees are then converted to use a 22 byte record format:

  - 2 byte network byte order index into the string pool
  - 20 byte SHA-1

These trees are then stored *uncompressed* within the packfile,
but are also still stored using our standard delta system (only the
deltas for these trees are also stored uncompressed).  The resulting
savings is pretty good; on linux-2.6.git we are saving ~3.8 MiB as
a result of this encoding alone:

	141649022 pack2-linuxA.git
	137625761 pack4-linuxB.git

read_sha1_file() has been modified to unpack this new tree format
back into the canonical format; something that I think is very
unncessary for runtime given how easy it is to iterate the encoded
tree, but is still critically important for tools like git-cat-file,
git-index-pack and git-verify-pack.  Future plans are to iterate
the encoded tree directly, but performance is already faster despite
needing to reconvert the tree:

  lh=825020c3866e7312947e17a0caa9dd1a5622bafc
  git --git-dir=pack2-linux.git rev-list $lh -- include/asm-m68k
        3.97 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys
        3.98 real         3.60 user         0.15 sys

  git --git-dir=pack4-linux.git rev-list $lh -- include/asm-m68k
        3.52 real         3.17 user         0.13 sys
        3.46 real         3.17 user         0.13 sys
        3.51 real         3.17 user         0.13 sys
        3.52 real         3.18 user         0.13 sys
        3.53 real         3.16 user         0.13 sys

I'll take 500 milliseconds savings anyday, thanks!  :-)


Nico and I have only started working on commits, so the above results
still utilize the packv2 format for OBJ_COMMIT and do not take into
account any of our proposed concepts there.


The impetus for packv4 is to format the packfile in such a way that
we can work with the data faster at runtime for common operations,
like rev-list and its builtin path limiter.  We also want to make
reachability analysis (critical for packing and fsck) faster.
Any reduction in storage size is considered a bonus here, though
obviously there is some correlation between size of input data and
the time required to process it.  ;-)

The patch series for this is getting large.  Right now we are up to
32 patches in the series.  Given where we are and where we want to
go I'm predicting this series will come out at close to 100 patches.
Of course that's partly because I'm working in fairly small units,
slowly iterating the code into the final version we want.

I am constantly rebasing the sp/pack4 topic noted above, so the
patch count is not really because I'm going back and fixing things
in later patches.  Its because I'm trying to slowly iterate the
runtime side of things in digestable changes, then the packing side,
so that the system still works at every single commit in the series.
Yes, its a *BIG* set of code changed.

Obviously this series has a heavy hand on sha1_file.c,
builtin-pack-objects.c, builtin-unpack-objects.c, index-pack.c.
But it will also start to hit less obvious places like commit.c
and tree-walk.c as we start to support walking the encoded objects
directly.

Given the huge size of the series, and the amount of effort we are
tossing into it, and the fact that I'm trying to make it pu-ready by
early next week, we would appreciate it if folks could keep changes
to the above mentioned files limited to critical bug fixes only.  :)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-02-28 11:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-28 10:04 pack v4 status linux
  -- strict thread matches above, loose matches on Subject: below --
2007-02-27 15:50 Shawn O. Pearce
2007-02-27 21:51 ` Linus Torvalds
2007-02-27 22:15   ` Johannes Schindelin
2007-02-27 22:33     ` Nicolas Pitre
2007-02-27 22:32   ` Nicolas Pitre
2007-02-27 22:36     ` Junio C Hamano
2007-02-28  3:45       ` Shawn O. Pearce
2007-02-28  1:19     ` Nicolas Pitre
2007-02-28  4:13 ` Shawn O. Pearce

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.