All of lore.kernel.org
 help / color / mirror / Atom feed
* GSoC - Designing a faster index format
@ 2012-03-20 23:10 elton sky
  2012-03-21  1:18 ` Nguyen Thai Ngoc Duy
  2012-03-21 11:25 ` Thomas Rast
  0 siblings, 2 replies; 33+ messages in thread
From: elton sky @ 2012-03-20 23:10 UTC (permalink / raw)
  To: Git Mailing List

Hello everyone,

I am not sure if this is the right way to apply GSoC or if this
project is still available?  I just subscribed a few hours ago, don't
blame me :)

I am interested with "Designing a faster index format" project.

I am new to git, only started using git yesterday.

The reasons for applying this is:
1. I like C
2. I like doing optimization
3. I want to contribute to a open source project - to git, a plus

From the idea, I realize the problem is that index is verified and
rewritten on any operations which is unnecessary sometimes. And the
objective is to reduce the number of operations to below logN.  As I
am new to git, I  I couldn't give a detailed plan to this for now. I
should have gonna through more documents or codes but there's only one
week for application. So I have to jump up from nowhere :P

I got questions like: how each operations affect index? how cache tree
data and index is stored?
Maybe you can point me how I should catch up quickly. I went through
the article "git-for-computer-scientists", that quite makes sense.

About me:

My name is Elton Tian, I am from China. I have been living in
Australia for quite a few years. I am currently a Master student from
Australia National University. I have lots of experience in C during
my undergraduate: socket, RPC, pthread, etc.. After graduate I worked
with linux based web development for 2.5 years. We used tcl, apache
and cvs. Last year, as a student project, I had chance to benchmark
and modify hadoop distributed file system and mapreduce on a small
cluster. I have to configure and maintain the cluster by myself.

My IRC nick is "eltonsky".


Regards,
Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-20 23:10 GSoC - Designing a faster index format elton sky
@ 2012-03-21  1:18 ` Nguyen Thai Ngoc Duy
  2012-03-21 11:25 ` Thomas Rast
  1 sibling, 0 replies; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-21  1:18 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List

On Wed, Mar 21, 2012 at 6:10 AM, elton sky <eltonsky9404@gmail.com> wrote:
> From the idea, I realize the problem is that index is verified and
> rewritten on any operations which is unnecessary sometimes. And the
> objective is to reduce the number of operations to below logN.  As I
> am new to git, I  I couldn't give a detailed plan to this for now. I
> should have gonna through more documents or codes but there's only one
> week for application. So I have to jump up from nowhere :P

Understanding current index format would be a good start, I think:
Documentation/technical/index-format.txt. For reading index code, look
at read_index_from() in read-cache.c (many if not all index
manipulation are in this file)

> I got questions like: how each operations affect index?

For writing part, commands that call refresh_index() can update stat
info for many many entries. git-add, git-update, git-mv and git-rm can
add/remove entries from the index. Merge/checkout oeprations
(git-reset, git-checkout, git-merge..) can rewrite the whole index. I
think this proposal aims to speed up refresh_index and add/remove
operations, not the last one.

To speed up reading part (you can grep read_cache() to see how many
commands read index), you may need to do something with index
integrity check. Currently it calculates SHA-1 of the entire index,
then checks against the stored value at the end of index. Calculating
SHA-1 can be really expensive on big index.

> how cache tree data and index is stored?

Cache tree is stored as an optional index extension. It's also
documented in index-format.txt. Or you can look at cache-tree.[ch]
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-20 23:10 GSoC - Designing a faster index format elton sky
  2012-03-21  1:18 ` Nguyen Thai Ngoc Duy
@ 2012-03-21 11:25 ` Thomas Rast
  2012-03-21 12:01   ` elton sky
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Rast @ 2012-03-21 11:25 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List

elton sky <eltonsky9404@gmail.com> writes:

> I got questions like: how each operations affect index? how cache tree
> data and index is stored?
> Maybe you can point me how I should catch up quickly. I went through
> the article "git-for-computer-scientists", that quite makes sense.

In addition to what Nguyen Thai Ngoc Duy said, check out the
(sub)threads

  http://thread.gmane.org/gmane.comp.version-control.git/190016/focus=190132
  [origins of the GSoC project idea]

  http://thread.gmane.org/gmane.comp.version-control.git/192014/focus=192025
  [perspectives of core developers in reply to the idea]

  http://thread.gmane.org/gmane.comp.version-control.git/186244/focus=186282
  http://thread.gmane.org/gmane.comp.version-control.git/186357
  [the last few discussions about cache-tree]

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-21 11:25 ` Thomas Rast
@ 2012-03-21 12:01   ` elton sky
  2012-03-22 20:32     ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-03-21 12:01 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy, Thomas Rast; +Cc: Git Mailing List

Hi Nguyen, Thomas

Thanks for the points &clues. Processing them...

-Elton

On Wed, Mar 21, 2012 at 10:25 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> elton sky <eltonsky9404@gmail.com> writes:
>
>> I got questions like: how each operations affect index? how cache tree
>> data and index is stored?
>> Maybe you can point me how I should catch up quickly. I went through
>> the article "git-for-computer-scientists", that quite makes sense.
>
> In addition to what Nguyen Thai Ngoc Duy said, check out the
> (sub)threads
>
>  http://thread.gmane.org/gmane.comp.version-control.git/190016/focus=190132
>  [origins of the GSoC project idea]
>
>  http://thread.gmane.org/gmane.comp.version-control.git/192014/focus=192025
>  [perspectives of core developers in reply to the idea]
>
>  http://thread.gmane.org/gmane.comp.version-control.git/186244/focus=186282
>  http://thread.gmane.org/gmane.comp.version-control.git/186357
>  [the last few discussions about cache-tree]
>
> --
> Thomas Rast
> trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-21 12:01   ` elton sky
@ 2012-03-22 20:32     ` elton sky
  2012-03-23  0:46       ` Jakub Narebski
  2012-03-23  1:30       ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 33+ messages in thread
From: elton sky @ 2012-03-22 20:32 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy, Thomas Rast, Git Mailing List

Got a few questions:

1. index is used for building next commit, so it should only include
files created/modified/deleted. But I see it has all entries for
current working dir. why?

2. From read_index_from() I see the whole index is read into mem, and
write one by one (entry/ext) back to disk. This makes sense. But why
we have to compute Sha1 for all entries, especially unchanged entries?

3. how does git track updated files? Does it compare the ts between
working dir and index ? Or they are recorded somewhere?

4. When does git insert to cache tree? and when it retrieve from it?


Some early thoughts for the tree format:

We can use B tree like format. Keep the header in the beginning of the
file as is, but add file length (4bytes) and the pointer to extensions
(8bytes) into header.
Entry list follows the header. The entry starts with number of
children offsets (1 byte) followed by list of offsets (4 bytes each).
We can limit the number for balance. Other fields leave as is.
Extensions can locate in between entries.

Use Sha1 , rather than the path, as the key for each entry node. This
beats the case like 1000 files in a dir which breaks the balance of
the tree, as Thomas mentioned. If a file is updated, the old Sha1 can
be found in object dir. This also gives flexibility. We may use splay
tree, in order to move updated nodes close to the root. The downside
is full path has to be stored in entry.

Regards,
Elton

On Wed, Mar 21, 2012 at 11:01 PM, elton sky <eltonsky9404@gmail.com> wrote:
> Hi Nguyen, Thomas
>
> Thanks for the points &clues. Processing them...
>
> -Elton
>
> On Wed, Mar 21, 2012 at 10:25 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>> elton sky <eltonsky9404@gmail.com> writes:
>>
>>> I got questions like: how each operations affect index? how cache tree
>>> data and index is stored?
>>> Maybe you can point me how I should catch up quickly. I went through
>>> the article "git-for-computer-scientists", that quite makes sense.
>>
>> In addition to what Nguyen Thai Ngoc Duy said, check out the
>> (sub)threads
>>
>>  http://thread.gmane.org/gmane.comp.version-control.git/190016/focus=190132
>>  [origins of the GSoC project idea]
>>
>>  http://thread.gmane.org/gmane.comp.version-control.git/192014/focus=192025
>>  [perspectives of core developers in reply to the idea]
>>
>>  http://thread.gmane.org/gmane.comp.version-control.git/186244/focus=186282
>>  http://thread.gmane.org/gmane.comp.version-control.git/186357
>>  [the last few discussions about cache-tree]
>>
>> --
>> Thomas Rast
>> trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-22 20:32     ` elton sky
@ 2012-03-23  0:46       ` Jakub Narebski
  2012-03-23  1:30       ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2012-03-23  0:46 UTC (permalink / raw)
  To: elton sky; +Cc: Nguyen Thai Ngoc Duy, Thomas Rast, Git Mailing List

elton sky <eltonsky9404@gmail.com> writes:

> Got a few questions:
> 
> 1. index is used for building next commit, so it should only include
> files created/modified/deleted. But I see it has all entries for
> current working dir. why?

Because git is snapshot based, not changeset based.  Ecah commit
stores state of repository in the form of 'tree' object, which is
build out of index.
 
Also index stores extra information, like mtime, about all files
to make operations faster (skip unchanged files).  The index was
originally at the very beginning named dircache.

[...]
-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-22 20:32     ` elton sky
  2012-03-23  0:46       ` Jakub Narebski
@ 2012-03-23  1:30       ` Nguyen Thai Ngoc Duy
  2012-03-23 10:27         ` elton sky
  1 sibling, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-23  1:30 UTC (permalink / raw)
  To: elton sky; +Cc: Thomas Rast, Git Mailing List

On Fri, Mar 23, 2012 at 3:32 AM, elton sky <eltonsky9404@gmail.com> wrote:
> Got a few questions:
>
> 1. index is used for building next commit, so it should only include
> files created/modified/deleted. But I see it has all entries for
> current working dir. why?

Jakub has answered this question.

> 2. From read_index_from() I see the whole index is read into mem, and
> write one by one (entry/ext) back to disk. This makes sense. But why
> we have to compute Sha1 for all entries, especially unchanged entries?

To catch disk corruption. If a bit is flipped anywhere in the index
and we do not detect it, we may end up creating broken commits.

> 3. how does git track updated files? Does it compare the ts between
> working dir and index ? Or they are recorded somewhere?

Check out refresh_cache_ent. At the beginning of most commands, they
call refresh_index() or refresh_cache(), which checks a file's mtime
against one stored in index (different means updated). In the worst
scenario, refresh_cache_ent may call ce_compare_data(), which computes
SHA-1 of the specified file and compare it with one stored in index.

> 4. When does git insert to cache tree? and when it retrieve from it?

cache-tree is built from scratch in some cases, when we know HEAD (or
some tree) matches index exactly (e.g. reset --hard). Usually it's
only built up at commit time (update_main_cache_tree in
builtin/commit.c).
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-23  1:30       ` Nguyen Thai Ngoc Duy
@ 2012-03-23 10:27         ` elton sky
  2012-03-23 11:24           ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-03-23 10:27 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Nguyen Thai Ngoc Duy, Thomas Rast

Hi Nguyen, Jakub

Thank you for your explanations.

Just clarify question about track updated files:

On Fri, Mar 23, 2012 at 12:30 PM, Nguyen Thai Ngoc Duy
<pclouds@gmail.com> wrote:
> On Fri, Mar 23, 2012 at 3:32 AM, elton sky <eltonsky9404@gmail.com> wrote:
>> Got a few questions:
>>
>> 1. index is used for building next commit, so it should only include
>> files created/modified/deleted. But I see it has all entries for
>> current working dir. why?
>
> Jakub has answered this question.
>
>> 2. From read_index_from() I see the whole index is read into mem, and
>> write one by one (entry/ext) back to disk. This makes sense. But why
>> we have to compute Sha1 for all entries, especially unchanged entries?
>
> To catch disk corruption. If a bit is flipped anywhere in the index
> and we do not detect it, we may end up creating broken commits.
>
>> 3. how does git track updated files? Does it compare the ts between
>> working dir and index ? Or they are recorded somewhere?
>
> Check out refresh_cache_ent. At the beginning of most commands, they
> call refresh_index() or refresh_cache(), which checks a file's mtime
> against one stored in index (different means updated). In the worst
> scenario, refresh_cache_ent may call ce_compare_data(), which computes
> SHA-1 of the specified file and compare it with one stored in index.
>

This means working dir will compare each entry in index on mtime
field, to find out  if it's updated. The complexity for this operation
is O(nlogn). I assume the way of this checking is: it loops through
entries in the index, for each entry, it searches in working dir and
compare the mtime.

Because current index is a single steam of file, when it writes back
it has to write back everything sequentially. So we have to do
checksum for every entry. And I suppose this process is more time
consuming than previous step.

If we use a tree format, still, when looking for updated files, time
complexity is O(nlogn), i.e. we traverse the index entries and for
each entry we refer back to working dir. However, when we write index
back, we only need to recompute and write updated file nodes, but not
all entries. Total processing time benefit from here.

Please correct me if I am wrong.

-Elton


>> 4. When does git insert to cache tree? and when it retrieve from it?
>
> cache-tree is built from scratch in some cases, when we know HEAD (or
> some tree) matches index exactly (e.g. reset --hard). Usually it's
> only built up at commit time (update_main_cache_tree in
> builtin/commit.c).
> --
> Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-23 10:27         ` elton sky
@ 2012-03-23 11:24           ` Nguyen Thai Ngoc Duy
       [not found]             ` <CAKTdtZmLOzAgG0uCDcVr+O41XPX-XnoVZjsZWPN-BLjq2oG-7A@mail.gmail.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-23 11:24 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List, Thomas Rast

On Fri, Mar 23, 2012 at 5:27 PM, elton sky <eltonsky9404@gmail.com> wrote:
> On Fri, Mar 23, 2012 at 12:30 PM, Nguyen Thai Ngoc Duy
> <pclouds@gmail.com> wrote:
>> On Fri, Mar 23, 2012 at 3:32 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>> 3. how does git track updated files? Does it compare the ts between
>>> working dir and index ? Or they are recorded somewhere?
>>
>> Check out refresh_cache_ent. At the beginning of most commands, they
>> call refresh_index() or refresh_cache(), which checks a file's mtime
>> against one stored in index (different means updated). In the worst
>> scenario, refresh_cache_ent may call ce_compare_data(), which computes
>> SHA-1 of the specified file and compare it with one stored in index.
>>
>
> This means working dir will compare each entry in index on mtime
> field, to find out  if it's updated. The complexity for this operation
> is O(nlogn). I assume the way of this checking is: it loops through
> entries in the index, for each entry, it searches in working dir and
> compare the mtime.
>
> Because current index is a single steam of file, when it writes back
> it has to write back everything sequentially. So we have to do
> checksum for every entry. And I suppose this process is more time
> consuming than previous step.

The previous step is pretty fast on Linux in hot cache case. I don't
think we need to care about that. Some commands only care a
subdirectory (for example "git diff -- path/to/here") and only refresh
entries within that subdirectory, which further reduces refresh cost.

Which reminds me, we cannot abandon current index format. Users should
be allowed to choose which format to use. It may be hard to keep the
code support two formats while still taking advantage of the new one.
Maybe you could internally convert old format to new one in memory so
that git code only has to deal with one format, but that adds more
cost on using old format. I don't know..

> If we use a tree format, still, when looking for updated files, time
> complexity is O(nlogn), i.e. we traverse the index entries and for
> each entry we refer back to working dir. However, when we write index
> back, we only need to recompute and write updated file nodes, but not
> all entries. Total processing time benefit from here.

Yes. And if you compute checksum per-entry, not as a whole file, then
when you read an entry, you only need to verify checksum of that entry
(and index header of course). That reduces reading cost. But that
increases space (20-byte per entry if you stick with SHA-1). Maybe we
could do checksum per group (or tree) instead as a trade off between
space/time.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
       [not found]             ` <CAKTdtZmLOzAgG0uCDcVr+O41XPX-XnoVZjsZWPN-BLjq2oG-7A@mail.gmail.com>
@ 2012-03-24  8:58               ` Nguyen Thai Ngoc Duy
       [not found]                 ` <CAKTdtZkpjVaBSkcieojKj+V7WztT3UDzjGfXyghY=S8mq+X9zw@mail.gmail.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-24  8:58 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List, Thomas Rast

On Sat, Mar 24, 2012 at 2:50 PM, elton sky <eltonsky9404@gmail.com> wrote:
> Thanks again Nguyen,
>
>> Which reminds me, we cannot abandon current index format. Users should
>> be allowed to choose which format to use. It may be hard to keep the
>> code support two formats while still taking advantage of the new one.
>> Maybe you could internally convert old format to new one in memory so
>> that git code only has to deal with one format, but that adds more
>> cost on using old format. I don't know..
>
> I understand we should allow user to switch between old & new format.
> But I guess that should only happens when user init a working dir,
> isn't it? Otherwise I have to transform them back n forth. If a user
> chooses to use old format, I assume their repository is not large, so
> there should not be big delay for new format.

Users may choose to stick with old format because other git tools rely
on that (or they want to use older git versions at the same time).
Note that current index works "fine" with ~50k files in working
directory. Not huge, but not small either. Overhead on old format
should be reasonable.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
       [not found]                   ` <CACsJy8D85thmK_5jLC7MxJtsitLr=zphKiw2miwPu7Exf7ty=Q@mail.gmail.com>
@ 2012-03-26 12:36                     ` elton sky
  2012-03-26 12:41                       ` elton sky
  2012-03-26 14:28                       ` Thomas Rast
  0 siblings, 2 replies; 33+ messages in thread
From: elton sky @ 2012-03-26 12:36 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy, Git Mailing List

Hi Nguyen,

On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
<pclouds@gmail.com> wrote:
> (I think this should be on git@vger as there are many experienced devs there)
>
> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@gmail.com> wrote:
>> About the new format:
>>
>> The index is a single file. Entries in the index still stored
>> sequentially as old format. The difference is they are grouped into
>> blocks. A block contains many entries and they are ordered by names.
>> Blocks are also ordered by the name of the first entry. Each block
>> contains a sha1 for entries in it.
>
> If I remove an entry in the first block, because blocks are of fixed
> size, you would need to shift all entries up by one, thus update all
> blocks?
>

We need some GC here. I am not moving all blocks. Rather I would
consider merge or recycle the block. In a simple case if a block
becomes empty, I ll change the offset of new block in the header point
to this block, and make this block points to the original offset of
new block. In this way, I keep the list of empty blocks I can reuse.
If a block is not empty but not full, we better merge it with adjacent
block for efficiency. But I don't know an light way to handle that
when refreshing index. But we can always run a background thread to
rebuild the index, at some stage, while system is quiet.

> Also note the sequence format means duplication because we always
> store full path.

You are right, this keeps the size of the index as current system. Use
a tree will save disk space for sure. Need think more about this.

> --
> Duy

Attach my previous email here:

The goal of this project is :
1. verify checksum for only necessary part of index
2. keep the time complexity of most git-app below logn

About the new format:

The index is a single file. Entries in the index still stored
sequentially as old format. The difference is they are grouped into
blocks. A block contains many entries and they are ordered by names.
Blocks are also ordered by the name of the first entry. Each block
contains a sha1 for entries in it.
For using a binary search to locate the block for an entry, the
offsets of blocks are stored in the header of the index. We reserve
100 spaces for block offsets in the header. More offsets are stored in
a meta block (see below) afterwards. An offset of the first meta block
is stored.
The checksum is computed on block. After we locate the block, the
checksum is recomputed for the block. And only the this block will be
read and write back later. As the block is read into ram, it is easy
to do a binary search for entries in a block when they are in ram.
When the index doesn't have many entries, it works very similar with
current format. When more entries git-added, blocks will come into
play.

Format:

Head:
* 4-byte signature
* 4-byte version num
* 4-byte num of entries blocks
* 4-byte offset for new block
* list of offsets for blocks (e.g. 96, 14096, 8192, ..) : For binary
search. Each offset is 8 bytes, we reserve 100 x 4 = 400 bytes for
first 100 blocks. More offsets (if applicable) will be stored in a
meta blocks.
* 4-byte offset to the first meta block
* 20-byte sha1 for above and meta blocks

List of Blocks:
* sha1 for all entries
* list of entries

Meta block:
* offset to next meta block
* list of offsets

Extensions:
      TBD. Have not hacked cache tree yet. Need more knowledge of cache tree...


Block Split & Delete:
      TBD.


-Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 12:36                     ` elton sky
@ 2012-03-26 12:41                       ` elton sky
  2012-03-26 14:28                       ` Thomas Rast
  1 sibling, 0 replies; 33+ messages in thread
From: elton sky @ 2012-03-26 12:41 UTC (permalink / raw)
  To: Git Mailing List

As the previous email is hidden in the trimmed area, just resend it:


About the new format:

The index is a single file. Entries in the index still stored
sequentially as old format. The difference is they are grouped into
blocks. A block contains many entries and they are ordered by names.
Blocks are also ordered by the name of the first entry. Each block
contains a sha1 for entries in it.
For using a binary search to locate the block for an entry, the
offsets of blocks are stored in the header of the index. We reserve
100 spaces for block offsets in the header. More offsets are stored in
a meta block (see below) afterwards. An offset of the first meta block
is stored.
The checksum is computed on block. After we locate the block, the
checksum is recomputed for the block. And only the this block will be
read and write back later. As the block is read into ram, it is easy
to do a binary search for entries in a block when they are in ram.
When the index doesn't have many entries, it works very similar with
current format. When more entries git-added, blocks will come into
play.

Format:

Head:
- 4-byte signature
- 4-byte version num
- 4-byte num of entries blocks
- 4-byte offset for new block
- list of offsets for blocks (e.g. 96, 14096, 8192, ..) : For binary
search. Each offset is 8 bytes, we reserve 100 x 4 = 400 bytes for
first 100 blocks. More offsets (if applicable) will be stored in a
meta blocks.
- 4-byte offset to the first meta block
- 20-byte sha1 for above and meta blocks

List of Blocks:
- sha1 for all entries
- list of entries

Meta block:
- offset to next meta block
- list of offsets

Extensions:
      TBD. Have not hacked cache tree yet. Need more knowledge of cache tree...


Block Split & Delete:
      TBD.

Regards,
Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 12:36                     ` elton sky
  2012-03-26 12:41                       ` elton sky
@ 2012-03-26 14:28                       ` Thomas Rast
  2012-03-26 15:25                         ` Nguyen Thai Ngoc Duy
  2012-03-26 16:19                         ` Nguyen Thai Ngoc Duy
  1 sibling, 2 replies; 33+ messages in thread
From: Thomas Rast @ 2012-03-26 14:28 UTC (permalink / raw)
  To: elton sky; +Cc: Nguyen Thai Ngoc Duy, Git Mailing List

elton sky <eltonsky9404@gmail.com> writes:

> On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
> <pclouds@gmail.com> wrote:
>> (I think this should be on git@vger as there are many experienced devs there)
>>
>> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>> About the new format:
>>>
>>> The index is a single file. Entries in the index still stored
>>> sequentially as old format. The difference is they are grouped into
>>> blocks. A block contains many entries and they are ordered by names.
>>> Blocks are also ordered by the name of the first entry. Each block
>>> contains a sha1 for entries in it.
>>
>> If I remove an entry in the first block, because blocks are of fixed
>> size, you would need to shift all entries up by one, thus update all
>> blocks?
>
> We need some GC here. I am not moving all blocks. Rather I would
> consider merge or recycle the block. In a simple case if a block
> becomes empty, I ll change the offset of new block in the header point
> to this block, and make this block points to the original offset of
> new block. In this way, I keep the list of empty blocks I can reuse.
[...]

Doesn't that venture into database land?

If we go that far, wouldn't it be better to use a proper database
library?  All other things being equal, writing such complex code from
scratch is probably not a good idea.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 14:28                       ` Thomas Rast
@ 2012-03-26 15:25                         ` Nguyen Thai Ngoc Duy
  2012-03-26 16:08                           ` Shawn Pearce
  2012-03-26 16:19                         ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-26 15:25 UTC (permalink / raw)
  To: Thomas Rast; +Cc: elton sky, Git Mailing List

On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> elton sky <eltonsky9404@gmail.com> writes:
>
>> On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
>> <pclouds@gmail.com> wrote:
>>> (I think this should be on git@vger as there are many experienced devs there)
>>>
>>> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>>> About the new format:
>>>>
>>>> The index is a single file. Entries in the index still stored
>>>> sequentially as old format. The difference is they are grouped into
>>>> blocks. A block contains many entries and they are ordered by names.
>>>> Blocks are also ordered by the name of the first entry. Each block
>>>> contains a sha1 for entries in it.
>>>
>>> If I remove an entry in the first block, because blocks are of fixed
>>> size, you would need to shift all entries up by one, thus update all
>>> blocks?
>>
>> We need some GC here. I am not moving all blocks. Rather I would
>> consider merge or recycle the block. In a simple case if a block
>> becomes empty, I ll change the offset of new block in the header point
>> to this block, and make this block points to the original offset of
>> new block. In this way, I keep the list of empty blocks I can reuse.
> [...]
>
> Doesn't that venture into database land?
>
> If we go that far, wouldn't it be better to use a proper database
> library?  All other things being equal, writing such complex code from
> scratch is probably not a good idea.

If there's a library that fits our needs (including linking
statically). I think we've come close to sqlite file format [1]. But
sqlite comes with sql engine, transactional updates... that we don't
need. Another obvious source for inspiration is file systems, but I
dare not go that way.

[1] http://www.sqlite.org/fileformat2.html
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 15:25                         ` Nguyen Thai Ngoc Duy
@ 2012-03-26 16:08                           ` Shawn Pearce
  2012-03-27  2:49                             ` elton sky
  2012-03-27  6:31                             ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 33+ messages in thread
From: Shawn Pearce @ 2012-03-26 16:08 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Thomas Rast, elton sky, Git Mailing List

On Mon, Mar 26, 2012 at 08:25, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>> elton sky <eltonsky9404@gmail.com> writes:
>>
>>> On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
>>> <pclouds@gmail.com> wrote:
>>>> (I think this should be on git@vger as there are many experienced devs there)
>>>>
>>>> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>>>> About the new format:
>>>>>
>>>>> The index is a single file. Entries in the index still stored
>>>>> sequentially as old format. The difference is they are grouped into
>>>>> blocks. A block contains many entries and they are ordered by names.
>>>>> Blocks are also ordered by the name of the first entry. Each block
>>>>> contains a sha1 for entries in it.
>>>>
>>>> If I remove an entry in the first block, because blocks are of fixed
>>>> size, you would need to shift all entries up by one, thus update all
>>>> blocks?
>>>
>>> We need some GC here. I am not moving all blocks. Rather I would
>>> consider merge or recycle the block. In a simple case if a block
>>> becomes empty, I ll change the offset of new block in the header point
>>> to this block, and make this block points to the original offset of
>>> new block. In this way, I keep the list of empty blocks I can reuse.
>> [...]
>>
>> Doesn't that venture into database land?
>>
>> If we go that far, wouldn't it be better to use a proper database
>> library?  All other things being equal, writing such complex code from
>> scratch is probably not a good idea.
>
> If there's a library that fits our needs (including linking
> statically). I think we've come close to sqlite file format [1]. But
> sqlite comes with sql engine, transactional updates... that we don't
> need. Another obvious source for inspiration is file systems, but I
> dare not go that way.
>
> [1] http://www.sqlite.org/fileformat2.html

Or use LevelDb[2]. Its BSD license. Uses an immutable file format, but
writes updates to new smaller files and eventually collapses
everything back together into a bigger file. This can be a
dramatically simpler approach than dealing with your own free block
system inside of a single file. Its only real downside is needing to
periodically pay a penalty to rewrite the whole index. But this
rewrite is going to be faster than the time it takes to rewrite the
pack files for the same repository, which git gc or git repack
handles. So I don't think its actually a problem for the index.

You might even be able to take a two level approach to compacting the
LevelDb database (or something like it). In a minor compaction you
compact all of the files except the huge base file, leaving you with 2
files. A huge base file that contains the first tree the user checked
out, and a second smaller file containing any differences they have
since the initial checkout (this may just be updated stat data for a
handful of files that differed across two branches as they switched
back and forth). During a git gc or git repack, add a new stage to
collapse the base file and everything else into a single new base file
as a major compaction.

[2] http://code.google.com/p/leveldb/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 14:28                       ` Thomas Rast
  2012-03-26 15:25                         ` Nguyen Thai Ngoc Duy
@ 2012-03-26 16:19                         ` Nguyen Thai Ngoc Duy
  2012-03-27  3:20                           ` elton sky
  1 sibling, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-26 16:19 UTC (permalink / raw)
  To: Thomas Rast; +Cc: elton sky, Git Mailing List

On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> Doesn't that venture into database land?

How about this (a bit like memory management). Maybe it's simpler than
a database and fits us better.

The header consists of crc32 and three uint32_t, one points to the
root tree, one the first extension block, the last one the free list
at the end of the file. The rest of the file contains sizable blocks.
There can be free space between them. Free spaces (offset and size)
are recorded at the end of the file, pointed in header. The header's
crc32 covers the header and free list.

When we need a new block, we look up in free list. If we cannot find a
suitable space, we append to the end of the file (moving free list
further to keep it always the end of the file). Removing a block means
marking it in free list. We only truncate if there is free space at
the end. Operations that we know will scratch the whole index are our
opportunity to rewrite the index and make it compact again. No random
garbage collection (iow disk is cheap).

A block starts with a signature (a tree block, or an extension...). A
tree block consists of:

 - uint32_t tree object's size
 - sha-1 of tree object
 - crc32 of the rest of the block except tree object
 - maybe reference counter of a block can be refered by many blocks??
 - tree object (i.e. something that tree-walk.c can parse)
 - other index attributes, stored separately in the same order as in
tree object above, uint32_t block offset of subdirectories.

An extension block basically consists of what we have now in an
extension plus uint32_t offset to the next extension block, so we can
keep track of all extensions. crc32 is used for extension blocks.

This way we only need to verify checksum of the header (and free list)
and blocks we visit. We don't need cache-tree extension because it's
part of the format. There will be headache with unpack-trees.c because
of entry order change. But in the end we would use the same order tree
objects are using now, much simpler for us.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 16:08                           ` Shawn Pearce
@ 2012-03-27  2:49                             ` elton sky
  2012-03-27  3:34                               ` David Barr
  2012-03-27  6:31                             ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 33+ messages in thread
From: elton sky @ 2012-03-27  2:49 UTC (permalink / raw)
  To: Shawn Pearce, Git Mailing List

Thanks Shawn,

> Or use LevelDb[2]. Its BSD license. Uses an immutable file format, but
> writes updates to new smaller files and eventually collapses
> everything back together into a bigger file. This can be a
> dramatically simpler approach than dealing with your own free block
> system inside of a single file. Its only real downside is needing to
> periodically pay a penalty to rewrite the whole index. But this
> rewrite is going to be faster than the time it takes to rewrite the
> pack files for the same repository, which git gc or git repack
> handles. So I don't think its actually a problem for the index.
>
> You might even be able to take a two level approach to compacting the
> LevelDb database (or something like it). In a minor compaction you
> compact all of the files except the huge base file, leaving you with 2
> files. A huge base file that contains the first tree the user checked
> out, and a second smaller file containing any differences they have
> since the initial checkout (this may just be updated stat data for a
> handful of files that differed across two branches as they switched
> back and forth). During a git gc or git repack, add a new stage to
> collapse the base file and everything else into a single new base file
> as a major compaction.
>
> [2] http://code.google.com/p/leveldb/

I don't know leveldb, but like to have a look.
Just realize this solution is kinda popular. HDFS also uses the
similar image file with edit file format for its file block index.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 16:19                         ` Nguyen Thai Ngoc Duy
@ 2012-03-27  3:20                           ` elton sky
  2012-03-27  6:43                             ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-03-27  3:20 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List

Hi Nguyen,

Thanks for the idea. just a few questions

On Tue, Mar 27, 2012 at 3:19 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>> Doesn't that venture into database land?
>
> How about this (a bit like memory management). Maybe it's simpler than
> a database and fits us better.
>
> The header consists of crc32 and three uint32_t, one points to the
> root tree, one the first extension block, the last one the free list
> at the end of the file. The rest of the file contains sizable blocks.
> There can be free space between them. Free spaces (offset and size)
> are recorded at the end of the file, pointed in header. The header's
> crc32 covers the header and free list.

How do you record free spaces at the end of file? Are you gonna have a
fixed size for the index and reserve space for free spaces offsets.

>
> When we need a new block, we look up in free list. If we cannot find a
> suitable space, we append to the end of the file (moving free list
> further to keep it always the end of the file). Removing a block means
> marking it in free list. We only truncate if there is free space at
> the end. Operations that we know will scratch the whole index are our
> opportunity to rewrite the index and make it compact again. No random
> garbage collection (iow disk is cheap).
>

I agree with you. Maybe we just ignore free spaces in the index and
let a background thread to compact it.

> A block starts with a signature (a tree block, or an extension...). A
> tree block consists of:
>
>  - uint32_t tree object's size
>  - sha-1 of tree object
>  - crc32 of the rest of the block except tree object
>  - maybe reference counter of a block can be refered by many blocks??
>  - tree object (i.e. something that tree-walk.c can parse)

Do you mean each block contains a tree and all its blobs? So the tree
object here, effectively a dir, also contains files in the dir ? In
this way, some blocks can be very big.

>  - other index attributes, stored separately in the same order as in
> tree object above, uint32_t block offset of subdirectories.

There can be many sub dirs some times. But maybe not a prob.

As tree object and  offset of subdirectories are variables, how do you
make a block resizable?

>
> An extension block basically consists of what we have now in an
> extension plus uint32_t offset to the next extension block, so we can
> keep track of all extensions. crc32 is used for extension blocks.
>
> This way we only need to verify checksum of the header (and free list)
> and blocks we visit. We don't need cache-tree extension because it's
> part of the format. There will be headache with unpack-trees.c because
> of entry order change. But in the end we would use the same order tree
> objects are using now, much simpler for us.
> --
> Duy

Cheers,
Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-27  2:49                             ` elton sky
@ 2012-03-27  3:34                               ` David Barr
  2012-03-27  6:33                                 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 33+ messages in thread
From: David Barr @ 2012-03-27  3:34 UTC (permalink / raw)
  To: elton sky; +Cc: Shawn Pearce, Git Mailing List

On Tue, Mar 27, 2012 at 1:49 PM, elton sky <eltonsky9404@gmail.com> wrote:
> Thanks Shawn,
>
>> Or use LevelDb[2]. Its BSD license. Uses an immutable file format, but
>> writes updates to new smaller files and eventually collapses
>> everything back together into a bigger file. This can be a
>> dramatically simpler approach than dealing with your own free block
>> system inside of a single file. Its only real downside is needing to
>> periodically pay a penalty to rewrite the whole index. But this
>> rewrite is going to be faster than the time it takes to rewrite the
>> pack files for the same repository, which git gc or git repack
>> handles. So I don't think its actually a problem for the index.
>>
>> You might even be able to take a two level approach to compacting the
>> LevelDb database (or something like it). In a minor compaction you
>> compact all of the files except the huge base file, leaving you with 2
>> files. A huge base file that contains the first tree the user checked
>> out, and a second smaller file containing any differences they have
>> since the initial checkout (this may just be updated stat data for a
>> handful of files that differed across two branches as they switched
>> back and forth). During a git gc or git repack, add a new stage to
>> collapse the base file and everything else into a single new base file
>> as a major compaction.
>>
>> [2] http://code.google.com/p/leveldb/
>
> I don't know leveldb, but like to have a look.
> Just realize this solution is kinda popular. HDFS also uses the
> similar image file with edit file format for its file block index.

Another implementation in this general class is TinyCDB[1].
It is <1600 lines of plain C. Too few to be complete?
It is a derivative of DJB's CDB[2].

[1] http://www.corpit.ru/mjt/tinycdb.html
[2] http://cr.yp.to/cdb.html
--
David Barr

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-26 16:08                           ` Shawn Pearce
  2012-03-27  2:49                             ` elton sky
@ 2012-03-27  6:31                             ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-27  6:31 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Thomas Rast, elton sky, Git Mailing List

On Mon, Mar 26, 2012 at 11:08 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> [1] http://www.sqlite.org/fileformat2.html
>
> Or use LevelDb[2]. Its BSD license. Uses an immutable file format, but
> writes updates to new smaller files and eventually collapses
> everything back together into a bigger file. This can be a
> dramatically simpler approach than dealing with your own free block
> system inside of a single file. Its only real downside is needing to
> periodically pay a penalty to rewrite the whole index. But this
> rewrite is going to be faster than the time it takes to rewrite the
> pack files for the same repository, which git gc or git repack
> handles. So I don't think its actually a problem for the index.

Cool. I had an experiment with it. A database is created where are
keys  `git ls-files` on linux-2.6. A few things after the experiment:

 - we need to link to libstdc++.so. I still hope to avoid any new
runtime dependencies
 - I use gettimeofday to time some operations. On linux-2.6,
read_cache() costs 27ms. leveldb_open() alone takes 90ms. Iterating
over all keys takes ~200ms.

Performance wise it does not look very good but maybe I'm just not
doing it right.

> [2] http://code.google.com/p/leveldb/
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-27  3:34                               ` David Barr
@ 2012-03-27  6:33                                 ` Nguyen Thai Ngoc Duy
  2012-03-29  9:45                                   ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-27  6:33 UTC (permalink / raw)
  To: David Barr; +Cc: elton sky, Shawn Pearce, Git Mailing List

On Tue, Mar 27, 2012 at 10:34 AM, David Barr <davidbarr@google.com> wrote:
> Another implementation in this general class is TinyCDB[1].
> It is <1600 lines of plain C. Too few to be complete?
> It is a derivative of DJB's CDB[2].
>
> [1] http://www.corpit.ru/mjt/tinycdb.html

"CDB is a constant database, that is, it cannot be updated at a
runtime, only rebuilt.". It does not sound promising to me. I have not
read the description carefully though.

> [2] http://cr.yp.to/cdb.html
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-27  3:20                           ` elton sky
@ 2012-03-27  6:43                             ` Nguyen Thai Ngoc Duy
  2012-04-02 11:50                               ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-27  6:43 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List

On Tue, Mar 27, 2012 at 10:20 AM, elton sky <eltonsky9404@gmail.com> wrote:
> On Tue, Mar 27, 2012 at 3:19 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> The header consists of crc32 and three uint32_t, one points to the
>> root tree, one the first extension block, the last one the free list
>> at the end of the file. The rest of the file contains sizable blocks.
>> There can be free space between them. Free spaces (offset and size)
>> are recorded at the end of the file, pointed in header. The header's
>> crc32 covers the header and free list.
>
> How do you record free spaces at the end of file? Are you gonna have a
> fixed size for the index and reserve space for free spaces offsets.

A list of (offset,size) with (0,0) to terminate. No index can shrink
or expand at will. Free list is always at the end of the index.

>> A block starts with a signature (a tree block, or an extension...). A
>> tree block consists of:
>>
>>  - uint32_t tree object's size
>>  - sha-1 of tree object
>>  - crc32 of the rest of the block except tree object
>>  - maybe reference counter of a block can be refered by many blocks??
>>  - tree object (i.e. something that tree-walk.c can parse)
>
> Do you mean each block contains a tree and all its blobs? So the tree
> object here, effectively a dir, also contains files in the dir ? In
> this way, some blocks can be very big.

No, the tree object contains pathname, mode and SHA-1 of its entries,
one level only (try "git ls-tree HEAD"). If an entry is a directory
and we have not built it yet, we won't have its sha-1, so it will be
zero (similar to invalid cache-tree).

>>  - other index attributes, stored separately in the same order as in
>> tree object above, uint32_t block offset of subdirectories.
>
> There can be many sub dirs some times. But maybe not a prob.
>
> As tree object and  offset of subdirectories are variables, how do you
> make a block resizable?

If there are free space right after it, it can be expanded. Otherwise
we need to move the block elsewhere and update its parent about its
new offset, then mark where the block was as free space.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-27  6:33                                 ` Nguyen Thai Ngoc Duy
@ 2012-03-29  9:45                                   ` Jeff King
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2012-03-29  9:45 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: David Barr, elton sky, Shawn Pearce, Git Mailing List

On Tue, Mar 27, 2012 at 01:33:33PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Mar 27, 2012 at 10:34 AM, David Barr <davidbarr@google.com> wrote:
> > Another implementation in this general class is TinyCDB[1].
> > It is <1600 lines of plain C. Too few to be complete?
> > It is a derivative of DJB's CDB[2].
> >
> > [1] http://www.corpit.ru/mjt/tinycdb.html
> 
> "CDB is a constant database, that is, it cannot be updated at a
> runtime, only rebuilt.". It does not sound promising to me. I have not
> read the description carefully though.

No, you are right. I did some work with cdb many years ago. It optimizes
for lookup by spending time building an optimal hash table at generation
time. There is no way to add or modify an entry short of rewriting the
complete contents of the database, which is exactly what we are trying
to get away from with the current index format.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-03-27  6:43                             ` Nguyen Thai Ngoc Duy
@ 2012-04-02 11:50                               ` elton sky
  2012-04-02 12:31                                 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-04-02 11:50 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List

Hi Nguyen,

Still have some questions on your idea:

On Tuesday, March 27, 2012, Nguyen Thai Ngoc Duy wrote:
>
> On Tue, Mar 27, 2012 at 10:20 AM, elton sky <eltonsky9404@gmail.com> wrote:
> > On Tue, Mar 27, 2012 at 3:19 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> >> The header consists of crc32 and three uint32_t, one points to the
> >> root tree, one the first extension block, the last one the free list
> >> at the end of the file. The rest of the file contains sizable blocks.
> >> There can be free space between them. Free spaces (offset and size)
> >> are recorded at the end of the file, pointed in header. The header's
> >> crc32 covers the header and free list.
> >
> > How do you record free spaces at the end of file? Are you gonna have a
> > fixed size for the index and reserve space for free spaces offsets.
>
> A list of (offset,size) with (0,0) to terminate. No index can shrink
> or expand at will. Free list is always at the end of the index.
>
> >> A block starts with a signature (a tree block, or an extension...). A
> >> tree block consists of:
> >>
> >>  - uint32_t tree object's size
> >>  - sha-1 of tree object
> >>  - crc32 of the rest of the block except tree object
> >>  - maybe reference counter of a block can be refered by many blocks??
> >>  - tree object (i.e. something that tree-walk.c can parse)
> >
> > Do you mean each block contains a tree and all its blobs? So the tree
> > object here, effectively a dir, also contains files in the dir ? In
> > this way, some blocks can be very big.
>
> No, the tree object contains pathname, mode and SHA-1 of its entries,
> one level only (try "git ls-tree HEAD"). If an entry is a directory
> and we have not built it yet, we won't have its sha-1, so it will be
> zero (similar to invalid cache-tree).
>

Correct me if I am wrong, I assume:
* Although you only listed attributes in tree block, in index, we have
both tree and blob block.
* Sha1 of a tree block is computed by hashing all the Sha1s of blobs
and sub trees in the directory.
* Like cache tree struct, each tree object contains list of offsets to
children blocks.

How do you store blob blocks ? Is each blob object a block? just like
a tree object is a block? If so, each blob contains a sha1, the
overhead is high?

If we compute hash for a tree, and this tree happen to have 500
children do I have to access all 500 blocks to get their Sha1s?


>
> >>  - other index attributes, stored separately in the same order as in
> >> tree object above, uint32_t block offset of subdirectories.
> >
> > There can be many sub dirs some times. But maybe not a prob.
> >
> > As tree object and  offset of subdirectories are variables, how do you
> > make a block resizable?
>
> If there are free space right after it, it can be expanded. Otherwise
> we need to move the block elsewhere and update its parent about its
> new offset, then mark where the block was as free space.
> --
> Duy



Also I ran some quick test on git-add over kernel 2.6. When I do "time
git add .":time git add .

cmd_add: validate_pathspec takes : 0 ms
read_index_from: xmmap&close takes : 0 ms
read_index_from: verify_hdr takes : 26 ms
read_index_from: create inmem struct takes : 4 ms
read_index: read_index_from takes : 31 ms
read_directory: qsort takes : 0 ms
fill_directory: read_directory takes : 97 ms
cmd_add: prune dir takes : 0 ms
cmd_add: add_files_to_cache takes : 37 ms
cmd_add: add_files takes : 0 ms

real 0m0.172s
user 0m0.120s
sys 0m0.050s

And when I ran "time git add arch/ia64" :

cmd_add: validate_pathspec takes : 0 ms
read_index_from: xmmap&close takes : 0 ms
read_index_from: verify_hdr takes : 20 ms
read_index_from: create inmem struct takes : 4 ms
read_index: read_index_from takes : 25 ms
read_directory: read_directory_recursive takes : 10 ms
read_directory: qsort takes : 0 ms
fill_directory: read_directory takes : 10 ms
cmd_add: fill_directory takes : 10 ms
cmd_add: prune dir takes : 0 ms
cmd_add: add_files_to_cache takes : 1 ms

real 0m0.043s
user 0m0.040s
sys 0m0.000s

In both cases, the time for sha1 is quite stable (~20ms).
fill_directory drops as I specify a sub directory, which make sense.
As Junio suggested, the sha1 time (verify_hdr) is a mix a read and
sha1. And this part is our focus to optimize, isn't it? But, as growth
of the whole repo, the processing time is getting dominated by
fill_directory (if we use '.', it takes 97ms) rather than verify_hdr.
In current system, The time complexity for fill_directory is nlogn (n
is number of objects). git recursively go thru sub directories and
files in it and check against current index. When searching index, it
uses binary search which makes it lg(n). If this is the case, will use
a producer/consumer model help?

Cheers,
Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-02 11:50                               ` elton sky
@ 2012-04-02 12:31                                 ` Nguyen Thai Ngoc Duy
  2012-04-02 14:27                                   ` Shawn Pearce
  2012-04-04  8:26                                   ` elton sky
  0 siblings, 2 replies; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-02 12:31 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List, Thomas Rast

On Mon, Apr 02, 2012 at 09:50:53PM +1000, elton sky wrote:
> Hi Nguyen,
> 
> Still have some questions on your idea:
> 
> On Tuesday, March 27, 2012, Nguyen Thai Ngoc Duy wrote:
> > >> A block starts with a signature (a tree block, or an extension...). A
> > >> tree block consists of:
> > >>
> > >>  - uint32_t tree object's size
> > >>  - sha-1 of tree object
> > >>  - crc32 of the rest of the block except tree object
> > >>  - maybe reference counter of a block can be refered by many blocks??
> > >>  - tree object (i.e. something that tree-walk.c can parse)
> > >
> > > Do you mean each block contains a tree and all its blobs? So the tree
> > > object here, effectively a dir, also contains files in the dir ? In
> > > this way, some blocks can be very big.
> >
> > No, the tree object contains pathname, mode and SHA-1 of its entries,
> > one level only (try "git ls-tree HEAD"). If an entry is a directory
> > and we have not built it yet, we won't have its sha-1, so it will be
> > zero (similar to invalid cache-tree).
> >
> 
> Correct me if I am wrong, I assume:
> * Although you only listed attributes in tree block, in index, we have
> both tree and blob block.

In current index, tree is implied in path names. We only store blob sha-1.

> * Sha1 of a tree block is computed by hashing all the Sha1s of blobs
> and sub trees in the directory.
> * Like cache tree struct, each tree object contains list of offsets to
> children blocks.
> 
> How do you store blob blocks ? Is each blob object a block? just like
> a tree object is a block? If so, each blob contains a sha1, the
> overhead is high?
> 
> If we compute hash for a tree, and this tree happen to have 500
> children do I have to access all 500 blocks to get their Sha1s?

We don't store blobs in index. We only need their sha-1, which is
computed and content stored in object database at "git add".

By the way, I revised my new index format a little bit, see the end of
this email. It may work, or may not. Food for thoughts.

> Also I ran some quick test on git-add over kernel 2.6. When I do "time
> git add .":time git add .
> 
> cmd_add: validate_pathspec takes : 0 ms
> read_index_from: xmmap&close takes : 0 ms
> read_index_from: verify_hdr takes : 26 ms
> read_index_from: create inmem struct takes : 4 ms
> read_index: read_index_from takes : 31 ms
> read_directory: qsort takes : 0 ms
> fill_directory: read_directory takes : 97 ms
> cmd_add: prune dir takes : 0 ms
> cmd_add: add_files_to_cache takes : 37 ms
> cmd_add: add_files takes : 0 ms
> 
> real 0m0.172s
> user 0m0.120s
> sys 0m0.050s
> 
> And when I ran "time git add arch/ia64" :
> 
> cmd_add: validate_pathspec takes : 0 ms
> read_index_from: xmmap&close takes : 0 ms
> read_index_from: verify_hdr takes : 20 ms
> read_index_from: create inmem struct takes : 4 ms
> read_index: read_index_from takes : 25 ms
> read_directory: read_directory_recursive takes : 10 ms
> read_directory: qsort takes : 0 ms
> fill_directory: read_directory takes : 10 ms
> cmd_add: fill_directory takes : 10 ms
> cmd_add: prune dir takes : 0 ms
> cmd_add: add_files_to_cache takes : 1 ms
> 
> real 0m0.043s
> user 0m0.040s
> sys 0m0.000s
> 
> In both cases, the time for sha1 is quite stable (~20ms).
> fill_directory drops as I specify a sub directory, which make sense.
> As Junio suggested, the sha1 time (verify_hdr) is a mix a read and
> sha1. And this part is our focus to optimize, isn't it?

I think so. But until we can read just parts of index, we still have
to verify integrity for the whole index, which takes more or less the
same amount of time you see and should be proportional to index size
(or the number of entries in index). Either we shrink the index, or go
with cheaper checksum, or both.

> But, as growth
> of the whole repo, the processing time is getting dominated by
> fill_directory (if we use '.', it takes 97ms) rather than verify_hdr.

{read,fill}_directory is not always used (for example, "git diff" does
not need it). Meanwhile, as working directory grows, index size grows,
verify_hdr() will take longer.

> In current system, The time complexity for fill_directory is nlogn (n
> is number of objects). git recursively go thru sub directories and
> files in it and check against current index. When searching index, it
> uses binary search which makes it lg(n). If this is the case, will use
> a producer/consumer model help?

I think fill_directory is dominated by kernel time (read_dir,
stat...), there's little thing we can do there. Anyway I'm pretty sure
fill_directory is out of scope. It's just one of the code that uses
index.

-- 8< --
GIT index format
================

This format replaces the old "DIRC" format. Compared to the old
format, which is essentially a sorted list of pathnames, this one:

 - is tree-based
 - use crc32 as checksum
 - only verify integrity on parts that git accesses, instead of whole
   file
 - append changes to the end
 - allow index versioning

Updates can be made directly to the index by appending to the end. The
index traversed by locating the root tree block from the trailer. When
a path is updated, all related tree blocks are updated and appended to
the end, then a new trailer (with generation increased by one) is
written to conclude the index.

The index size will increase continuously. At some point, we will need
to repack it. Let assume a tree block is 64k on average and a path
generally consists of 3 path components.  That means an entry update
adds 192k and we can do about 80 updates before index reaches 16M (in
addition to initial index size).

At 16M or when trailer generation hits a limit (the limit can be
configurable), we rewrite the index to reduce its size. Some heavy
operations can also be used to rewrite index, such as checkout or
reset.

The index integrity is verified by crc32. One crc32 covers header and
trailer. Each block has its own crc32. When the index is found
corrupt, we could try to roll back to latest good version by looking
for trailers from bottom up. Even when the index is not corrupt, users
can still look back this way for older index versions.

= The git index file has the following format

   - A 8-byte header consisting of

     4-byte signature:
       The signature is { 'T', 'R', 'E', 'E' }

     4-byte version number:
       The current supported versions are 1.

   - A number of blocks of variable size

      1-byte block type

      3-byte content size in byte

      block content

      4-byte crc32 of all above

   - A 18-byte trailer consisting of

      4-byte trailer signature:
        The signature is { 'R', 'O', 'O', 'T' }

      2-byte generation:
         The first trailer is 0, the second 1 and so on.

      4-byte root block offset

      4-byte extension table offset:
        Zero means no extension

      4-byte checksum:
        CRC32 of the header and the trailer (excluding this field)

== Tree block

  A tree block contains a (maybe invalid) tree object and extra
  information of its companion in working directory. Tree block has
  block type 'T'.

  Tree block content is basically the list of non-recursive entries in
  specified path, with all attributes we store in the index now. There
  are a few changes though to intergrate cache-tree and allow
  bsearch() on mmap'd block.

  A tree block content consists of

  - 4-byte tree object size

  - 20-byte SHA-1 of the cached tree object

  - a list attributes corresponding to tree object's item, in the same
    order.  These attributes are the same as in DIRC entry format
    except that entry name is removed, and a tree block offset is
    added in case the item is a directory.

    32-bit ctime seconds, the last time a file's metadata changed
      this is stat(2) data
  
    32-bit ctime nanosecond fractions
      this is stat(2) data
  
    32-bit mtime seconds, the last time a file's data changed
      this is stat(2) data
  
    32-bit mtime nanosecond fractions
      this is stat(2) data
  
    32-bit dev
      this is stat(2) data
  
    32-bit ino
      this is stat(2) data
  
    32-bit mode, split into (high to low bits)
  
      4-bit object type
        valid values in binary are 1000 (regular file), 1010 (symbolic link)
        and 1110 (gitlink)
  
      3-bit unused
  
      9-bit unix permission. Only 0755 and 0644 are valid for regular files.
      Symbolic links and gitlinks have value 0 in this field.
  
    32-bit uid
      this is stat(2) data
  
    32-bit gid
      this is stat(2) data
  
    32-bit file size
      This is the on-disk size from stat(2), truncated to 32-bit.
  
    160-bit SHA-1 for the represented object if blobs or the offset
      to another tree block if trees

    A 32-bit 'flags' field split into (high to low bits)
  
      1-bit assume-valid flag
  
      1-bit extended flag (must be zero in version 2)
  
      2-bit stage (during merge)
  
      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
      is stored in this field.
  
      1-bit skip-worktree flag (used by sparse checkout)
  
      1-bit intent-to-add flag (used by "git add -N")

      14-bit unused, must be zero

    A 16-bit offset, relative to the beginning of this block, to the
      pathname of this entry. FIXME: make it 32-bit, relative to the
      beginning of the file, so that we can reuse pathnames from other
      (old) blocks?

  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
    above. This list does not have to be of the same order as the attribute
    list. The reason this is separated from the attribute list is to make
    attribute list fixed size, searchable using bsearch().

== Extension table block

 Extension table has block type 'X'. It consists of a series of 4-byte
 extension block offset.

== Extension block

 Extension block has block type 'E'. Extension content is the same as
 in the old format.
-- 8< --
--
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-02 12:31                                 ` Nguyen Thai Ngoc Duy
@ 2012-04-02 14:27                                   ` Shawn Pearce
  2012-04-02 15:12                                     ` Nguyen Thai Ngoc Duy
  2012-04-04  8:26                                   ` elton sky
  1 sibling, 1 reply; 33+ messages in thread
From: Shawn Pearce @ 2012-04-02 14:27 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: elton sky, Git Mailing List, Thomas Rast

On Mon, Apr 2, 2012 at 08:31, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> -- 8< --
> GIT index format
> ================
>
> This format replaces the old "DIRC" format. Compared to the old
> format, which is essentially a sorted list of pathnames, this one:
>
>  - is tree-based
>  - use crc32 as checksum
>  - only verify integrity on parts that git accesses, instead of whole
>   file
>  - append changes to the end
>  - allow index versioning
>
> Updates can be made directly to the index by appending to the end. The
> index traversed by locating the root tree block from the trailer. When
> a path is updated, all related tree blocks are updated and appended to
> the end, then a new trailer (with generation increased by one) is
> written to conclude the index.
>
> The index size will increase continuously. At some point, we will need
> to repack it. Let assume a tree block is 64k on average and a path
> generally consists of 3 path components.  That means an entry update
> adds 192k and we can do about 80 updates before index reaches 16M (in
> addition to initial index size).

Only 3 path components? Java sources can easily have 8-10 with a long
Maven and Java package implied prefix. This will increase the
frequency of rewrites of the index file.

> At 16M or when trailer generation hits a limit (the limit can be
> configurable), we rewrite the index to reduce its size. Some heavy
> operations can also be used to rewrite index, such as checkout or
> reset.
>
> The index integrity is verified by crc32. One crc32 covers header and
> trailer. Each block has its own crc32. When the index is found
> corrupt, we could try to roll back to latest good version by looking
> for trailers from bottom up. Even when the index is not corrupt, users
> can still look back this way for older index versions.

How do you deal with a partially written append to the index file?
E.g. if a prior update crashes or the filesystem doesn't write
everything out before power failure, you need to find the last good
trailer block in the file.

> = The git index file has the following format
>
>   - A 8-byte header consisting of
>
>     4-byte signature:
>       The signature is { 'T', 'R', 'E', 'E' }
>
>     4-byte version number:
>       The current supported versions are 1.

Why not DIRC version 4?

>   - A number of blocks of variable size
>
>      1-byte block type
>
>      3-byte content size in byte
>
>      block content

So you are limiting the size of a canonical tree now? Currently there
is no limit on the size a tree. But here the entire index structure
plus set of names must be under 16 MiB. Granted no project probably
hits that limit, but you are painting us into a corner with an upper
limit here that doesn't look like it will be easy to increase.

>      4-byte crc32 of all above
>
>   - A 18-byte trailer consisting of
>
>      4-byte trailer signature:
>        The signature is { 'R', 'O', 'O', 'T' }
>
>      2-byte generation:
>         The first trailer is 0, the second 1 and so on.
>
>      4-byte root block offset
>
>      4-byte extension table offset:
>        Zero means no extension
>
>      4-byte checksum:
>        CRC32 of the header and the trailer (excluding this field)

See above my question about how to find the last good trailer if the
last append attempt was incomplete.

> == Tree block
>
>  A tree block contains a (maybe invalid) tree object and extra
>  information of its companion in working directory. Tree block has
>  block type 'T'.
>
>  Tree block content is basically the list of non-recursive entries in
>  specified path, with all attributes we store in the index now. There
>  are a few changes though to intergrate cache-tree and allow
>  bsearch() on mmap'd block.
>
>  A tree block content consists of
>
>  - 4-byte tree object size
>
>  - 20-byte SHA-1 of the cached tree object
>
>  - a list attributes corresponding to tree object's item, in the same
>    order.  These attributes are the same as in DIRC entry format
>    except that entry name is removed, and a tree block offset is
>    added in case the item is a directory.
>
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
>    32-bit ctime nanosecond fractions
>      this is stat(2) data
>
>    32-bit mtime seconds, the last time a file's data changed
>      this is stat(2) data
>
>    32-bit mtime nanosecond fractions
>      this is stat(2) data
>
>    32-bit dev
>      this is stat(2) data
>
>    32-bit ino
>      this is stat(2) data
>
>    32-bit mode, split into (high to low bits)
>
>      4-bit object type
>        valid values in binary are 1000 (regular file), 1010 (symbolic link)
>        and 1110 (gitlink)
>
>      3-bit unused
>
>      9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>      Symbolic links and gitlinks have value 0 in this field.
>
>    32-bit uid
>      this is stat(2) data
>
>    32-bit gid
>      this is stat(2) data
>
>    32-bit file size
>      This is the on-disk size from stat(2), truncated to 32-bit.
>
>    160-bit SHA-1 for the represented object if blobs or the offset
>      to another tree block if trees
>
>    A 32-bit 'flags' field split into (high to low bits)
>
>      1-bit assume-valid flag
>
>      1-bit extended flag (must be zero in version 2)
>
>      2-bit stage (during merge)
>
>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>      is stored in this field.
>
>      1-bit skip-worktree flag (used by sparse checkout)
>
>      1-bit intent-to-add flag (used by "git add -N")
>
>      14-bit unused, must be zero
>
>    A 16-bit offset, relative to the beginning of this block, to the
>      pathname of this entry. FIXME: make it 32-bit, relative to the
>      beginning of the file, so that we can reuse pathnames from other
>      (old) blocks?

16 bit offset doesn't work well in a block that can be as large as 2^24.

If you reuse a path name list at the start of the file, how do you
handle new names?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-02 14:27                                   ` Shawn Pearce
@ 2012-04-02 15:12                                     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-02 15:12 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: elton sky, Git Mailing List, Thomas Rast

On Mon, Apr 2, 2012 at 9:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Apr 2, 2012 at 08:31, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> The index size will increase continuously. At some point, we will need
>> to repack it. Let assume a tree block is 64k on average and a path
>> generally consists of 3 path components.  That means an entry update
>> adds 192k and we can do about 80 updates before index reaches 16M (in
>> addition to initial index size).
>
> Only 3 path components? Java sources can easily have 8-10 with a long
> Maven and Java package implied prefix. This will increase the
> frequency of rewrites of the index file.

Yes, but in java case, users could adjust the rewrite limits to make
it less often. The index will be bigger, but because we mmap it and
only access parts of it, index size does not matter much.

>> At 16M or when trailer generation hits a limit (the limit can be
>> configurable), we rewrite the index to reduce its size. Some heavy
>> operations can also be used to rewrite index, such as checkout or
>> reset.
>>
>> The index integrity is verified by crc32. One crc32 covers header and
>> trailer. Each block has its own crc32. When the index is found
>> corrupt, we could try to roll back to latest good version by looking
>> for trailers from bottom up. Even when the index is not corrupt, users
>> can still look back this way for older index versions.
>
> How do you deal with a partially written append to the index file?
> E.g. if a prior update crashes or the filesystem doesn't write
> everything out before power failure, you need to find the last good
> trailer block in the file.

By looking for the trailer signature "ROOT" from bottom up, then
verify if it's still good (i.e. verifying all trees) from there.
Repeat until we find a good one.

>> = The git index file has the following format
>>
>>   - A 8-byte header consisting of
>>
>>     4-byte signature:
>>       The signature is { 'T', 'R', 'E', 'E' }
>>
>>     4-byte version number:
>>       The current supported versions are 1.
>
> Why not DIRC version 4?

I thought of that, but because I don't keep header format the same as
v3, I thought signature should change too. But this is really not
important at this stage.

>>   - A number of blocks of variable size
>>
>>      1-byte block type
>>
>>      3-byte content size in byte
>>
>>      block content
>
> So you are limiting the size of a canonical tree now? Currently there
> is no limit on the size a tree. But here the entire index structure
> plus set of names must be under 16 MiB. Granted no project probably
> hits that limit, but you are painting us into a corner with an upper
> limit here that doesn't look like it will be easy to increase.

We can introduce a new block type, not a nice approach though. Not
saving block size is probably ok too. We would need something to mark
end-of-block. I wanted to save block size to do crc32 quickly without
parsing the block, but instead, we could make block parsing faster and
not worry about it.

>> == Tree block
>>
>>  A tree block contains a (maybe invalid) tree object and extra
>>  information of its companion in working directory. Tree block has
>>  block type 'T'.
>>
>>  Tree block content is basically the list of non-recursive entries in
>>  specified path, with all attributes we store in the index now. There
>>  are a few changes though to intergrate cache-tree and allow
>>  bsearch() on mmap'd block.
>>
>>  A tree block content consists of
>>
>>  - 4-byte tree object size
>>
>>  - 20-byte SHA-1 of the cached tree object
>>
>>  - a list attributes corresponding to tree object's item, in the same
>>    order.  These attributes are the same as in DIRC entry format
>>    except that entry name is removed, and a tree block offset is
>>    added in case the item is a directory.
>>
>> ...
>>
>>    A 16-bit offset, relative to the beginning of this block, to the
>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>      beginning of the file, so that we can reuse pathnames from other
>>      (old) blocks?
>
> 16 bit offset doesn't work well in a block that can be as large as 2^24.

No it doesn't. That's the implication of using 16-bit offsets.

> If you reuse a path name list at the start of the file, how do you
> handle new names?

We don't drop this part even if we reuse a few path names elsewhere.
New path names can be put here. But I'm not really sure if reusing
names gains us anything because at least it breaks locality and
complicates handling code. We already save quite a bit by not
duplicating parent prefix.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-02 12:31                                 ` Nguyen Thai Ngoc Duy
  2012-04-02 14:27                                   ` Shawn Pearce
@ 2012-04-04  8:26                                   ` elton sky
  2012-04-04 12:20                                     ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 33+ messages in thread
From: elton sky @ 2012-04-04  8:26 UTC (permalink / raw)
  To: Git Mailing List, Nguyen Thai Ngoc Duy

Hi Nguyen,

A few questions,

> -- 8< --
> GIT index format
> ================
>
> This format replaces the old "DIRC" format. Compared to the old
> format, which is essentially a sorted list of pathnames, this one:
>
>  - is tree-based
>  - use crc32 as checksum
>  - only verify integrity on parts that git accesses, instead of whole
>   file
>  - append changes to the end
>  - allow index versioning
>
> Updates can be made directly to the index by appending to the end. The
> index traversed by locating the root tree block from the trailer. When
> a path is updated, all related tree blocks are updated and appended to
> the end, then a new trailer (with generation increased by one) is
> written to conclude the index.
>
> The index size will increase continuously. At some point, we will need
> to repack it. Let assume a tree block is 64k on average and a path
> generally consists of 3 path components.  That means an entry update
> adds 192k and we can do about 80 updates before index reaches 16M (in
> addition to initial index size).
>
> At 16M or when trailer generation hits a limit (the limit can be
> configurable), we rewrite the index to reduce its size. Some heavy
> operations can also be used to rewrite index, such as checkout or
> reset.
>
> The index integrity is verified by crc32. One crc32 covers header and
> trailer. Each block has its own crc32. When the index is found
> corrupt, we could try to roll back to latest good version by looking
> for trailers from bottom up. Even when the index is not corrupt, users
> can still look back this way for older index versions.
>

I am not sure how the trailer works.
I assume there can be multiple trailers, each update will generate a
new one. Every trailer will point to the root tree (i.e. all trailers
point to the same block?). So if there are some changes to root, like
rename, trailers all point to the latest root block?

Is the index looks like :
| HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
BLOCKS | TRAILER | ...

Blocks and trailers are interleaved. The index starts from a few
blocks (git add file1 file2 file3 ..) and expands as it goes. If file1
is updated, the tree block containing file1 is updated and appended.
(At this point, 2 versions of tree blocks containing file is in index
?) How do you organize these 2 block in a tree ?

Appended blocks are also a tree or just a list. If it's a list, it
needs O(n) read time. If it's like a sub tree, I assume it's small,
because I guess there won't be many changes each time. If it's too
small then lgn -> n, and in total read time -> n.

> = The git index file has the following format
>
>   - A 8-byte header consisting of
>
>     4-byte signature:
>       The signature is { 'T', 'R', 'E', 'E' }
>
>     4-byte version number:
>       The current supported versions are 1.
>
>   - A number of blocks of variable size
>
>      1-byte block type
>
>      3-byte content size in byte
>
>      block content
>
>      4-byte crc32 of all above
>
>   - A 18-byte trailer consisting of
>
>      4-byte trailer signature:
>        The signature is { 'R', 'O', 'O', 'T' }
>
>      2-byte generation:
>         The first trailer is 0, the second 1 and so on.
>
>      4-byte root block offset
>
>      4-byte extension table offset:
>        Zero means no extension
>
>      4-byte checksum:
>        CRC32 of the header and the trailer (excluding this field)
>
> == Tree block
>
>  A tree block contains a (maybe invalid) tree object and extra
>  information of its companion in working directory. Tree block has
>  block type 'T'.
>
>  Tree block content is basically the list of non-recursive entries in
>  specified path, with all attributes we store in the index now. There
>  are a few changes though to intergrate cache-tree and allow
>  bsearch() on mmap'd block.
>
>  A tree block content consists of
>
>  - 4-byte tree object size
>
>  - 20-byte SHA-1 of the cached tree object
>
>  - a list attributes corresponding to tree object's item, in the same
>    order.  These attributes are the same as in DIRC entry format
>    except that entry name is removed, and a tree block offset is
>    added in case the item is a directory.
>
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
>    32-bit ctime nanosecond fractions
>      this is stat(2) data
>
>    32-bit mtime seconds, the last time a file's data changed
>      this is stat(2) data
>
>    32-bit mtime nanosecond fractions
>      this is stat(2) data
>
>    32-bit dev
>      this is stat(2) data
>
>    32-bit ino
>      this is stat(2) data
>
>    32-bit mode, split into (high to low bits)
>
>      4-bit object type
>        valid values in binary are 1000 (regular file), 1010 (symbolic link)
>        and 1110 (gitlink)
>
>      3-bit unused
>
>      9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>      Symbolic links and gitlinks have value 0 in this field.
>
>    32-bit uid
>      this is stat(2) data
>
>    32-bit gid
>      this is stat(2) data
>
>    32-bit file size
>      This is the on-disk size from stat(2), truncated to 32-bit.
>
>    160-bit SHA-1 for the represented object if blobs or the offset
>      to another tree block if trees
>
>    A 32-bit 'flags' field split into (high to low bits)
>
>      1-bit assume-valid flag
>
>      1-bit extended flag (must be zero in version 2)
>
>      2-bit stage (during merge)
>
>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>      is stored in this field.
>
>      1-bit skip-worktree flag (used by sparse checkout)
>
>      1-bit intent-to-add flag (used by "git add -N")
>
>      14-bit unused, must be zero
>
>    A 16-bit offset, relative to the beginning of this block, to the
>      pathname of this entry. FIXME: make it 32-bit, relative to the
>      beginning of the file, so that we can reuse pathnames from other
>      (old) blocks?
>

It's nice to enable it for bsearch in a block by separate pathname.
If all names are shared by all blocks, this pathname tree will be
loaded for every operation. I guess the load&hash is expensive.

>  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
>    above. This list does not have to be of the same order as the attribute
>    list. The reason this is separated from the attribute list is to make
>    attribute list fixed size, searchable using bsearch().
>
> == Extension table block
>
>  Extension table has block type 'X'. It consists of a series of 4-byte
>  extension block offset.
>
> == Extension block
>
>  Extension block has block type 'E'. Extension content is the same as
>  in the old format.
> -- 8< --
> --
> Duy

-Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-04  8:26                                   ` elton sky
@ 2012-04-04 12:20                                     ` Nguyen Thai Ngoc Duy
  2012-04-04 16:22                                       ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-04 12:20 UTC (permalink / raw)
  To: elton sky; +Cc: Git Mailing List

On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@gmail.com> wrote:
> I am not sure how the trailer works.
> I assume there can be multiple trailers, each update will generate a
> new one. Every trailer will point to the root tree (i.e. all trailers
> point to the same block?). So if there are some changes to root, like
> rename, trailers all point to the latest root block?

Each trailer points to the whole new tree. Because trees are
immutable, changing in a tree meangs creating a new one and will also
make a new parent tree (to point to the updated tree because old
parent will always point to old tree). This eventually leads to root
tree change, recorded by the trailer.

> Is the index looks like :
> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
> BLOCKS | TRAILER | ...
>
> Blocks and trailers are interleaved. The index starts from a few
> blocks (git add file1 file2 file3 ..) and expands as it goes. If file1
> is updated, the tree block containing file1 is updated and appended.
> (At this point, 2 versions of tree blocks containing file is in index
> ?) How do you organize these 2 block in a tree ?

I leave them where they are. They will be indirectly referenced by two
different roots. At that point we have to new full trees, sharing many
subtrees except the one that contains file1 and its ancestors. This
makes it possible to access an old index version by traversing from an
older trailer. Heavy "add -p" users may like this.

> Appended blocks are also a tree or just a list. If it's a list, it
> needs O(n) read time. If it's like a sub tree, I assume it's small,
> because I guess there won't be many changes each time. If it's too
> small then lgn -> n, and in total read time -> n.

It's trees all the way down. I'm not sure why read time is related
here. You read it by traversing from root tree to leaves, no matter
old or new root. Appended trees may push trees farther away and
increase seek time. Other than that, I don't see significant read
performance degradation (really crowded trees may degrade a little bit
because we need to read trees in addition to leaves, but I don't think
it's a big problem).
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-04 12:20                                     ` Nguyen Thai Ngoc Duy
@ 2012-04-04 16:22                                       ` elton sky
  2012-04-06  3:13                                         ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-04-04 16:22 UTC (permalink / raw)
  To: Git Mailing List, Nguyen Thai Ngoc Duy

Hello,

Some updates for Nguyen's index:

On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@gmail.com> wrote:
>> I am not sure how the trailer works.
>> I assume there can be multiple trailers, each update will generate a
>> new one. Every trailer will point to the root tree (i.e. all trailers
>> point to the same block?). So if there are some changes to root, like
>> rename, trailers all point to the latest root block?
>
> Each trailer points to the whole new tree. Because trees are
> immutable, changing in a tree meangs creating a new one and will also
> make a new parent tree (to point to the updated tree because old
> parent will always point to old tree). This eventually leads to root
> tree change, recorded by the trailer.
>
>> Is the index looks like :
>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>> BLOCKS | TRAILER | ...
>>

Once an update happened to a block, all parent blocks to root will be
copied to the end of the index. And the updated block will be at leaf
of the new tree. Finding this path costs logn time anyway. This is no
harm for read for the whole tree. But in order to find all previous
changes to a block, we have to go through all trailers and trees.

Otherwise, just modify the original tree. Let the parent points to the
updated block and let the updated block points to the old block:

parent
   |
  V
updated            old                  old
block (v3)  -->   block(v2)  -->  block (v1)
   |
  V
child
blocks

A version number in a tree block is used to track the changes.
In this way, there's still no harm to read, and it's more easy to
trace the change history of a block. Also, we don't need to create
interleaved blocks and trailers. There's only one trailer in the end
of file. We also need to add another offset points to previous
version.

Trailer is kept at the end of index, as its size is variable. It
contains offset to root and list of free spaces.

Changes to format:

> = The git index file has the following format

As is, except there's only one trailer now. And trailer contains list
of free spaces.

>- A 18-byte trailer consisting of
>
>      4-byte trailer signature:
>        The signature is { 'R', 'O', 'O', 'T' }
>
>      2-byte generation:
>         The first trailer is 0, the second 1 and so on.
>
>      4-byte root block offset
>
>      4-byte extension table offset:
>        Zero means no extension
>
       list of free spaces
       - 4 byte offset
       - 2 byte length

>      4-byte checksum:
>        CRC32 of the header and the trailer (excluding this field)

Free space list is read/written in whole for each operation, together
with trailer.

>
> == Tree block
> ...

Above as is.

- 1 byte version num

- 4 byte offset to previous version block

>
>    160-bit SHA-1 for the represented object if blobs or the offset
>      to another tree block if trees
>
>    A 32-bit 'flags' field split into (high to low bits)
>
>      1-bit assume-valid flag
>
>      1-bit extended flag (must be zero in version 2)
>
>      2-bit stage (during merge)
>
>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>      is stored in this field.
>
>      1-bit skip-worktree flag (used by sparse checkout)
>
>      1-bit intent-to-add flag (used by "git add -N")
>
>      14-bit unused, must be zero
>
>    A 16-bit offset, relative to the beginning of this block, to the
>      pathname of this entry. FIXME: make it 32-bit, relative to the
>      beginning of the file, so that we can reuse pathnames from other
>      (old) blocks?
>

-Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-04 16:22                                       ` elton sky
@ 2012-04-06  3:13                                         ` elton sky
  2012-04-06  3:15                                           ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-04-06  3:13 UTC (permalink / raw)
  To: Git Mailing List, Nguyen Thai Ngoc Duy

Thank you everyone for ideas, clues, explanations and questions.
Collectively I wrote my proposal. This one is mostly based on Duy's
suggestion with minor changes.


Problem:
Current index store all files in a list. This implies that:
1. Each operation has to read the whole index and write the whole index back.
2. computing the checksum for the whole index.
These become expensive when the repo is large.

Requirement:
Suppose n is the number objects in a repo
-Keep the read/write time <= O(logn).
-Keep the checksum computed against only necessary objects.
-New format is easy to parse.
-Backward compatible.
-Potentially use faster hash method.

Proposed solution:

Store the repo structure in a canonical tree. Each directory is a tree
block. A tree block contains blobs and offsets to sub directories. It
has its own checksum. A read/write will be done on tree block base. A
tree block also contains an offset points to its previous version (if
there's one).

The root of offset is stored in trailer, which stays at the end of
file. Each update creates a new trailer which points to the new tree.
(details below)

To save the pain of modify a tree block and track the free spaces in
the index, changed blocks are appended at the end. Based on the
assumption that user won't change too many files each time, for an
updated file, all its parent blocks to Root was copied and appended to
index. In other words, all traversed blocks are copied. The offset of
previous version of the updated block is stored in the new block. A
new generation number is stored in copied and updated blocks. Other
blocks are not copied. They are referenced by offsets. After update a
new trailer is created at the end. In this way, there's no harm to
read, and makes write fast.

Trailer stores the offset of previous trailer. It makes tracking old
versions easy.

Each operation will load and rewrite the header and visited trailer .

Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.

For compatibility, old format of index will be transformed to new
format in the first operation.

==
Index format:

- A 8-byte header consisting of

    4-byte signature:
      The signature is { 'T', 'R', 'E', 'E' }

    4-byte version number:
      The current supported versions are 4.

  - A number of blocks of variable size

     1-byte block type

     3-byte content size in byte

     block content

     4-byte crc32 of all above

  - A 20-byte trailer consisting of

     4-byte trailer signature:
       The signature is { 'R', 'O', 'O', 'T' }

     4-byte root block offset

     4-byte extension table offset:
       Zero means no extension

     4-byte offset to previous trailer

     4-byte checksum:
       CRC32 of the header and the trailer (excluding this field)

==
Tree block:

Tree block content is basically the list of entries in a specified
path, with all attributes we store in the index now. This entry list
is sorted by pathname. For doing a bsearch in the list, pathnames are
stored at the end of block, which makes the size of entry fixed. The
pathname is pointed from each entry with 2 byte offset (relative to a
block). This should not be problem as a block is never modified.

It stores the generation number and the offset to old block. It also
integrates the content of cache-tree.

A tree block content consists of

 - 4-byte tree object size

 - 20-byte SHA-1 of the cached tree object

- checkpoint : interleave with items in the block, 1 for every 100 items
	4-byte offset to next checkpoint
	
 - a list attributes corresponding to tree object's item, in the same
   order. These attributes are the same as in DIRC entry format
   except that entry name is removed, and a tree block offset is
   added in case the item is a directory.

   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data

   32-bit ctime nanosecond fractions
     this is stat(2) data

   32-bit mtime seconds, the last time a file's data changed
     this is stat(2) data

   32-bit mtime nanosecond fractions
     this is stat(2) data

   32-bit dev
     this is stat(2) data

   32-bit ino
     this is stat(2) data

   32-bit mode, split into (high to low bits)

     4-bit object type
       valid values in binary are 1000 (regular file), 1010 (symbolic link)
       and 1110 (gitlink)

     3-bit unused

     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
     Symbolic links and gitlinks have value 0 in this field.

   32-bit uid
     this is stat(2) data

   32-bit gid
     this is stat(2) data

   32-bit file size
     This is the on-disk size from stat(2), truncated to 32-bit.

   160-bit SHA-1 for the represented object if blobs or the offset
     to another tree block if trees

   A 32-bit 'flags' field split into (high to low bits)

     1-bit assume-valid flag

     1-bit extended flag (must be zero in version 2)

     2-bit stage (during merge)

     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
     is stored in this field.

     1-bit skip-worktree flag (used by sparse checkout)

     1-bit intent-to-add flag (used by "git add -N")

     14-bit unused, must be zero

      A 16-bit offset, relative to the beginning of this block, to the
pathname of this entry

- 2-byte generation number, starts from 1

- 4-byte previous version offset

 - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
   above

== Extension table block

Extension table has block type 'X'. It consists of a series of 4-byte
extension block offset.

== Extension block

Extension block has block type 'E'. Extension content is the same as
in the old format.



Time line:

24/04 ~ 21/05: get familiar with code base and revise proposal
benchmark with linux kernel on major operations
write prototype to prove feasibility of proposed solution
consult mailing list & irc

22/05 ~ 25/06 write code, test and benchmark
	modify tree lib
	modify index format operations
	modify git operations
	transform from old to new

26/06 ~ 30/07 revise things according to benchmark

31/07 ~ 13/08 update documentation
	

About me

My name is Elton Tian, I am from China. I have been living in
Australia for quite a few years. I am currently a Master student from
Australia National University. After graduate I worked on linux based
web development (using tcl) for 2.5 years. I have been programming
with c, c#, java, tcl, php, javascript and shell script. But I prefer
c, which gives me the feeling of full control over the program. I am
interested in data intensive computing. I played with hadoop and
mapreduce since 2010. I maintained my 3 node cluster behind my desk.
As linus hates cvs with a passion, I still want to mention I was using
cvs in work, don't like it though. And I guess this is the good chance
to get over it.

On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@gmail.com> wrote:
> Hello,
>
> Some updates for Nguyen's index:
>
> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@gmail.com> wrote:
>>> I am not sure how the trailer works.
>>> I assume there can be multiple trailers, each update will generate a
>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>> point to the same block?). So if there are some changes to root, like
>>> rename, trailers all point to the latest root block?
>>
>> Each trailer points to the whole new tree. Because trees are
>> immutable, changing in a tree meangs creating a new one and will also
>> make a new parent tree (to point to the updated tree because old
>> parent will always point to old tree). This eventually leads to root
>> tree change, recorded by the trailer.
>>
>>> Is the index looks like :
>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>> BLOCKS | TRAILER | ...
>>>
>
> Once an update happened to a block, all parent blocks to root will be
> copied to the end of the index. And the updated block will be at leaf
> of the new tree. Finding this path costs logn time anyway. This is no
> harm for read for the whole tree. But in order to find all previous
> changes to a block, we have to go through all trailers and trees.
>
> Otherwise, just modify the original tree. Let the parent points to the
> updated block and let the updated block points to the old block:
>
> parent
>    |
>   V
> updated            old                  old
> block (v3)  -->   block(v2)  -->  block (v1)
>    |
>   V
> child
> blocks
>
> A version number in a tree block is used to track the changes.
> In this way, there's still no harm to read, and it's more easy to
> trace the change history of a block. Also, we don't need to create
> interleaved blocks and trailers. There's only one trailer in the end
> of file. We also need to add another offset points to previous
> version.
>
> Trailer is kept at the end of index, as its size is variable. It
> contains offset to root and list of free spaces.
>
> Changes to format:
>
>> = The git index file has the following format
>
> As is, except there's only one trailer now. And trailer contains list
> of free spaces.
>
>>- A 18-byte trailer consisting of
>>
>>      4-byte trailer signature:
>>        The signature is { 'R', 'O', 'O', 'T' }
>>
>>      2-byte generation:
>>         The first trailer is 0, the second 1 and so on.
>>
>>      4-byte root block offset
>>
>>      4-byte extension table offset:
>>        Zero means no extension
>>
>        list of free spaces
>        - 4 byte offset
>        - 2 byte length
>
>>      4-byte checksum:
>>        CRC32 of the header and the trailer (excluding this field)
>
> Free space list is read/written in whole for each operation, together
> with trailer.
>
>>
>> == Tree block
>> ...
>
> Above as is.
>
> - 1 byte version num
>
> - 4 byte offset to previous version block
>
>>
>>    160-bit SHA-1 for the represented object if blobs or the offset
>>      to another tree block if trees
>>
>>    A 32-bit 'flags' field split into (high to low bits)
>>
>>      1-bit assume-valid flag
>>
>>      1-bit extended flag (must be zero in version 2)
>>
>>      2-bit stage (during merge)
>>
>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>      is stored in this field.
>>
>>      1-bit skip-worktree flag (used by sparse checkout)
>>
>>      1-bit intent-to-add flag (used by "git add -N")
>>
>>      14-bit unused, must be zero
>>
>>    A 16-bit offset, relative to the beginning of this block, to the
>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>      beginning of the file, so that we can reuse pathnames from other
>>      (old) blocks?
>>
>
> -Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-06  3:13                                         ` elton sky
@ 2012-04-06  3:15                                           ` elton sky
  2012-04-07  8:29                                             ` elton sky
  0 siblings, 1 reply; 33+ messages in thread
From: elton sky @ 2012-04-06  3:15 UTC (permalink / raw)
  To: Git Mailing List, Nguyen Thai Ngoc Duy

NOTE:
Please ignore the

>- checkpoint : interleave with items in the block, 1 for every 100 items
>       4-byte offset to next checkpoint

That's irrelevant.

Cheers,
Elton

On Fri, Apr 6, 2012 at 1:13 PM, elton sky <eltonsky9404@gmail.com> wrote:
> Thank you everyone for ideas, clues, explanations and questions.
> Collectively I wrote my proposal. This one is mostly based on Duy's
> suggestion with minor changes.
>
>
> Problem:
> Current index store all files in a list. This implies that:
> 1. Each operation has to read the whole index and write the whole index back.
> 2. computing the checksum for the whole index.
> These become expensive when the repo is large.
>
> Requirement:
> Suppose n is the number objects in a repo
> -Keep the read/write time <= O(logn).
> -Keep the checksum computed against only necessary objects.
> -New format is easy to parse.
> -Backward compatible.
> -Potentially use faster hash method.
>
> Proposed solution:
>
> Store the repo structure in a canonical tree. Each directory is a tree
> block. A tree block contains blobs and offsets to sub directories. It
> has its own checksum. A read/write will be done on tree block base. A
> tree block also contains an offset points to its previous version (if
> there's one).
>
> The root of offset is stored in trailer, which stays at the end of
> file. Each update creates a new trailer which points to the new tree.
> (details below)
>
> To save the pain of modify a tree block and track the free spaces in
> the index, changed blocks are appended at the end. Based on the
> assumption that user won't change too many files each time, for an
> updated file, all its parent blocks to Root was copied and appended to
> index. In other words, all traversed blocks are copied. The offset of
> previous version of the updated block is stored in the new block. A
> new generation number is stored in copied and updated blocks. Other
> blocks are not copied. They are referenced by offsets. After update a
> new trailer is created at the end. In this way, there's no harm to
> read, and makes write fast.
>
> Trailer stores the offset of previous trailer. It makes tracking old
> versions easy.
>
> Each operation will load and rewrite the header and visited trailer .
>
> Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.
>
> For compatibility, old format of index will be transformed to new
> format in the first operation.
>
> ==
> Index format:
>
> - A 8-byte header consisting of
>
>    4-byte signature:
>      The signature is { 'T', 'R', 'E', 'E' }
>
>    4-byte version number:
>      The current supported versions are 4.
>
>  - A number of blocks of variable size
>
>     1-byte block type
>
>     3-byte content size in byte
>
>     block content
>
>     4-byte crc32 of all above
>
>  - A 20-byte trailer consisting of
>
>     4-byte trailer signature:
>       The signature is { 'R', 'O', 'O', 'T' }
>
>     4-byte root block offset
>
>     4-byte extension table offset:
>       Zero means no extension
>
>     4-byte offset to previous trailer
>
>     4-byte checksum:
>       CRC32 of the header and the trailer (excluding this field)
>
> ==
> Tree block:
>
> Tree block content is basically the list of entries in a specified
> path, with all attributes we store in the index now. This entry list
> is sorted by pathname. For doing a bsearch in the list, pathnames are
> stored at the end of block, which makes the size of entry fixed. The
> pathname is pointed from each entry with 2 byte offset (relative to a
> block). This should not be problem as a block is never modified.
>
> It stores the generation number and the offset to old block. It also
> integrates the content of cache-tree.
>
> A tree block content consists of
>
>  - 4-byte tree object size
>
>  - 20-byte SHA-1 of the cached tree object
>
> - checkpoint : interleave with items in the block, 1 for every 100 items
>        4-byte offset to next checkpoint
>
>  - a list attributes corresponding to tree object's item, in the same
>   order. These attributes are the same as in DIRC entry format
>   except that entry name is removed, and a tree block offset is
>   added in case the item is a directory.
>
>   32-bit ctime seconds, the last time a file's metadata changed
>     this is stat(2) data
>
>   32-bit ctime nanosecond fractions
>     this is stat(2) data
>
>   32-bit mtime seconds, the last time a file's data changed
>     this is stat(2) data
>
>   32-bit mtime nanosecond fractions
>     this is stat(2) data
>
>   32-bit dev
>     this is stat(2) data
>
>   32-bit ino
>     this is stat(2) data
>
>   32-bit mode, split into (high to low bits)
>
>     4-bit object type
>       valid values in binary are 1000 (regular file), 1010 (symbolic link)
>       and 1110 (gitlink)
>
>     3-bit unused
>
>     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>     Symbolic links and gitlinks have value 0 in this field.
>
>   32-bit uid
>     this is stat(2) data
>
>   32-bit gid
>     this is stat(2) data
>
>   32-bit file size
>     This is the on-disk size from stat(2), truncated to 32-bit.
>
>   160-bit SHA-1 for the represented object if blobs or the offset
>     to another tree block if trees
>
>   A 32-bit 'flags' field split into (high to low bits)
>
>     1-bit assume-valid flag
>
>     1-bit extended flag (must be zero in version 2)
>
>     2-bit stage (during merge)
>
>     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>     is stored in this field.
>
>     1-bit skip-worktree flag (used by sparse checkout)
>
>     1-bit intent-to-add flag (used by "git add -N")
>
>     14-bit unused, must be zero
>
>      A 16-bit offset, relative to the beginning of this block, to the
> pathname of this entry
>
> - 2-byte generation number, starts from 1
>
> - 4-byte previous version offset
>
>  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
>   above
>
> == Extension table block
>
> Extension table has block type 'X'. It consists of a series of 4-byte
> extension block offset.
>
> == Extension block
>
> Extension block has block type 'E'. Extension content is the same as
> in the old format.
>
>
>
> Time line:
>
> 24/04 ~ 21/05: get familiar with code base and revise proposal
> benchmark with linux kernel on major operations
> write prototype to prove feasibility of proposed solution
> consult mailing list & irc
>
> 22/05 ~ 25/06 write code, test and benchmark
>        modify tree lib
>        modify index format operations
>        modify git operations
>        transform from old to new
>
> 26/06 ~ 30/07 revise things according to benchmark
>
> 31/07 ~ 13/08 update documentation
>
>
> About me
>
> My name is Elton Tian, I am from China. I have been living in
> Australia for quite a few years. I am currently a Master student from
> Australia National University. After graduate I worked on linux based
> web development (using tcl) for 2.5 years. I have been programming
> with c, c#, java, tcl, php, javascript and shell script. But I prefer
> c, which gives me the feeling of full control over the program. I am
> interested in data intensive computing. I played with hadoop and
> mapreduce since 2010. I maintained my 3 node cluster behind my desk.
> As linus hates cvs with a passion, I still want to mention I was using
> cvs in work, don't like it though. And I guess this is the good chance
> to get over it.
>
> On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@gmail.com> wrote:
>> Hello,
>>
>> Some updates for Nguyen's index:
>>
>> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@gmail.com> wrote:
>>>> I am not sure how the trailer works.
>>>> I assume there can be multiple trailers, each update will generate a
>>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>>> point to the same block?). So if there are some changes to root, like
>>>> rename, trailers all point to the latest root block?
>>>
>>> Each trailer points to the whole new tree. Because trees are
>>> immutable, changing in a tree meangs creating a new one and will also
>>> make a new parent tree (to point to the updated tree because old
>>> parent will always point to old tree). This eventually leads to root
>>> tree change, recorded by the trailer.
>>>
>>>> Is the index looks like :
>>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>>> BLOCKS | TRAILER | ...
>>>>
>>
>> Once an update happened to a block, all parent blocks to root will be
>> copied to the end of the index. And the updated block will be at leaf
>> of the new tree. Finding this path costs logn time anyway. This is no
>> harm for read for the whole tree. But in order to find all previous
>> changes to a block, we have to go through all trailers and trees.
>>
>> Otherwise, just modify the original tree. Let the parent points to the
>> updated block and let the updated block points to the old block:
>>
>> parent
>>    |
>>   V
>> updated            old                  old
>> block (v3)  -->   block(v2)  -->  block (v1)
>>    |
>>   V
>> child
>> blocks
>>
>> A version number in a tree block is used to track the changes.
>> In this way, there's still no harm to read, and it's more easy to
>> trace the change history of a block. Also, we don't need to create
>> interleaved blocks and trailers. There's only one trailer in the end
>> of file. We also need to add another offset points to previous
>> version.
>>
>> Trailer is kept at the end of index, as its size is variable. It
>> contains offset to root and list of free spaces.
>>
>> Changes to format:
>>
>>> = The git index file has the following format
>>
>> As is, except there's only one trailer now. And trailer contains list
>> of free spaces.
>>
>>>- A 18-byte trailer consisting of
>>>
>>>      4-byte trailer signature:
>>>        The signature is { 'R', 'O', 'O', 'T' }
>>>
>>>      2-byte generation:
>>>         The first trailer is 0, the second 1 and so on.
>>>
>>>      4-byte root block offset
>>>
>>>      4-byte extension table offset:
>>>        Zero means no extension
>>>
>>        list of free spaces
>>        - 4 byte offset
>>        - 2 byte length
>>
>>>      4-byte checksum:
>>>        CRC32 of the header and the trailer (excluding this field)
>>
>> Free space list is read/written in whole for each operation, together
>> with trailer.
>>
>>>
>>> == Tree block
>>> ...
>>
>> Above as is.
>>
>> - 1 byte version num
>>
>> - 4 byte offset to previous version block
>>
>>>
>>>    160-bit SHA-1 for the represented object if blobs or the offset
>>>      to another tree block if trees
>>>
>>>    A 32-bit 'flags' field split into (high to low bits)
>>>
>>>      1-bit assume-valid flag
>>>
>>>      1-bit extended flag (must be zero in version 2)
>>>
>>>      2-bit stage (during merge)
>>>
>>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>>      is stored in this field.
>>>
>>>      1-bit skip-worktree flag (used by sparse checkout)
>>>
>>>      1-bit intent-to-add flag (used by "git add -N")
>>>
>>>      14-bit unused, must be zero
>>>
>>>    A 16-bit offset, relative to the beginning of this block, to the
>>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>>      beginning of the file, so that we can reuse pathnames from other
>>>      (old) blocks?
>>>
>>
>> -Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GSoC - Designing a faster index format
  2012-04-06  3:15                                           ` elton sky
@ 2012-04-07  8:29                                             ` elton sky
  0 siblings, 0 replies; 33+ messages in thread
From: elton sky @ 2012-04-07  8:29 UTC (permalink / raw)
  To: Git Mailing List, Nguyen Thai Ngoc Duy

.... just realize I have to register and submit my proposal on gsoc
website, rather than here.

I stupidly missed that ...

silly enough..

On Fri, Apr 6, 2012 at 1:15 PM, elton sky <eltonsky9404@gmail.com> wrote:
> NOTE:
> Please ignore the
>
>>- checkpoint : interleave with items in the block, 1 for every 100 items
>>       4-byte offset to next checkpoint
>
> That's irrelevant.
>
> Cheers,
> Elton
>
> On Fri, Apr 6, 2012 at 1:13 PM, elton sky <eltonsky9404@gmail.com> wrote:
>> Thank you everyone for ideas, clues, explanations and questions.
>> Collectively I wrote my proposal. This one is mostly based on Duy's
>> suggestion with minor changes.
>>
>>
>> Problem:
>> Current index store all files in a list. This implies that:
>> 1. Each operation has to read the whole index and write the whole index back.
>> 2. computing the checksum for the whole index.
>> These become expensive when the repo is large.
>>
>> Requirement:
>> Suppose n is the number objects in a repo
>> -Keep the read/write time <= O(logn).
>> -Keep the checksum computed against only necessary objects.
>> -New format is easy to parse.
>> -Backward compatible.
>> -Potentially use faster hash method.
>>
>> Proposed solution:
>>
>> Store the repo structure in a canonical tree. Each directory is a tree
>> block. A tree block contains blobs and offsets to sub directories. It
>> has its own checksum. A read/write will be done on tree block base. A
>> tree block also contains an offset points to its previous version (if
>> there's one).
>>
>> The root of offset is stored in trailer, which stays at the end of
>> file. Each update creates a new trailer which points to the new tree.
>> (details below)
>>
>> To save the pain of modify a tree block and track the free spaces in
>> the index, changed blocks are appended at the end. Based on the
>> assumption that user won't change too many files each time, for an
>> updated file, all its parent blocks to Root was copied and appended to
>> index. In other words, all traversed blocks are copied. The offset of
>> previous version of the updated block is stored in the new block. A
>> new generation number is stored in copied and updated blocks. Other
>> blocks are not copied. They are referenced by offsets. After update a
>> new trailer is created at the end. In this way, there's no harm to
>> read, and makes write fast.
>>
>> Trailer stores the offset of previous trailer. It makes tracking old
>> versions easy.
>>
>> Each operation will load and rewrite the header and visited trailer .
>>
>> Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.
>>
>> For compatibility, old format of index will be transformed to new
>> format in the first operation.
>>
>> ==
>> Index format:
>>
>> - A 8-byte header consisting of
>>
>>    4-byte signature:
>>      The signature is { 'T', 'R', 'E', 'E' }
>>
>>    4-byte version number:
>>      The current supported versions are 4.
>>
>>  - A number of blocks of variable size
>>
>>     1-byte block type
>>
>>     3-byte content size in byte
>>
>>     block content
>>
>>     4-byte crc32 of all above
>>
>>  - A 20-byte trailer consisting of
>>
>>     4-byte trailer signature:
>>       The signature is { 'R', 'O', 'O', 'T' }
>>
>>     4-byte root block offset
>>
>>     4-byte extension table offset:
>>       Zero means no extension
>>
>>     4-byte offset to previous trailer
>>
>>     4-byte checksum:
>>       CRC32 of the header and the trailer (excluding this field)
>>
>> ==
>> Tree block:
>>
>> Tree block content is basically the list of entries in a specified
>> path, with all attributes we store in the index now. This entry list
>> is sorted by pathname. For doing a bsearch in the list, pathnames are
>> stored at the end of block, which makes the size of entry fixed. The
>> pathname is pointed from each entry with 2 byte offset (relative to a
>> block). This should not be problem as a block is never modified.
>>
>> It stores the generation number and the offset to old block. It also
>> integrates the content of cache-tree.
>>
>> A tree block content consists of
>>
>>  - 4-byte tree object size
>>
>>  - 20-byte SHA-1 of the cached tree object
>>
>> - checkpoint : interleave with items in the block, 1 for every 100 items
>>        4-byte offset to next checkpoint
>>
>>  - a list attributes corresponding to tree object's item, in the same
>>   order. These attributes are the same as in DIRC entry format
>>   except that entry name is removed, and a tree block offset is
>>   added in case the item is a directory.
>>
>>   32-bit ctime seconds, the last time a file's metadata changed
>>     this is stat(2) data
>>
>>   32-bit ctime nanosecond fractions
>>     this is stat(2) data
>>
>>   32-bit mtime seconds, the last time a file's data changed
>>     this is stat(2) data
>>
>>   32-bit mtime nanosecond fractions
>>     this is stat(2) data
>>
>>   32-bit dev
>>     this is stat(2) data
>>
>>   32-bit ino
>>     this is stat(2) data
>>
>>   32-bit mode, split into (high to low bits)
>>
>>     4-bit object type
>>       valid values in binary are 1000 (regular file), 1010 (symbolic link)
>>       and 1110 (gitlink)
>>
>>     3-bit unused
>>
>>     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>>     Symbolic links and gitlinks have value 0 in this field.
>>
>>   32-bit uid
>>     this is stat(2) data
>>
>>   32-bit gid
>>     this is stat(2) data
>>
>>   32-bit file size
>>     This is the on-disk size from stat(2), truncated to 32-bit.
>>
>>   160-bit SHA-1 for the represented object if blobs or the offset
>>     to another tree block if trees
>>
>>   A 32-bit 'flags' field split into (high to low bits)
>>
>>     1-bit assume-valid flag
>>
>>     1-bit extended flag (must be zero in version 2)
>>
>>     2-bit stage (during merge)
>>
>>     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>     is stored in this field.
>>
>>     1-bit skip-worktree flag (used by sparse checkout)
>>
>>     1-bit intent-to-add flag (used by "git add -N")
>>
>>     14-bit unused, must be zero
>>
>>      A 16-bit offset, relative to the beginning of this block, to the
>> pathname of this entry
>>
>> - 2-byte generation number, starts from 1
>>
>> - 4-byte previous version offset
>>
>>  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
>>   above
>>
>> == Extension table block
>>
>> Extension table has block type 'X'. It consists of a series of 4-byte
>> extension block offset.
>>
>> == Extension block
>>
>> Extension block has block type 'E'. Extension content is the same as
>> in the old format.
>>
>>
>>
>> Time line:
>>
>> 24/04 ~ 21/05: get familiar with code base and revise proposal
>> benchmark with linux kernel on major operations
>> write prototype to prove feasibility of proposed solution
>> consult mailing list & irc
>>
>> 22/05 ~ 25/06 write code, test and benchmark
>>        modify tree lib
>>        modify index format operations
>>        modify git operations
>>        transform from old to new
>>
>> 26/06 ~ 30/07 revise things according to benchmark
>>
>> 31/07 ~ 13/08 update documentation
>>
>>
>> About me
>>
>> My name is Elton Tian, I am from China. I have been living in
>> Australia for quite a few years. I am currently a Master student from
>> Australia National University. After graduate I worked on linux based
>> web development (using tcl) for 2.5 years. I have been programming
>> with c, c#, java, tcl, php, javascript and shell script. But I prefer
>> c, which gives me the feeling of full control over the program. I am
>> interested in data intensive computing. I played with hadoop and
>> mapreduce since 2010. I maintained my 3 node cluster behind my desk.
>> As linus hates cvs with a passion, I still want to mention I was using
>> cvs in work, don't like it though. And I guess this is the good chance
>> to get over it.
>>
>> On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@gmail.com> wrote:
>>> Hello,
>>>
>>> Some updates for Nguyen's index:
>>>
>>> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>>>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@gmail.com> wrote:
>>>>> I am not sure how the trailer works.
>>>>> I assume there can be multiple trailers, each update will generate a
>>>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>>>> point to the same block?). So if there are some changes to root, like
>>>>> rename, trailers all point to the latest root block?
>>>>
>>>> Each trailer points to the whole new tree. Because trees are
>>>> immutable, changing in a tree meangs creating a new one and will also
>>>> make a new parent tree (to point to the updated tree because old
>>>> parent will always point to old tree). This eventually leads to root
>>>> tree change, recorded by the trailer.
>>>>
>>>>> Is the index looks like :
>>>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>>>> BLOCKS | TRAILER | ...
>>>>>
>>>
>>> Once an update happened to a block, all parent blocks to root will be
>>> copied to the end of the index. And the updated block will be at leaf
>>> of the new tree. Finding this path costs logn time anyway. This is no
>>> harm for read for the whole tree. But in order to find all previous
>>> changes to a block, we have to go through all trailers and trees.
>>>
>>> Otherwise, just modify the original tree. Let the parent points to the
>>> updated block and let the updated block points to the old block:
>>>
>>> parent
>>>    |
>>>   V
>>> updated            old                  old
>>> block (v3)  -->   block(v2)  -->  block (v1)
>>>    |
>>>   V
>>> child
>>> blocks
>>>
>>> A version number in a tree block is used to track the changes.
>>> In this way, there's still no harm to read, and it's more easy to
>>> trace the change history of a block. Also, we don't need to create
>>> interleaved blocks and trailers. There's only one trailer in the end
>>> of file. We also need to add another offset points to previous
>>> version.
>>>
>>> Trailer is kept at the end of index, as its size is variable. It
>>> contains offset to root and list of free spaces.
>>>
>>> Changes to format:
>>>
>>>> = The git index file has the following format
>>>
>>> As is, except there's only one trailer now. And trailer contains list
>>> of free spaces.
>>>
>>>>- A 18-byte trailer consisting of
>>>>
>>>>      4-byte trailer signature:
>>>>        The signature is { 'R', 'O', 'O', 'T' }
>>>>
>>>>      2-byte generation:
>>>>         The first trailer is 0, the second 1 and so on.
>>>>
>>>>      4-byte root block offset
>>>>
>>>>      4-byte extension table offset:
>>>>        Zero means no extension
>>>>
>>>        list of free spaces
>>>        - 4 byte offset
>>>        - 2 byte length
>>>
>>>>      4-byte checksum:
>>>>        CRC32 of the header and the trailer (excluding this field)
>>>
>>> Free space list is read/written in whole for each operation, together
>>> with trailer.
>>>
>>>>
>>>> == Tree block
>>>> ...
>>>
>>> Above as is.
>>>
>>> - 1 byte version num
>>>
>>> - 4 byte offset to previous version block
>>>
>>>>
>>>>    160-bit SHA-1 for the represented object if blobs or the offset
>>>>      to another tree block if trees
>>>>
>>>>    A 32-bit 'flags' field split into (high to low bits)
>>>>
>>>>      1-bit assume-valid flag
>>>>
>>>>      1-bit extended flag (must be zero in version 2)
>>>>
>>>>      2-bit stage (during merge)
>>>>
>>>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>>>      is stored in this field.
>>>>
>>>>      1-bit skip-worktree flag (used by sparse checkout)
>>>>
>>>>      1-bit intent-to-add flag (used by "git add -N")
>>>>
>>>>      14-bit unused, must be zero
>>>>
>>>>    A 16-bit offset, relative to the beginning of this block, to the
>>>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>>>      beginning of the file, so that we can reuse pathnames from other
>>>>      (old) blocks?
>>>>
>>>
>>> -Elton

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2012-04-07  8:30 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-20 23:10 GSoC - Designing a faster index format elton sky
2012-03-21  1:18 ` Nguyen Thai Ngoc Duy
2012-03-21 11:25 ` Thomas Rast
2012-03-21 12:01   ` elton sky
2012-03-22 20:32     ` elton sky
2012-03-23  0:46       ` Jakub Narebski
2012-03-23  1:30       ` Nguyen Thai Ngoc Duy
2012-03-23 10:27         ` elton sky
2012-03-23 11:24           ` Nguyen Thai Ngoc Duy
     [not found]             ` <CAKTdtZmLOzAgG0uCDcVr+O41XPX-XnoVZjsZWPN-BLjq2oG-7A@mail.gmail.com>
2012-03-24  8:58               ` Nguyen Thai Ngoc Duy
     [not found]                 ` <CAKTdtZkpjVaBSkcieojKj+V7WztT3UDzjGfXyghY=S8mq+X9zw@mail.gmail.com>
     [not found]                   ` <CACsJy8D85thmK_5jLC7MxJtsitLr=zphKiw2miwPu7Exf7ty=Q@mail.gmail.com>
2012-03-26 12:36                     ` elton sky
2012-03-26 12:41                       ` elton sky
2012-03-26 14:28                       ` Thomas Rast
2012-03-26 15:25                         ` Nguyen Thai Ngoc Duy
2012-03-26 16:08                           ` Shawn Pearce
2012-03-27  2:49                             ` elton sky
2012-03-27  3:34                               ` David Barr
2012-03-27  6:33                                 ` Nguyen Thai Ngoc Duy
2012-03-29  9:45                                   ` Jeff King
2012-03-27  6:31                             ` Nguyen Thai Ngoc Duy
2012-03-26 16:19                         ` Nguyen Thai Ngoc Duy
2012-03-27  3:20                           ` elton sky
2012-03-27  6:43                             ` Nguyen Thai Ngoc Duy
2012-04-02 11:50                               ` elton sky
2012-04-02 12:31                                 ` Nguyen Thai Ngoc Duy
2012-04-02 14:27                                   ` Shawn Pearce
2012-04-02 15:12                                     ` Nguyen Thai Ngoc Duy
2012-04-04  8:26                                   ` elton sky
2012-04-04 12:20                                     ` Nguyen Thai Ngoc Duy
2012-04-04 16:22                                       ` elton sky
2012-04-06  3:13                                         ` elton sky
2012-04-06  3:15                                           ` elton sky
2012-04-07  8:29                                             ` elton sky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.