* [GSoC] Designing a faster index format
@ 2012-03-20 21:17 Thomas Gummerer
  2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-20 21:17 UTC (permalink / raw)
  To: git

Hello,

I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old, from Italy. I'm currently a 3rd-year Bachelor student in Applied Computer Science at the Free University of Bolzano. I started programming in high school about 8 years ago with Pascal and then learned C and Java. For some of my projects you can visit my homepage (http://tgummerer.com/projects); most of them are from university, and some are personal projects I did in my free time. My blog is on the same homepage, but it is not really active. Unfortunately I haven't yet been able to participate in any bigger open source project, although I've been interested in doing so basically since I started programming.

I would be interested in the "Designing a faster index format" project for Google Summer of Code. I've been using git for every project where possible since switching to it from svn 2-3 years ago. It would be awesome to be able to contribute to it and make it faster. I think this project could be a good fit for me, because I like trying to optimize data structures and algorithms for the best performance possible.

I have some questions about the project.
Has there already been any development or thoughts in that direction?
The description mentions databases, but wouldn't that make the index harder to read for other .git-reading programs?

Thanks,
Thomas Gummerer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GSoC] Designing a faster index format
  2012-03-20 21:17 [GSoC] Designing a faster index format Thomas Gummerer
@ 2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
  2012-03-21  9:22   ` Thomas Gummerer
  0 siblings, 1 reply; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-21  1:29 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git

On Wed, Mar 21, 2012 at 4:17 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> I have some questions about the project.
> Has there already been any development or thoughts in that direction?

A tree-based format seems to be favored by the core developers, see for
example the footnote in [1], but there have been no attempts to realize
it yet. You may want to search the mail archive for more.

[1] http://thread.gmane.org/gmane.comp.version-control.git/190016
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
@ 2012-03-21  9:22   ` Thomas Gummerer
  2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-21  9:22 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git


On Mar 21, 2012, at 2:29 AM, Nguyen Thai Ngoc Duy wrote:

> On Wed, Mar 21, 2012 at 4:17 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> I have some questions about the project.
>> Has there already been any development or thoughts in that direction?
> 
> A tree-based format seems to be favored by the core developers, see for
> example the footnote in [1], but there have been no attempts to realize
> it yet. You may want to search the mail archive for more.
> 

Thanks for your answer; so I would probably implement a tree-based structure.
However, I have one more question, since I'm not yet really familiar with the
code and internal structure of git: what exactly does the SHA-1 over the
index achieve? Is it only for checking that the index is still correct
the next time it is used and has not been changed, or does it have a more
important function?

--
Thomas Gummerer


* Re: [GSoC] Designing a faster index format
  2012-03-21  9:22   ` Thomas Gummerer
@ 2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
       [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
  2012-03-21 12:51       ` Thomas Rast
  0 siblings, 2 replies; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-21 10:34 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git

On Wed, Mar 21, 2012 at 4:22 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On Mar 21, 2012, at 2:29 AM, Nguyen Thai Ngoc Duy wrote:
>
>> On Wed, Mar 21, 2012 at 4:17 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> I have some questions about the project.
>>> Has there already been any development or thoughts in that direction?
>>
>> A tree-based format seems to be favored by the core developers, see for
>> example the footnote in [1], but there have been no attempts to realize
>> it yet. You may want to search the mail archive for more.
>>
>
> Thanks for your answer; so I would probably implement a tree-based structure.

There may be a better approach, maybe a simple database format that we
could imitate. It's up to you to find out :)

> However, I have one more question, since I'm not yet really familiar with the
> code and internal structure of git: what exactly does the SHA-1 over the
> index achieve? Is it only for checking that the index is still correct
> the next time it is used and has not been changed, or does it have a more
> important function?

It makes sure the index is not corrupt (due to disk faults, for example). I
don't think it is used for anything else. A cheaper checksum could be used
if it is good enough for the index. See
http://thread.gmane.org/gmane.comp.version-control.git/190016 and the
following reply.
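For illustration (a rough sketch, not git's actual reader): the index file
ends in a 20-byte SHA-1 computed over everything before it, so detecting
corruption is just re-hashing the body and comparing:

```python
import hashlib

def verify_index_checksum(data: bytes) -> bool:
    # the last 20 bytes store a SHA-1 over the rest of the file;
    # re-hash the body and compare to detect on-disk corruption
    body, stored = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == stored
```

The expensive part on the write side is the mirror image of this: every
writer must recompute that hash over the full body, however small the change.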
-- 
Duy


* Re: [GSoC] Designing a faster index format
       [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
@ 2012-03-21 11:18         ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-21 11:18 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git

On Wed, Mar 21, 2012 at 6:08 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> > However, I have one more question, since I'm not yet really familiar
>> > with the code and internal structure of git: what exactly does the
>> > SHA-1 over the index achieve? Is it only for checking that the index
>> > is still correct the next time it is used and has not been changed,
>> > or does it have a more important function?
>>
>> It makes sure the index is not corrupt (due to disk faults, for example). I
>> don't think it is used for anything else. A cheaper checksum could be used
>> if it is good enough for the index. See
>> http://thread.gmane.org/gmane.comp.version-control.git/190016 and the
>> following reply.
>
> Ok, thanks. I guess that could be another way to speed things up, but for
> really big repositories it probably won't be enough, and reducing the
> complexity as described in the proposal is the way to go.

Yeah. A good format should not require checksumming the whole index if
we only need to read or verify a few entries, I think.
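One way such a format could do that (purely a hypothetical sketch, not a
proposed on-disk layout): store one digest per fixed-size chunk, so a reader
verifies only the chunks it actually touches:

```python
import hashlib

CHUNK = 4096  # hypothetical chunk size

def chunk_checksums(body: bytes) -> list:
    # one digest per chunk instead of one digest over the whole file
    return [hashlib.sha1(body[i:i + CHUNK]).digest()
            for i in range(0, len(body), CHUNK)]

def verify_chunk(body: bytes, sums: list, idx: int) -> bool:
    # verifying a few entries only costs hashing their chunks
    chunk = body[idx * CHUNK:(idx + 1) * CHUNK]
    return hashlib.sha1(chunk).digest() == sums[idx]
```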

> Is the current complexity O(n)?

I think so. Have a look at read-cache.c, read_index_from for reading
and write_index for writing, as a starting point.
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
       [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
@ 2012-03-21 12:51       ` Thomas Rast
  2012-03-21 15:43         ` Thomas Gummerer
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Rast @ 2012-03-21 12:51 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Wed, Mar 21, 2012 at 4:22 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>
>> On Mar 21, 2012, at 2:29 AM, Nguyen Thai Ngoc Duy wrote:
>>
>> However, I have one more question, since I'm not yet really familiar with the
>> code and internal structure of git: what exactly does the SHA-1 over the
>> index achieve? Is it only for checking that the index is still correct
>> the next time it is used and has not been changed, or does it have a more
>> important function?
>
> It makes sure the index is not corrupt (due to disk faults, for example). I
> don't think it is used for anything else. A cheaper checksum could be used
> if it is good enough for the index. See
> http://thread.gmane.org/gmane.comp.version-control.git/190016 and the
> following reply.

Note that switching the checksum used already requires a
backwards-incompatible change of the index format.  If we are going to
do that, I'm somewhat opposed to not also revising it along the lines
sketched by Shawn (at least).

See my reply to Elton Sky

  http://thread.gmane.org/gmane.comp.version-control.git/193550/focus=193571

for links to some threads you may want to look at.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-03-21 12:51       ` Thomas Rast
@ 2012-03-21 15:43         ` Thomas Gummerer
  2012-03-21 16:19           ` Thomas Rast
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-21 15:43 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, git


On Mar 21, 2012, at 1:51 PM, Thomas Rast wrote:

> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> 
>> On Wed, Mar 21, 2012 at 4:22 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> 
>>> On Mar 21, 2012, at 2:29 AM, Nguyen Thai Ngoc Duy wrote:
>>> 
>>> However, I have one more question, since I'm not yet really familiar with the
>>> code and internal structure of git: what exactly does the SHA-1 over the
>>> index achieve? Is it only for checking that the index is still correct
>>> the next time it is used and has not been changed, or does it have a more
>>> important function?
>> 
>> It makes sure the index is not corrupt (due to disk faults, for example). I
>> don't think it is used for anything else. A cheaper checksum could be used
>> if it is good enough for the index. See
>> http://thread.gmane.org/gmane.comp.version-control.git/190016 and the
>> following reply.
> 
> Note that switching the checksum used already requires a
> backwards-incompatible change of the index format.  If we are going to
> do that, I'm somewhat opposed to not also revising it along the lines
> sketched by Shawn (at least).
> 
> See my reply to Elton Sky
> 
>  http://thread.gmane.org/gmane.comp.version-control.git/193550/focus=193571
> 
> for links to some threads you may want to look at.

From what I understood, backward compatibility would be broken anyway when
reworking the index into a tree (not to mention a database format).
Therefore taking Shawn's thoughts into account should not be a lot more
work, and it should also make the format easier to implement for the
.git-reading programs: since it's easier to parse for the core, it should
also be easier to parse for them.

And if changing the checksum algorithm can bring some advantage, I think
it would be well worth the extra work while we are breaking compatibility
in any case.

--
Thomas Gummerer


* Re: [GSoC] Designing a faster index format
  2012-03-21 15:43         ` Thomas Gummerer
@ 2012-03-21 16:19           ` Thomas Rast
  2012-03-22 22:51             ` Thomas Gummerer
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Rast @ 2012-03-21 16:19 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, git

Thomas Gummerer <t.gummerer@gmail.com> writes:

> From what I understood, backward compatibility would be broken anyway when
> reworking the index into a tree (not to mention a database format).
> Therefore taking Shawn's thoughts into account should not be a lot more
> work, and it should also make the format easier to implement for the
> .git-reading programs: since it's easier to parse for the core, it should
> also be easier to parse for them.
>
> And if changing the checksum algorithm can bring some advantage, I think
> it would be well worth the extra work while we are breaking compatibility
> in any case.

AFAICS we are in agreement.  I was saying it's not worth breaking
compatibility for *only* the checksum change.  You are coming from the
other side: "once all the rest is done, the checksum change is easy",
which is probably true.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-03-21 16:19           ` Thomas Rast
@ 2012-03-22 22:51             ` Thomas Gummerer
  2012-03-23 10:10               ` Thomas Rast
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-22 22:51 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, git

As suggested on IRC, here is my first draft proposal. Should the proposal
already include more technical material, or just a general sketch of the
project?

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example, when adding a single file with git add, even though only a single
blob sha-1 changes in the index, the way it's currently implemented git needs
to recompute a new hash over the whole index.
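To make the cost concrete (a toy sketch with an illustrative entry size, not
git's real writer): the bytes hashed and rewritten grow with the number of
index entries, not with the size of the change:

```python
import hashlib

ENTRY_SIZE = 62  # illustrative rough size of one on-disk index entry

def full_rewrite_cost(n_entries: int) -> int:
    # adding even one file re-hashes and rewrites the whole body
    body = b"\x00" * (n_entries * ENTRY_SIZE)
    hashlib.sha1(body)           # trailing checksum over everything
    return len(body)             # bytes hashed and written back out
```

So an index tracking a million paths pays the full O(n) cost for a
one-entry change.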

-- Proposed solution --
The proposed solution is to rework the index in order to make it possible to
have a complexity of O(log(n)), where n is the number of index entries, for
simple operations like adding files.

The data format will also be easily parsable, in order to make it easier for
programs that read the index (like jgit or libgit2) to adapt to the new
format.

The solution would consist of two steps. In the first step the on-disk index
format would be replaced, but not mmap'd for direct access, which would
mainly save on I/O operations. In the second step the core parts would also
be made aware of the new index, and it would be mmap'd for direct access.

Another way to speed up the hashing would be to exchange the SHA-1 hash for
something faster. That may however not be feasible when keeping the current
in-memory structure.

To make the project feasible for Google Summer of Code, the plan is to
implement the first part, and if this goes more smoothly than expected and
there is still time left, the second part should also be implemented as
extra work.

-- Timeline --
24/04 - 12/05: Getting familiar with the old index
13/05 - 26/05: Come up with an exact plan for the new index and document it.
Check the feasibility of exchanging SHA-1 with a faster checksum algorithm.
27/05 - 30/06: Write the new index format to disk, rewriting only the parts
that really need to be changed (this should mean fewer writes than currently
necessary).
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I will probably only be able to
 * work half the hours. I'll be back up to full speed after that. */
01/07 - 28/07: Parse the index from the disk into the current in-memory format.
29/07 - 13/08: Optimize the algorithm and profile the gains. 

-- Why git --
I've been using git for about 2-3 years and have wanted to contribute to it
for a while, but couldn't find the time to do it. I would also like to
continue contributing once the Summer of Code is over.


* Re: [GSoC] Designing a faster index format
  2012-03-22 22:51             ` Thomas Gummerer
@ 2012-03-23 10:10               ` Thomas Rast
  2012-03-25  1:28                 ` Thomas Gummerer
  2012-03-26 20:35                 ` Thomas Gummerer
  0 siblings, 2 replies; 36+ messages in thread
From: Thomas Rast @ 2012-03-23 10:10 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, git

Hi Thomas,

Thomas Gummerer <t.gummerer@gmail.com> writes:

> 24/04 - 12/05: Getting familiar with the old index

I am aware that this falls into the period for "getting up to speed",
however I think you should spend a day or so *now* reading through
read_index, write_index and the relevant docs.  We can then give you the
brief rundown on cache_tree, and you'll have a much clearer picture of
how things fall together.

I think that should lead to a lot of enlightenment and a greatly
improved proposal.

- Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-03-23 10:10               ` Thomas Rast
@ 2012-03-25  1:28                 ` Thomas Gummerer
  2012-03-26 20:35                 ` Thomas Gummerer
  1 sibling, 0 replies; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-25  1:28 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, git

I have now read through those functions and understand the cache_tree a bit
better (thanks Thomas for the explanation :) ), and I have rewritten my
proposal draft. Any suggestions are very welcome.

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example, when adding a single file with git add, even though logically only a
single blob sha-1 changes in the index, git has to recompute the hash over
the whole index and rewrite the index file.

-- Proposed solution --
The proposed solution is to rework the index in order to make it possible to
have a complexity of O(log(n)), where n is the number of index entries, for
simple operations like adding files. More complex operations shall also
benefit from the new index format, although a complexity of O(log(n)) may
not be reached for them. Even where O(log(n)) is not achievable while
keeping the index format easy enough to parse for other programs that read
the index, the amount of data over which the hash is computed shall be
O(log(n)).

The problem will be solved in two phases. 

In phase one the physical index format shall be replaced with the new index
format, which requires neither a full recalculation of the sha-1 hash nor a
full rewrite to the disk. The new index format will be mapped to the current
in-memory structure, which will only be changed in phase two. For further
optimization the cache-tree sha-1 hashes shall be mapped to the index, which
should avoid cache-tree invalidations. In phase one the savings will mainly
be on disk I/O.

In phase two the core parts of git would be made aware of the new index format,
changing the current in-memory structure. The in-memory structure will be left
untouched in phase one, to allow for more testing and to make sure there won't
be any problems with the new index. After this phase git takes full advantage
of the features of the new index.

Another way to speed up the hashing would be to exchange the SHA-1 hash for
something faster. That may however not be feasible when keeping the current
in-memory structure.

To make the project feasible for Google Summer of Code, the plan is to
implement phase one, and if this goes more smoothly than expected and there
is still time left, phase two should also be implemented as bonus work.

-- Timeline --
24/04 - 12/05: Getting familiar with the old index
24/04 - 05/05: Decide on the data structure for the new index format and 
document it. Check the feasibility of exchanging SHA-1 with a faster hash
algorithm.
06/05 - 26/05: Map the current internal structure to the new index format.
27/05 - 23/06: Tie the cache-tree hashes to the index hashes.
24/06 - 21/07: Write the index to disk in the new format and make sure only the
changed parts are really written to disk.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 04/08: Parse the index from the disk into the current in-memory format.
05/08 - 13/08: Optimize the algorithm and profile the gains. 

-- Why git --
I've been using git for about 2-3 years and have wanted to contribute to it
for a while, but couldn't find the time to do it. I would also like to
continue contributing once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old,
from Italy. I'm currently a 3rd-year Bachelor student in Applied Computer
Science at the Free University of Bolzano. I started programming in high
school about 8 years ago with Pascal and then learned C and Java. For some of
my projects you can visit my homepage (http://tgummerer.com/projects); most
of them were for university, and some are personal projects I did in my free
time. My blog is on the same homepage, but it is not really active.
Unfortunately I haven't yet been able to participate in any bigger open
source project, although I've been interested in doing so basically since I
started programming.

On Mar 23, 2012, at 11:10 AM, Thomas Rast wrote:

> Hi Thomas,
> 
> Thomas Gummerer <t.gummerer@gmail.com> writes:
> 
>> 24/04 - 12/05: Getting familiar with the old index
> 
> I am aware that this falls into the period for "getting up to speed",
> however I think you should spend a day or so *now* reading through
> read_index, write_index and the relevant docs.  We can then give you the
> brief rundown on cache_tree, and you'll have a much clearer picture of
> how things fall together.
> 
> I think that should lead to a lot of enlightenment and a greatly
> improved proposal.
> 
> - Thomas
> 
> -- 
> Thomas Rast
> trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-03-23 10:10               ` Thomas Rast
  2012-03-25  1:28                 ` Thomas Gummerer
@ 2012-03-26 20:35                 ` Thomas Gummerer
  2012-03-26 21:14                   ` Junio C Hamano
  2012-03-27 11:47                   ` Thomas Rast
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-26 20:35 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, git


On Mar 23, 2012, at 11:10 AM, Thomas Rast wrote:

> Hi Thomas,
> 
> Thomas Gummerer <t.gummerer@gmail.com> writes:
> 
>> 24/04 - 12/05: Getting familiar with the old index
> 
> I am aware that this falls into the period for "getting up to speed",
> however I think you should spend a day or so *now* reading through
> read_index, write_index and the relevant docs.  We can then give you the
> brief rundown on cache_tree, and you'll have a much clearer picture of
> how things fall together.
> 
> I think that should lead to a lot of enlightenment and a greatly
> improved proposal.

I've updated my proposal draft, which now includes an idea for backward
compatibility and some fixes for language problems. Any comments or
suggestions are greatly appreciated.

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example, when adding a single file with git add, even though logically only a
single blob sha-1 changes in the index, git has to recompute the hash over
the whole index and rewrite the index file.

-- Proposed solution --
The proposed solution is to rework the index in order to make it possible to
have a complexity of O(log(n)), where n is the number of index entries, for
simple operations like adding files. More complex operations shall also
benefit from the new index format, although a complexity of O(log(n)) might
not be reached there while keeping the index format easy enough to parse for
.git-reading programs. The amount of data over which the hash is computed,
however, shall be O(log(n)).
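The O(log(n)) figure comes from a tree of hashes: changing one leaf only
forces the hashes on its path to the root to be recomputed. A minimal sketch
(a generic binary hash tree, not a concrete index layout):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(leaves):
    # bottom-up list of levels: leaves first, root level last
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sha(prev[i] + (prev[i + 1] if i + 1 < len(prev) else b""))
                       for i in range(0, len(prev), 2)])
    return levels

def update(levels, idx, new_leaf):
    # re-hash only the path from the changed leaf to the root: O(log n)
    levels[0][idx] = new_leaf
    touched = 0
    for depth in range(1, len(levels)):
        idx //= 2
        prev = levels[depth - 1]
        left = prev[2 * idx]
        right = prev[2 * idx + 1] if 2 * idx + 1 < len(prev) else b""
        levels[depth][idx] = sha(left + right)
        touched += 1
    return touched
```

With 1024 entries, an update touches 10 interior hashes instead of hashing
the whole file, and the incrementally updated root matches a from-scratch
rebuild.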

The problem will be solved in two phases. 

In phase one the physical index format shall be replaced with the new index
format, which requires neither a full recalculation of the sha-1 hash nor a
full rewrite of the index to the disk. The new index format will be mapped to
the current in-memory structure, which will only be changed in phase two. For
further optimization the cache-tree sha-1 hashes shall be mapped to the new
index format, which should avoid cache-tree invalidations. In phase one the
savings will mainly be on disk I/O and on commands that usually cause a lot
of cache-tree invalidations. To ensure backward compatibility, git shall keep
the ability to read version 2/3 of the index. The user shall also be able to
configure git to write the index in either the new or the old format. While
this will produce some code overhead, it will make life easier in the
transition phase for git users who don't use core git exclusively. If the
user sets the write format to the new format in an already existing version
2/3 repository, the old index will be converted to the new format.
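Keeping both readers is cheap because the on-disk index is self-describing: a
v2/v3 index starts with the 4-byte signature "DIRC" followed by a big-endian
version number and entry count, so the reader can dispatch on the version (a
sketch of just the dispatch, with a hypothetical hook for the new format):

```python
import struct

def read_index(data: bytes):
    # header layout: 4-byte signature, 4-byte version, 4-byte entry count,
    # all in network byte order
    sig, version, nentries = struct.unpack(">4sII", data[:12])
    if sig != b"DIRC":
        raise ValueError("not an index file")
    if version in (2, 3):
        return ("old-format reader", nentries)   # existing read path
    return ("new-format reader", nentries)       # hypothetical new path
```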

In phase two the core parts of git would be made aware of the new index format,
changing the current in-memory structure. The in-memory structure will be left
untouched in phase one, to allow for more testing and to make sure there won't
be any problems with the new index. After this phase git is able to take full 
advantage of the features of the new index.

Another way to speed up the hashing would be to exchange the SHA-1 hash for
something faster. That may however not be feasible while keeping the current
in-memory structure.

To make the project feasible for Google Summer of Code, the plan is to
implement phase one, and if this goes more smoothly than expected and there
is still time left, phase two should also be implemented as bonus work.

-- Timeline --
24/04 - 12/05: Getting familiar with the old index
24/04 - 05/05: Decide on the data structure for the new index format and 
document it. Check the feasibility of exchanging SHA-1 with a faster hash
algorithm.
06/05 - 26/05: Map the current internal structure to the new index format.
27/05 - 23/06: Tie the cache-tree hashes to the index hashes.
24/06 - 21/07: Write the index to disk in both the old and the new format
depending on the choice of the user and make sure only the changed parts are 
really written to disk in the new format.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 04/08: Parse the index from the disk into the current in-memory format.
05/08 - 10/08: Make sure the conversion from the old format to the new format
works without problems.
11/08 - 13/08: Test the new index and profile the gains compared to the old
format.

-- Why git --
I've been using git for about 2-3 years and have wanted to contribute to it
for a while, but couldn't find the time to do it. I would also like to
continue contributing once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old,
from Italy. I'm currently a 3rd-year Bachelor student in Applied Computer
Science at the Free University of Bolzano. I started programming in high
school about 8 years ago with Pascal and then learned C and Java. For some of
my projects you can visit my homepage (http://tgummerer.com/projects); most
of them were for university, and some are personal projects I did in my free
time. My blog is on the same homepage, but it is not really active.
Unfortunately I haven't yet been able to participate in any bigger open
source project, although I've been interested in doing so basically since I
started programming.

--
Thomas Gummerer


* Re: [GSoC] Designing a faster index format
  2012-03-26 20:35                 ` Thomas Gummerer
@ 2012-03-26 21:14                   ` Junio C Hamano
  2012-03-27 11:08                     ` Thomas Gummerer
  2012-03-27 11:47                   ` Thomas Rast
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2012-03-26 21:14 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Thomas Rast, Nguyen Thai Ngoc Duy, git

Thomas Gummerer <t.gummerer@gmail.com> writes:

> The proposed solution is to rework the index in order to make it possible to
> have a complexity of O(log(n)), where n is the number of index entries, for
> simple operations like adding files. More complex operations shall also
> benefit from the new index format, although a complexity of O(log(n)) might
> not be reached there while keeping the index format easy enough to parse for
> .git-reading programs. The amount of data over which the hash is computed,
> however, shall be O(log(n)).

Where does log(N) come from, and is that a hard requirement?

Rephrasing your problem description, you want to address the inefficiency
that we have to write the full 1m entries to a new file in order to update
only one file when the index tracks 1m paths.  

Wouldn't the goal be more in line with that problem statement if you
instead aim to make the cost proportional to the number of entries that
are updated, regardless of how many entries exist in the index?
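That goal can be pictured with an append-only journal (a hypothetical
illustration, not an actual design): an update appends one record regardless
of index size, and a reader replays the journal over the base index:

```python
# base: path -> sha1 hex; journal: list of (path, sha1-or-None) records
def apply_journal(base: dict, journal: list) -> dict:
    # writing an update costs one appended record, independent of how
    # many entries the index tracks; None marks a deletion
    merged = dict(base)
    for path, sha in journal:
        if sha is None:
            merged.pop(path, None)
        else:
            merged[path] = sha
    return merged
```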

> In phase one the physical index format shall be replaced with the new index
> format, which requires neither a full recalculation of the sha-1 hash nor a
> full rewrite of the index to the disk. The new index format will be mapped to
> the current in-memory structure, which will only be changed in phase two. For
> further optimization the cache-tree sha-1 hashes shall be mapped to the new
> index format, which should avoid cache-tree invalidations.

It is unclear what you meant by the last sentence.  The cache-tree
invalidation is a critical part of the machinery that allows later
write-tree to reuse the already computed subtree object names, without
having to recompute subtree object names until they are really necessary
(i.e. when writing a tree out).  By "avoiding" it, are you going to make
write-tree always recompute all the subtree object names?  Or are you
planning to keep the cached tree information always up to date by
recomputing subtree object names and keeping them in the index even when
they are not yet needed?  If the latter, how would you handle when a part
of the index contains unmerged entries?

Right now, the code that updates the in-core index records "Is the in-core
index clean, or modified?" bit and nothing else.  Without updating the
in-core structure and the codepaths that access it, how is it possible for
your phase I to selectively write out only the modified (either added,
updated or removed) part of it?  In other words, I do not think it is
realistic to expect that the core parts to stay oblivious to the new index
semantics during the "phase one".

> -- Timeline --
> 24/04 - 12/05: Getting familiar with the old index

It might make more sense to write the "proposed solution" *after* this
step is done; otherwise you wouldn't even know the problems you are trying
to solve.  That may mean that this part of the schedule may need to be
shortened or started way before the beginning of GSoC.


* Re: [GSoC] Designing a faster index format
  2012-03-26 21:14                   ` Junio C Hamano
@ 2012-03-27 11:08                     ` Thomas Gummerer
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-27 11:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, Nguyen Thai Ngoc Duy, git


On Mar 26, 2012, at 11:14 PM, Junio C Hamano wrote:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
> 
>> The proposed solution is to rework the index, in order to make it possible to
>> have a complexity of O(log(n)), where n is the number of index entries, for 
>> simple operations like adding files. More complex operations shall also
>> benefit from the new index format, although a complexity of O(log(n)) might not
>> be reached, while keeping the index format easy enough to parse for .git reading
>> programs. The amount of data over which the hash will be computed, however,
>> shall be log(n), where n is the number of entries in the index.
> 
> Where does log(N) come from, and is that a hard requirement?

The log(n) comes from the plan to implement the index in a tree-like structure. It
isn't a hard requirement, but it certainly shouldn't be much slower than that, otherwise
the change wouldn't make sense.

> Rephrasing your problem description, you want to address the inefficiency
> that we have to write the full 1m entries to a new file in order to update
> only one file when the index tracks 1m paths.  
> 
> Wouldn't the goal be more in line with that problem statement if you
> instead aim to make the cost proportional to the number of entries that
> are updated, regardless of how many entries exist in the index?

Making the cost proportional to the number of entries that are updated would
be the superior solution, but probably not possible to achieve.

> 
>> In phase one the physical index format shall be replaced with the new index 
>> format, which does neither require a full recalculation of the sha-1 hash nor a
>> full rewrite of the index to the disk. The new index format will be mapped to 
>> the current in-memory structure, which will only be changed in phase two. For 
>> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
>> index format, which should avoid cache-tree invalidations.
> 
> It is unclear what you meant by the last sentence.  The cache-tree
> invalidation is a critical part of the machinery that allows later
> write-tree to reuse the already computed subtree object names, without
> having to recompute subtree object names until they are really necessary
> (i.e. when writing a tree out).  By "avoiding" it, are you going to make
> write-tree always recompute all the subtree object names?  Or are you
> planning to keep the cached tree information always up to date by
> recomputing subtree object names and keeping them in the index even when
> they are not yet needed?  If the latter, how would you handle when a part
> of the index contains unmerged entries?

The plan was to do the latter, if it is feasible to keep them always up to date
without slowing other commands down. For the unmerged entries I don't
have a plan yet, but I'll try to come up with one once I understand the
source code better.

> Right now, the code that updates the in-core index records "Is the in-core
> index clean, or modified?" bit and nothing else.  Without updating the
> in-core structure and the codepaths that access it, how is it possible for
> your phase I to selectively write out only the modified (either added,
> updated or removed) part of it?  In other words, I do not think it is
> realistic to expect that the core parts to stay oblivious to the new index
> semantics during the "phase one".

When it's impossible to write only the modified part of the index without
changing the in-core parts, the in-core parts will only be changed in the
minimal way possible, adding only a flag for the changed parts, so that only
those are effectively written to disk.

>> -- Timeline --
>> 24/04 - 12/05: Getting familiar with the old index
> 
> It might make more sense to write the "proposed solution" *after* this
> step is done; otherwise you wouldn't even know the problems you are trying
> to solve.  That may mean that this part of the schedule may need to be
> shortened or started way before the beginning of GSoC.

Yes, I'm currently on this, reading through documentation and source code.


* Re: [GSoC] Designing a faster index format
  2012-03-26 20:35                 ` Thomas Gummerer
  2012-03-26 21:14                   ` Junio C Hamano
@ 2012-03-27 11:47                   ` Thomas Rast
  2012-03-29 15:21                     ` Thomas Gummerer
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Rast @ 2012-03-27 11:47 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, git, Junio C Hamano

Hi Thomas

Thomas Gummerer <t.gummerer@gmail.com> writes:

> -- Proposed solution --
> The proposed solution is to rework the index, in order to make it possible to
> have a complexity of O(log(n)), where n is the number of index entries, for 
> simple operations like adding files. More complex operations shall also
> benefit from the new index format, although a complexity of O(log(n)) might not
> be reached, while keeping the index format easy enough to parse for .git reading
> programs. The amount of data over which the hash will be computed, however,
> shall be log(n), where n is the number of entries in the index.

Can you phrase this in terms of three variables: the number of entries
'n' in the index, the number of entries 'k' (or in the case of
extensions, the amount of data) that must be changed, and the size of
the index 's'?

Bear in mind that the current code uses a lock-write-rename scheme and
I'm not sure we can (or for that matter want to) change that.

That should clear up the picture somewhat and also answer Junio's concerns.

> In phase one the physical index format shall be replaced with the new index 
> format, which does neither require a full recalculation of the sha-1 hash nor a
> full rewrite of the index to the disk. The new index format will be mapped to 
> the current in-memory structure, which will only be changed in phase two. For 
> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
> index format, which should avoid cache-tree invalidations. In phase one the 
> savings will mainly be on disk I/O and commands that usually cause a lot of 
> cache-tree invalidations. To ensure backward compatibility, git shall keep the
> ability able to read version 2/3 of the index. The user shall also have the 
> possibility to configure git to write the index in either the new or the old 
> format. While this will produce some code overhead, it will make the life of git 
> users which don't use core git exclusively easier in the transition phase. 
> If the user sets the write format to the new format and the repository is a 
> already existing version 2/3 repository, the old index will be transformed to 
> the new format.

Can you split this somewhat?  My wall-of-text-detector is complaining.

- Backwards compatibility: good.  What happens after stage two, are we
  still going to support writing the old format?  If not, why?

- cache-tree: I'm not sure you can skip the invalidations, at least not
  without the tie-in.  We primarily want to be able to change the
  cache-tree data -- or for that matter any extension data -- cheaply
  without rewriting (or rehashing, see the s/n/k issue) the entire
  index.

- cache-tree tie-in: I realize I tossed this in the air without any
  hints on how you could achieve it, but since you also allocate quite
  some development time to it I think you should outline a plan on how
  to do this.

- I'm sure you can find another point :-)

> Another way to speed up the hashing would be to exchange the SHA-1 hash for
> something faster. That may however not be feasible while keeping the current
> in-memory structure.

Hum, if you look where and over what this hash is currently computed,
you'll see what the right answer to that last non-question is...

> 24/04 - 12/05: Getting familiar with the old index

What Junio said.  On top of that, I think this shouldn't take three
weeks even if you still spend some time within the project.

> 24/06 - 21/07: Write the index to disk in both the old and the new format
> depending on the choice of the user and make sure only the changed parts are 
> really written to disk in the new format.

What's interesting here is how you determine the changed part.  Are you
planning to change all users to properly mark changed entries, so that
everyone benefits (at a huge amount of code churn) or do you want to
take another approach?

> /* Development work will be a bit slower from 18/06 to 21/07 because at my
>  * University there are exams in this period. I probably will only be able to
>  * work half the hours. I'll be back up to full speed after that. */

Out of curiosity, what courses are you taking exams in?

> 22/06 - 04/08: Parse the index from the disk to the current in-memory format.
> 05/08 - 10/08: Make sure the conversion from the old format to the new format
> works without problems.
> 11/08 - 13/08: Test the new index and profile the gains compared to the old
> format.

I see this partially overlaps the entries before it.  What's the plan here?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-03-27 11:47                   ` Thomas Rast
@ 2012-03-29 15:21                     ` Thomas Gummerer
  2012-03-29 21:06                       ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-03-29 15:21 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Nguyen Thai Ngoc Duy, git, Junio C Hamano

After the discussion on IRC, and after taking your suggestions into account,
I've rewritten my proposal draft as follows.

(@Thomas Rast: Since you asked, here are the courses I'm currently 
taking: 
Team Work Management
Communication and Presentation skills
(Those are general skills, we have to do two of them)

Introduction to Artificial Intelligence
Functional and Logic Programming Languages)

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example for adding a single file with git add, even though logically only one 
single blob sha-1 is changed in the index, git has to recompute the hash over 
the whole index and rewrite the index file.

-- Proposed solution --
The proposed solution is to rework the index, using an append-only data
structure. That shall allow for changes to the index in O(k) time, with k being
the number of files that changed. To reach this goal, the first part of the
index will be sorted and changes will always be written to the end, in the order
in which the changes occurred. This last part of the index will be merged with
the rest using a heuristic; the merge will be triggered by the execution of git
commit and git status.
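The mechanics of that append-only scheme can be sketched in miniature. The
following Python model is purely illustrative (git itself is written in C, and
all names here are invented): a sorted base section, an appended log that wins
on lookup, and a merge() standing in for the commit/status heuristic.

```python
# Toy model of the proposed append-only index: a sorted base section plus
# a log of appended records; the log is folded back into the base by a
# heuristic (here triggered manually, standing in for commit/status).
import bisect

class AppendOnlyIndex:
    def __init__(self, entries=None):
        self.base = sorted(entries or [])   # sorted (path, sha1) pairs
        self.log = []                       # appended records, in change order

    def add(self, path, sha1):
        # O(1): always append, never rewrite the sorted part
        self.log.append((path, sha1))

    def lookup(self, path):
        # the newest appended record wins over the sorted base
        for p, s in reversed(self.log):
            if p == path:
                return s
        i = bisect.bisect_left(self.base, (path,))
        if i < len(self.base) and self.base[i][0] == path:
            return self.base[i][1]
        return None

    def merge(self):
        # heuristic merge point: fold the log into the sorted base
        d = dict(self.base)
        d.update(self.log)
        self.base = sorted(d.items())
        self.log = []

idx = AppendOnlyIndex([("Makefile", "aaa"), ("cache.h", "bbb")])
idx.add("cache.h", "ccc")      # an update is appended; the base is untouched
print(idx.lookup("cache.h"))   # ccc
idx.merge()
print(len(idx.log), len(idx.base))  # 0 2: the duplicate was folded away
```

Note that add() never touches the sorted part, which is what makes a write cost
O(k) in the number of changed files rather than O(n) in the index size.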

To make sure the index isn't corrupted, without calculating the sha-1 hash for
the whole index file every time something is changed. The hash is always
calculated for the whole index when merging, but when a file is changed the
sha-1 hash is only calculated for the last entry. This has some cost when
reading the index, but the performance improvements when writing the index
should more than make up for that. 

The index format shall also save on file size, by storing the paths only once,
which currently are stored for every file. This can make a big difference
especially for repositories with deep directory structures. (e.g. in the webkit
repository this change can save about 10MB on the index). In addition to that it
can also give the index a structure, which in turn makes it easier to search
through, except for the appended part, which will never grow big enough to
make going through it costly.
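The size saving from storing each directory path once can be illustrated with a
quick back-of-the-envelope computation. The paths below are hypothetical; the
10MB webkit figure above is the proposal's own estimate, not reproduced here.

```python
# Illustrative layout: store each directory once, then only basenames,
# instead of repeating the full path for every entry (as index v2 does).
paths = [
    "Source/WebCore/dom/Node.cpp",
    "Source/WebCore/dom/Node.h",
    "Source/WebCore/dom/Element.cpp",
]

# v2-style cost: the full path is stored for every entry
flat = sum(len(p) for p in paths)

# deduplicated cost: each directory once, plus basenames
dirs = {}
for p in paths:
    d, _, name = p.rpartition("/")
    dirs.setdefault(d, []).append(name)

deduped = (sum(len(d) for d in dirs)
           + sum(len(n) for names in dirs.values() for n in names))
print(flat, deduped, flat - deduped)  # 82 43 39
```

Even on three files in one deep directory the path bytes roughly halve; the
effect compounds on repositories with deep directory structures.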

This structure in the index could also help implement fast partial loading,
from which git status and git diff could benefit. This benefit would be fully
exploited by an implementation with inotify, which Thomas Rast is working on.

In order to be able to only rewrite a part the way the lock file currently works
has to be changed. Currently the new index is always written into the lock file,
which then is renamed once the writing is complete. Since we now change to an
append only format, if the operation fails, only the last record in the index
is corrupted and we can easily recover by removing the corrupt record. When the
is merged the old lock file algorithm will be used.

To maintain the speed even if a file is already in the index, git will always
append a new record on the end, instead of finding the record in the index and
changing it. This will give us a lot of benefit in terms of speed, since
otherwise we would have to search the record (log(n)), change it (1) and update
the sha-1 hash over the whole file (n).

The current in-memory structure of git would only have to be slightly changed,
adding a flag or keeping a list of the changed files. The rest of it could be
changed in another step, to better represent the new structure of the index.

The extensions (cache-tree and resolve undo at the moment) will be handled the
same way, appending changes to the end and merging them when the rest of the
index is merged. This gives changes to extension entries a complexity of O(1)
and will not make reading much slower. Because the extensions are merged at the
same time as the rest of the index, the merge won't affect performance a lot
either.

To ensure backward compatibility, git shall keep the ability to read version
2/3 of the index. The user shall also have the possibility to configure git to
write the index in either the new or the old format. While this will produce
some code overhead, it will make the transition phase easier for git users who
don't use core git exclusively. If the user sets the write format to the new
format and the repository is an already existing version 2/3 repository, the
old index will be transformed to the new format.

Backward compatibility will be broken (at least for write) once the in-memory
structure is changed. Git could still keep compatibility for reading the index
and converting it to the new in-memory format, but keeping write compatibility
would slow down the operation and mess up the code.

There were also other formats considered for the usage in the index, which I
will list below with their advantages and disadvantages.

- Database format
Having the index in a database format would be very efficient in terms of speed.
The integrity check also wouldn't be necessary, or rather could be handled by
the database. The disadvantage of using the database format, however, would be
the need to use an external library (or to write one, but that wouldn't make any
sense). Another problem with the database is that programs like jgit and libgit2
would probably have more problems implementing it, and it would make backward
compatibility unreasonably hard and mess up the source code at the same time.

- Tree structure
The advantage of the tree structure over the append-only data structure that was
proposed would be that it would not require any merge or other maintenance work
after a change of a file. The disadvantage, however, is that changing a file
would always require log(n) changes to the index, where n is the number of
entries in the index. Another problem might be the integrity check, which would
need either a hash over the whole index file or a hash on the tree, which
would, however, take more time for checking.

- Padded structure
Another index format that was taken into consideration was to create a padded
structure. Basically that would be an append-only-like structure, but split into
sections, where every section would leave some space at the end for appending.
This structure would not bring many advantages (probably the only advantages
being a bit of a speed-up on merging and maybe slightly faster partial reading)
compared to the append-only structure. Meanwhile it would probably lead
to messier code and would be harder to parse for other programs that use
.git.

To make the project feasible for Google Summer of Code, all the features above
will be implemented except for partial reading, and the current in-memory
structure will be kept, except for the changes necessary to keep track of the
changed entries.

-- Timeline --
24/04 - 29/04: Document the new index format.
30/04 - 15/05: Map the current internal structure to the new index format.
16/05 - 02/06: Change the current in-memory structure to keep track of the
changed files.
02/06 - 16/06: Write the index to disk in both the old and the new format
depending on the choice of the user and make sure only the changed parts are 
really written to disk in the new format.
17/06 - 21/07: Parse the index from disk to the current in-memory format.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 10/08: Merge appended changes with the rest of the index.
11/08 - 13/08: Test the new index and profile the gains compared to the old
format.

-- Why git --
I've been using git for about 2-3 years and wanted to contribute to it earlier,
but couldn't find the time to do it. I would also like to continue contributing
once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old
from Italy. I'm currently a 3rd year Bachelor student in Applied Computer 
Science at the Free University of Bolzano. I started programming in High School 
about 8 years ago with Pascal and then learned C and Java. For some of my
projects you can visit my homepage (http://tgummerer.com/projects); most of them
are from university, along with some personal projects I did in my free time. My
blog is also on the same homepage, but not really active. Unfortunately I
haven't yet been able to participate in any bigger open source project, although
I've been interested in it basically since I started programming. 


* Re: [GSoC] Designing a faster index format
  2012-03-29 15:21                     ` Thomas Gummerer
@ 2012-03-29 21:06                       ` Junio C Hamano
  2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2012-03-29 21:06 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Thomas Rast, Nguyen Thai Ngoc Duy, git, Junio C Hamano

Thomas Gummerer <t.gummerer@gmail.com> writes:

> After the discussion on IRC, and after taking your suggestions into account,
> I've rewritten my proposal draft as follows.

Just a few comments after a cursory review over the draft.

> -- Problem --
> The current git index is pretty slow when working with large git repositories,
> because the whole index has to be rewritten for nearly every operation. For
> example for adding a single file with git add, even though logically only one 
> single blob sha-1 is changed in the index, git has to recompute the hash over 
> the whole index and rewrite the index file.

Is it really the time consuming part in the overall picture?  Do you have
vital statistics for various pieces to justify that you are solving the
right problem?  E.g. (not exhaustive)

 - read_index(): overhead to
   - validate the checksum of the whole index;
   - extract entries into the_index.cache[] array;

 - write_index(): overhead to
   - serialize entries into the on-disk format;
   - compute the checksum over the entire index;
   - write the whole index to disk;

 - frequency of the above two operations.  

I think the current code tries to do the checksumming, in both read and
write directions, in line with the actual I/O operation, not as a separate
phase.  You may have to instrument (read: butcher) the current code to
replace the streaming checksum code with no-op to measure the checksum
overhead.  In other words, running "time git update-index foo" from the
command line and subtracting "time sha1sum .git/index" would not be a
right way to compute it (it subtracts both disk I/O and checksum
overhead).
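A rough sketch of the kind of instrumentation being suggested, in Python rather
than by patching the C code (buffer size and the no-op stand-in are arbitrary
choices for illustration): time the checksum pass in isolation, so that disk
I/O and hashing are not conflated the way the sha1sum subtraction would.

```python
# Rough measurement harness in the spirit of the suggestion above: time
# the streaming checksum separately by comparing it against a no-op.
# The 40MB buffer is an arbitrary stand-in for a large index file.
import hashlib, time

data = b"\0" * (40 * 1024 * 1024)

def timed(f):
    t0 = time.perf_counter()
    f()
    return time.perf_counter() - t0

sha_cost  = timed(lambda: hashlib.sha1(data).digest())  # checksum pass
noop_cost = timed(lambda: len(data))                    # no-op baseline
print(f"sha1: {sha_cost:.4f}s, no-op: {noop_cost:.6f}s")
```

A real measurement would of course instrument read_index()/write_index()
themselves, replacing the in-line streaming checksum with a no-op as described
above, and would also record how often each codepath runs in a working day.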

Also, optimizing the write codepath by penalizing the read codepath is a
wrong trade-off if the index is read far more often than it is written
during a single day of a developer.

> -- Proposed solution --
> The proposed solution is to rework the index, using a append-only data
> structure. That shall allow for changes to the index in O(k) time, with k being
> the number of files that changed. To reach this goal, the first part of the
> index will be sorted and changes will always be written to the end, in the order
> in which the changes occured. This last part of the index will be merged with
> the rest using a heuristic, which will be the execution of git commit and git
> status.
>
> To make sure the index isn't corrupted, without calculating the sha-1 hash for
> the whole index file every time something is changed.

This is not a sentence. "do Z" in "To do X without Y, do Z" is missing.

> The hash is always
> calculated for the whole index when merging, but when a file is changed the
> sha-1 hash is only calculated for the last entry.

I know this is still a loosely written draft, but "only calculated for the
last entry" is very disturbing.  When we say "entry" in the context of
index we always mean a single cache entry, but obviously that is not what
you are using this word for.  You mean the last "batch" of entries in a
log structured index file format.

The new batch in the log structured file, if it were to be protected with
the checksum, should take the checksum of the previous part into account
when computing the checksum of the new part, by chaining.  Imagine an
index file that was written twice, consisting of parts A and B, in an
append-only fashion. Somewhere in part A (probably near the end but it
does not have to be), there is a checksum to verify part A's consistency.
If the checksum for B is derived solely on the data in B, a bug could
replace the part B with different data C that satisfy its self consistency
check, but when you take A followed by C together, the result may not make
sense. One technique to avoid such mistakes is to use checksum in A and
data in B to compute checksum for B.
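The chaining scheme can be shown in a few lines of Python (the batch contents
here are arbitrary placeholders): the checksum of batch B covers A's checksum
as well, so a self-consistent replacement C still fails verification of the
file as a whole.

```python
# Chained batch checksums: each batch's sha-1 is seeded with the
# previous batch's checksum, so batches cannot be swapped in isolation.
import hashlib

def chained(prev_sum, batch):
    return hashlib.sha1(prev_sum + batch).digest()

sum_a = chained(b"", b"batch A: initial entries")
sum_b = chained(sum_a, b"batch B: appended entries")

# Verification replays the chain from the start of the file:
check = chained(chained(b"", b"batch A: initial entries"),
                b"batch B: appended entries")
print(check == sum_b)  # True

# A bogus batch C that only checks its own data would not match:
sum_c = chained(b"", b"batch C: unrelated data")
print(sum_c == sum_b)  # False
```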

> This has some cost when
> reading the index, but the performance improvements when writing the index
> should more then make up for that. 

Obviously, "should more than make up" needs substantiation.

> ... In addition to that it
> can also give the index a structure, which in turn makes it easier to search
> through.

I do not think you are saying "The current index does not have a structure
so it is hard to search through. Let's give it a structure to make it easy
to search.", but then it becomes unclear what the purpose of saying "give
the index a structure" is here.  The sentences in this paragraph may need to
be reworked somewhat.

> In order to be able to only rewrite a part the way the lock file currently works
> has to be changed. Currently the new index is always written into the lock file,
> which then is renamed once the writing is complete. Since we now change to an
> append only format, if the operation fails, only the last record in the index
> is corrupted and we can easily recover by removing the corrupt record.

Careful.

Two processes trying to append to the same index at the same time still
need to be coordinated via some kind of lock.

> When the
> is merged the old lock file algorithm will be used.

-ECANNOTPARSE.

> To maintain the speed even if a file is already in the index, git will always
> append a new record on the end, instead of finding the record in the index and
> changing it.

This makes it sound as if you think you can append without reading, but I
do not think that is what you meant. You would need to know what is in the
index to deal with D/F conflicts, so you need to "find" the entry (and
paths related to it) first. The append-only arrangement allows you to
avoid updating in place, which is the plus (i.e. "changing it" part in the
above sentence is valid, "finding" is not).

> This will give us a lot of benefit in terms of speed, since
> otherwise we would have to search the record (log(n)), change it (1) and update
> the sha-1 hash over the whole file (n).

See the comment about on "search the record" part.

Of course the reader suffers because it needs to read more to learn what
the end result of replaying all the log records is. The question is by how
much.

Also you would need to worry about the case where an index entry is
removed (your log record needs to be able to express "remove this path").

> Backward compatibility will be broken (at least for write) once the in-memory
> structure is changed.

That is totally backwards, isn't it?

The in-memory structure can be improved in new implementation without any
compatibility issues as long as the on-disk format is not changed.

> Git could still keep compatibility for reading the index
> and converting it to the new in-memory format, but keeping write compatibility
> would slow down the operation and mess up the code.

The new on-disk format is poorly designed if that has to happen.

> -- Timeline --
> 24/04 - 29/04: Document the new index format.
> 30/04 - 15/05: Map the current internal structure to the new index format.

That sounds very aggressive, as I am assuming that by Apr 29 you will have
all the data that justifies that the "new index format" will solve whatever
problem you are solving, and also justifies that the "problem" you are solving
is really what we want to solve (e.g. optimizing for writers by penalizing
readers, when readers are predominant, is not solving the right problem).


* Re: [GSoC] Designing a faster index format
  2012-03-29 21:06                       ` Junio C Hamano
@ 2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
  2012-04-02 21:02                           ` Thomas Gummerer
  0 siblings, 1 reply; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-30  5:19 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Junio C Hamano, Thomas Rast, git

On Fri, Mar 30, 2012 at 4:06 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Is it really the time consuming part in the overall picture?  Do you have
> vital statistics for various pieces to justify that you are solving the
> right problem?  E.g. (not exhaustive)
>
>  - read_index(): overhead to
>   - validate the checksum of the whole index;
>   - extract entries into the_index.cache[] array;
>
>  - write_index(): overhead to
>   - serialize entries into the on-disk format;
>   - compute the checksum over the entire index;
>   - write the whole index to disk;
>
>  - frequency of the above two operations.

Also maybe the frequency of entry updates vs additions/removals. I
suspect the refresh operation can in some cases update a lot of entries. If
that's the case (and it happens often), we may need special treatment for
it because simply appending entries might be costly.

> Also, optimizing the write codepath by penalizing the read codepath is a
> wrong trade-off if the index is read far more often than it is written
> during a single day of a developer.

I suspect so too, but some measurement has to be done there. It'd be
good if you provided a patch to collect index operation statistics.
Some of us could try it out for a few weeks. That would give us a better
picture.

On Thu, Mar 29, 2012 at 10:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> - Tree structure
> The advantage of the tree structure over the append-only data structure that was
> proposed would be that it would not require any merge or other maintenance work
> after a change of a file. The disadvantage, however, is that changing a file
> would always require log(n) changes to the index, where n is the number of
> entries in the index. Another problem might be the integrity check, which would
> need either a hash over the whole index file or a hash on the tree, which
> would, however, take more time for checking.

I'd say it takes less time for checksumming because we only verify the
trees we read. And a tree-based structure allows us to read just a
subdirectory, an advantage for multi-project repositories where people
stay in a subdir most of the time. Shawn suspected a crc32 checksum over
a large chunk of data may be insufficient. By hashing tree by tree,
the chunks are significantly shorter, making crc32 a viable and cheap
candidate compared to sha-1 (also note that crc32 is used to verify
compressed objects in a pack).
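A small Python sketch of that trade-off (the chunk contents are placeholders;
zlib.crc32 is the same CRC-32 used to verify pack objects): a whole-file sha-1
must touch every chunk on each read, while per-tree crc32 lets a reader verify
only the chunk it actually loads.

```python
# Per-tree crc32 vs. whole-file sha-1: each chunk gets its own cheap
# checksum, so verifying one subdirectory touches only its own chunk.
import hashlib, zlib

chunks = [b"subdir A entries", b"subdir B entries", b"subdir C entries"]

# whole-file hash: any read must checksum everything
whole = hashlib.sha1(b"".join(chunks)).hexdigest()

# per-chunk crc32: reading subdir B verifies only its own chunk
crcs = [zlib.crc32(c) & 0xffffffff for c in chunks]
assert zlib.crc32(chunks[1]) & 0xffffffff == crcs[1]
print(len(whole), len(crcs))  # 40 3
```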
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
@ 2012-04-02 21:02                           ` Thomas Gummerer
  2012-04-03  8:51                             ` Michael Haggerty
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gummerer @ 2012-04-02 21:02 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Junio C Hamano, Thomas Rast, git

After taking into consideration Junio's and Duy's comments, and some more
discussion on IRC, I have written a new draft of the proposal. The proposed
solution is based on a tree structure, as the append-only structure penalizes
the read path, which, after doing some measurements, isn't the right tradeoff.

Included in this proposal are also the index formats that were taken into
consideration but scrapped for one or another reason.

First I'd like to thank all those who helped me by discussing the ideas here
on the mailing list or over at the #git-devel irc channel. 

-- Credits --
Big thanks for discussing the ideas with me on IRC go to charon, jast,
mhagger, shruggar, GitZilla, andrew_sayers and barrbrain. Hope I didn't forget 
anyone. Thanks also to those who discussed my ideas with me on the mailing list
(Nguyen Thai Ngoc Duy, Thomas Rast, Junio C Hamano)

And here is the proposal:
Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example, when adding a single file with git add, even though logically only one
single blob sha-1 is changed in the index, git has to recompute the hash over
the whole index and rewrite the index file. In addition, the speed-up for
writing the index cannot come at the expense of the read operation, since
reading is executed more often throughout a normal developer's day than
writing the index. The current index also doesn't allow simple partial reading,
which could help speed up some commands, like git status and git diff, which
often don't need the whole index.

-- Proposed solution --
The proposed solution is to redesign the index to a B-tree based format. This
allows changes to the index in O(log(n)) time, with n being the number of
entries in the index. 

The sha-1 hash, used to verify that the index isn't corrupted, will have to be
computed over log(s) data in the worst case, where s is the size of the index,
which will be bigger than the number of entries because of the structure of the
b-tree.
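A toy model of the bucket idea, in Python for brevity (git itself is C; the
bucket size, names, and the flat bucket list standing in for a real b-tree are
all simplifications): an update locates its bucket in O(log n) and rehashes
only that one bucket, instead of checksumming the whole index.

```python
# Toy bucketed index: entries split into fixed-size buckets, each with
# its own sha-1, so an update rehashes one bucket rather than everything.
import bisect, hashlib

BUCKET = 2  # arbitrary toy node size

def build(entries):
    entries = sorted(entries)
    buckets = [entries[i:i + BUCKET] for i in range(0, len(entries), BUCKET)]
    return [{"keys": [e[0] for e in b],
             "entries": dict(b),
             "sha1": hashlib.sha1(repr(b).encode()).hexdigest()}
            for b in buckets]

def update(index, path, sha1):
    # find the bucket by its first key: O(log #buckets)
    firsts = [b["keys"][0] for b in index]
    i = max(0, bisect.bisect_right(firsts, path) - 1)
    b = index[i]
    b["entries"][path] = sha1
    items = sorted(b["entries"].items())
    b["keys"] = [k for k, _ in items]
    # only this bucket's checksum is recomputed
    b["sha1"] = hashlib.sha1(repr(items).encode()).hexdigest()

idx = build([("a.c", "1"), ("b.c", "2"), ("c.c", "3"), ("d.c", "4")])
before = [b["sha1"] for b in idx]
update(idx, "b.c", "9")
after = [b["sha1"] for b in idx]
print([x != y for x, y in zip(before, after)])  # [True, False]
```

The same per-bucket hashes are what make partial loading cheap: a reader that
needs only one bucket verifies only that bucket's checksum.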

The new index format will also save on the size of each entry, by storing each
path only once and using it for all files in the same directory. This will
reduce the amount of data we have to calculate the hash over when reading and
writing the index, thus giving faster operations especially on repositories
with deep directory structures.

The tree structure will also allow for fast partial loading of the index. Since
every bucket of the tree will have its own hash, the hash will only need to be
calculated for the parts we load. This reduces the amount of data for which the
hash needs to be calculated. Commands like git status and git diff will
benefit from this. The benefit could be fully exploited by an implementation
with inotify, of which Thomas Rast created a POC.
(https://github.com/trast/git/commit/6c9825fdca76d01fb5a83923558831743ae477bc)

In order to be able to only rewrite the parts of the index that really changed,
the current lock file structure will have to be changed. The new lock file
will keep a journal of the changes, to be able to recover in case of a crash.
Deleting the lockfile "commits" the changes.
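The journaling scheme above can be sketched as follows (hypothetical names;
recovery itself is not shown): every change is first recorded in the journal,
then applied in place, and deleting the journal "commits".

```python
import os, tempfile

# A journal left over after a crash would let recovery redo or discard
# the half-applied changes.
index = {"a.c": "sha-old"}
journal = os.path.join(tempfile.mkdtemp(), "index.lock")

def apply_change(path, sha):
    with open(journal, "a") as j:
        j.write(f"{path} {sha}\n")       # record intent first...
        j.flush()
        os.fsync(j.fileno())
    index[path] = sha                    # ...then mutate the index

apply_change("a.c", "sha-new")
assert os.path.exists(journal)           # a crash here leaves the journal
os.unlink(journal)                       # deleting the journal == commit
assert not os.path.exists(journal)
```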

The in-memory structure will be modified in two steps. In the first step it
will only be modified to keep track of the changes, in order to be able to only
write the changes to disk and eliminate the need to rewrite the whole tree. In 
the second step the in-memory structure will be changed to reflect the tree 
format, getting the full benefits from the new format.

To ensure backward compatibility, git shall keep the ability to read version
2/3 of the index. The user shall also have the possibility to configure git to
write the index in either the new or the old format. While this will produce
some code overhead, it will make the life of git users who don't use core git
exclusively easier in the transition phase. If the user sets the write format to
the new format and the repository is an already existing version 2/3 repository,
the old index will be transformed to the new format. Transformations in the
other direction will also be possible, since the in-memory format is the same.

Once the in-memory structure is changed, making git backward compatible will
still be possible, even though it will come at the expense of the 
reading/writing time of the old index, since the tree will have to be 
constructed in memory from the on disk format, and transformed back to the flat
format when writing it back to disk.

To make the project feasible for Google Summer of Code, the in-memory structure
will only be modified to keep track of the changes, without making it
tree-structured yet, and partial loading will not be implemented.


-- Solutions that were also considered --
These solutions will not be described in as much detail as the tree structure,
but they were also considered in the process of choosing the best index format
and are described below.

- Append-only data structure
An append-only data structure will allow for changes to the index in O(k) time,
with k being the number of files that changed. To reach this goal, the first
part of the index will be sorted and changes will always be written to the end,
in the order in which the changes occurred. This last part of the index will be
merged with the rest using a heuristic, namely on the execution of git commit
and git status.

To make sure the index isn't corrupted, without calculating the sha1 hash for
the whole index file every time something is changed, the hash is always
calculated for the whole index when merging, but when only a single entry is 
changed the sha-1 hash is only calculated for the last change. This will 
increase the cost for reading the index to log(n) + k * log(k) where n is the 
number of entries in the sorted part of the index and k is the number of entries
in the unsorted part of the index, which will have to be merged with the rest 
of the index.
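The read path of this layout can be sketched as follows (Python for brevity,
illustrative only; entry contents are made up): a binary search over the sorted
base plus a linear scan of the appended tail, where later appends win.

```python
import bisect

base = [("a.c", "s1"), ("b.c", "s1"), ("d.c", "s1")]   # sorted part
tail = [("b.c", "s2"), ("e.c", "s1")]                  # appended, in change order

def lookup(path):
    # O(log n) binary search in the sorted part ...
    hit = None
    i = bisect.bisect_left(base, (path,))
    if i < len(base) and base[i][0] == path:
        hit = base[i]
    # ... plus an O(k) linear scan of the unsorted tail.
    for entry in tail:
        if entry[0] == path:
            hit = entry
    return hit

assert lookup("b.c") == ("b.c", "s2")   # tail overrides the base entry
assert lookup("e.c") == ("e.c", "s1")   # only present in the tail
assert lookup("a.c") == ("a.c", "s1")   # untouched base entry
```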

The index format shall also save on file size, by storing the paths only once,
which currently are stored for every file. This can make a big difference
especially for repositories with deep directory structures. (e.g. in the webkit
repository this change can save about 10MB on the index). In addition to that it
can also give the index a structure, which in turn makes it easier to search
through, except for the appended part, which should never grow big enough to
make a linear search through it costly.

With this index format the lock file could be a simple empty file, since if an
operation fails, only the last entry would be corrupted, making it easy to
recover.

The current in-memory structure of git would only have to be slightly changed,
adding a flag or keeping a list of the changed files. The rest of it could be
changed in another step, to better represent the new structure of the index.

To ensure backward compatibility, git shall keep the ability to read
version 2/3 of the index. The user shall also have the possibility to configure
git to write the index in either the new or the old format. While this will
produce some code overhead, it will make the life of git users who don't use
core git exclusively easier in the transition phase. If the user sets the write
format to the new format and the repository is an already existing version 2/3
repository, the old index will be transformed to the new format.

This idea was dropped because, as mentioned in the problem description, reads
are more common than writes, and therefore trading read speed for write speed
is not a good tradeoff.

- Database format
A database format as index structure will allow for changes to the index in
O(log(n)) time for a single change, with n always being the number of entries
in the index, assuming a b-tree index on the path in the database.

The check for index corruption could be left to the database, which would
have about the same cost as the check in the tree structure, in both the read
and the write direction. Partial loading with the right indices would also have
the same cost as the tree structure, by executing a simple select query.
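A sketch of this alternative with a hypothetical schema (sqlite3 here only for
illustration; no particular database library is being proposed): a b-tree index
on the path gives O(log n) single-entry updates, and partial loading is a
simple select.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# PRIMARY KEY on path creates the b-tree index on it.
db.execute("CREATE TABLE entries (path TEXT PRIMARY KEY, sha1 TEXT)")
db.executemany("INSERT INTO entries VALUES (?, ?)",
               [("src/a.c", "aaa"), ("src/b.c", "bbb"), ("doc/c.txt", "ccc")])

# Single-entry change: O(log n) through the b-tree on path.
db.execute("UPDATE entries SET sha1 = 'ddd' WHERE path = 'src/a.c'")

# Partial load: only the subtree a command actually needs.
rows = db.execute("SELECT path, sha1 FROM entries"
                  " WHERE path LIKE 'src/%' ORDER BY path").fetchall()
assert rows == [("src/a.c", "ddd"), ("src/b.c", "bbb")]
```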

The drawback of this solution, however, would be the introduction of a database
library (one could also write one, but that certainly would be overkill).
Another drawback is that it would be harder to read for programs like libgit2
and jgit, and the codebase certainly wouldn't get any cleaner.

- Padded structure
Another index format that was taken into consideration was a padded
structure. Basically that would be an append-only-like structure, but split into
sections, where every section would leave some space at the end for appending
changes. This would leave us with a complexity of O(k), where k is the number of
sections.

This structure would allow faster partial loading, if the sections are split
in the right way.

The structure was however only briefly taken into consideration, since it would
have close to no advantages (except for the partial loading) compared to the
append-only structure, would make the index file larger due to the spaces for
appending and have the same drawbacks as the append-only structure.

-- Timeline --
24/04 - 01/05: Document the new index format.
02/05 - 21/05: Map the current internal structure to the new index format.
22/05 - 07/06: Change the current in-memory structure to keep track of the
changed files.
08/06 - 16/06: Write the index to disk in both the old and the new format
depending on the choice of the user and make sure only the changed parts are 
really written to disk in the new format.
17/06 - 21/07: Parse the index from disk to the current in-memory format.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 10/08: Read the new structure and map it to the current in-memory
format. Make sure the old format still gets read correctly.
11/08 - 13/08: Test the new index and profile the gains compared to the old
format.

-- Why git --
I've been using git for about 2-3 years and have wanted to contribute to it
earlier, but couldn't find the time to do it. I would also like to continue
contributing once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old
from Italy. I'm currently a 3rd year Bachelor student in Applied Computer 
Science at the Free University of Bolzano. I started programming in High School 
about 8 years ago with Pascal and then learned C and Java. For some of my
projects you can visit my homepage (http://tgummerer.com/projects); most of them
are from university, plus some personal projects I did in my free time. My blog
is also on the same homepage, but not really active. Unfortunately I haven't yet
been able to participate in any bigger open source project, although I've been
interested in doing so basically since I started programming.


* Re: [GSoC] Designing a faster index format
  2012-04-02 21:02                           ` Thomas Gummerer
@ 2012-04-03  8:51                             ` Michael Haggerty
  2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
  2012-04-03 19:07                               ` Thomas Gummerer
  0 siblings, 2 replies; 36+ messages in thread
From: Michael Haggerty @ 2012-04-03  8:51 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, Junio C Hamano, Thomas Rast, git

On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
> And here is the proposal:
> Designing a faster index format
> 
> -- Problem --
> The current git index is pretty slow when working with large git repositories,
> because the whole index has to be rewritten for nearly every operation. For
> example for adding a single file with git add, even though logically only one 
> single blob sha-1 is changed in the index, git has to recompute the hash over 
> the whole index and rewrite the index file. In addition to that the speed up 
> for writing the index can not come to the expense of the read operation, since 
> it is executed more often throughout a normal developers day compared to
> writing the index. The current index also doesn't allow simple partial reading,
> which could help speed up some commands, like git status and git diff, which 
> often don't need the whole index.
> 
> -- Proposed solution --
> The proposed solution is to redesign the index to a B-tree based format. This
> allows changes to the index in O(log(n)) time, with n being the number of
> entries in the index. 

I thought that the index lock currently only blocks writers, not readers
(am I wrong?).  So given that you want to be able to mutate the index
file without rewriting the whole file, it seems to me that you have to
pick from one of these alternatives:

1. Change the locking semantics so that readers also block when the file
is locked.  This choice would have some important consequences: (a)
readers will also have to obtain a lock before starting to read.  (b) to
avoid deadlock, it will become crucial that the lock is never held
across the execution of any other git (sub-)commands that might want to
read the index.

2. Implement a file format that can be read even while it is being
mutated.  If so, please explain the data file format in more detail; in
particular, how do you plan to mutate the file in a way that does not
disturb readers?  How do you plan to read the whole index efficiently (I
imagine that reading the whole index will remain a frequent operation)?

I encourage you to include an analysis of the number of disk seeks when
you are analyzing the cost of read/write operations on the index.  This
will have a strong effect on the time for working with the index when
the disk cache is cold.  The current index requires O(1) seeks for
reading and writing, which I believe is a big part of the reason that
the current read-the-whole-index/write-the-whole-index design performs
excellently despite the amount of data that it is touching.

> [...]
> - Append-only data structure
> An append-only data structure will allow for changes to the index in O(k) time, 
> with k being the number of files that changed. To reach this goal, the first 
> part of the index will be sorted and changes will always be written to the end, 
> in the order in which the changes occured. This last part of the index will be 
> merged with the rest using a heuristic, which will be the execution of git 
> commit and git status.
> 
> To make sure the index isn't corrupted, without calculating the sha1 hash for
> the whole index file every time something is changed, the hash is always
> calculated for the whole index when merging, but when only a single entry is 
> changed the sha-1 hash is only calculated for the last change. This will 
> increase the cost for reading the index to log(n) + k * log(k) where n is the 
> number of entries in the sorted part of the index and k is the number of entries
> in the unsorted part of the index, which will have to be merged with the rest 
> of the index.

I don't understand this analysis of the reading time.  I suppose you are
assuming that you want to read the status of a single file.  But in that
case, it is enough to find the entry in the old index (O(log(n))
assuming some sort of tree structure) plus do a linear scan through the
unsorted entries (i.e., O(k), not O(k log(k))).

> [...]
> This [append-only] idea was dropped, because as mentioned in the problem description reads
> are more common then writes and therefore trading write speed for read speed
> is not a good tradeoff.

The amount of read speed that would have to be sacrificed depends on the
size of k and n.  Under the assumption that k << n, the read speed of an
append-only index file format (with periodic compaction) would be close
to optimal.  For the append-only format, k is the number of files that
have been changed since the index was last rewritten (including
duplicates if a file has been changed more than once).  Supposing that
the index is compacted on branch changes and "occasionally" when k grows
too large, I have the feeling that k will typically be quite small in
comparison to n, especially in the huge repositories that need this
optimization.

Remember that a full index rewrite/compaction will only take as long as
it currently takes for *any* change to the index (which is not *that*
terrible).  It would also be easy to estimate the size of k and n on
every operation.  Therefore, if an operation is expected to force a
large fraction of index file entries to be invalidated, it is OK for it
to force an index compaction to keep subsequent read operations fast.
And even if, once in a blue moon, somebody changes *all* of the files in
his huge repository while somehow evading compaction, the time to read
the full index would still only be a factor of two slower than the
current design.  (Granted, the time to read a single entry in this
scenario would be much longer than the log(n) that would be possible
given a pure-tree-based design.)

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: [GSoC] Designing a faster index format
  2012-04-03  8:51                             ` Michael Haggerty
@ 2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
  2012-04-03 19:07                               ` Thomas Gummerer
  1 sibling, 0 replies; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-03 12:28 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Thomas Gummerer, Junio C Hamano, Thomas Rast, git

On Tue, Apr 3, 2012 at 3:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> -- Proposed solution --
>> The proposed solution is to redesign the index to a B-tree based format. This
>> allows changes to the index in O(log(n)) time, with n being the number of
>> entries in the index.
>
> I thought that the index lock currently only blocks writers, not readers
> (am I wrong?).

I think you are right.

> So given that you want to be able to mutate the index
> file without rewriting the whole file, it seems to me that you have to
> pick from one of these alternatives:
>
> 1. Change the locking semantics so that readers also block when the file
> is locked.  This choice would have some important consequences: (a)
> readers will also have to obtain a lock before starting to read.  (b) to
> avoid deadlock, it will become crucial that the lock is never held
> across the execution of any other git (sub-)commands that might want to
> read the index.
>
> 2. Implement a file format that can be read even while it is being
> mutated.  If so, please explain the data file format in more detail; in
> particular, how do you plan to mutate the file in a way that does not
> disturb readers?  How do you plan to read the whole index efficiently (I
> imagine that reading the whole index will remain a frequent operation)?

Copy-on-write B-trees, aka btrfs. Does anybody like to go that route ;)

> I encourage you to include an analysis of the number of disk seeks when
> you are analyzing the cost of read/write operations on the index.  This
> will have a strong effect on the time for working with the index when
> the disk cache is cold.  The current index requires O(1) seeks for
> reading and writing, which I believe is a big part of the reason that
> the current read-the-whole-index/write-the-whole-index design performs
> excellently despite the amount of data that it is touching.
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-04-03  8:51                             ` Michael Haggerty
  2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
@ 2012-04-03 19:07                               ` Thomas Gummerer
  2012-04-03 20:15                                 ` david
  2012-04-05 10:43                                 ` Michael Haggerty
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gummerer @ 2012-04-03 19:07 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Nguyen Thai Ngoc Duy, Junio C Hamano, Thomas Rast, git


On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote:

> On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
>> And here is the proposal:
>> Designing a faster index format
>> 
>> -- Problem --
>> The current git index is pretty slow when working with large git repositories,
>> because the whole index has to be rewritten for nearly every operation. For
>> example for adding a single file with git add, even though logically only one 
>> single blob sha-1 is changed in the index, git has to recompute the hash over 
>> the whole index and rewrite the index file. In addition to that the speed up 
>> for writing the index can not come to the expense of the read operation, since 
>> it is executed more often throughout a normal developers day compared to
>> writing the index. The current index also doesn't allow simple partial reading,
>> which could help speed up some commands, like git status and git diff, which 
>> often don't need the whole index.
>> 
>> -- Proposed solution --
>> The proposed solution is to redesign the index to a B-tree based format. This
>> allows changes to the index in O(log(n)) time, with n being the number of
>> entries in the index. 
> 
> I thought that the index lock currently only blocks writers, not readers
> (am I wrong?).  So given that you want to be able to mutate the index
> file without rewriting the whole file, it seems to me that you have to
> pick from one of these alternatives:
> 
> 1. Change the locking semantics so that readers also block when the file
> is locked.  This choice would have some important consequences: (a)
> readers will also have to obtain a lock before starting to read.  (b) to
> avoid deadlock, it will become crucial that the lock is never held
> across the execution of any other git (sub-)commands that might want to
> read the index.
> 
> 2. Implement a file format that can be read even while it is being
> mutated.  If so, please explain the data file format in more detail; in
> particular, how do you plan to mutate the file in a way that does not
> disturb readers?  How do you plan to read the whole index efficiently (I
> imagine that reading the whole index will remain a frequent operation)?

I did not think about this at first, but I will use the first alternative. The
most important thing about this project is speed, and it will be best preserved
with the first alternative. I am thinking about two different kinds of locks: a
read lock, which blocks writing but not other reading processes, and a write
lock, which blocks other processes from both reading and writing.
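These two lock kinds map onto classic shared/exclusive file locks: LOCK_SH for
readers (any number may hold it at once), LOCK_EX for the single writer. A
flock-based sketch, not git's actual locking code:

```python
import fcntl, os, tempfile

lockfile = os.path.join(tempfile.mkdtemp(), "index.lock")

r1 = open(lockfile, "w")
r2 = open(lockfile, "w")
fcntl.flock(r1, fcntl.LOCK_SH)           # reader 1
fcntl.flock(r2, fcntl.LOCK_SH)           # reader 2: shared locks coexist

w = open(lockfile, "w")
try:
    fcntl.flock(w, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked = False
except BlockingIOError:
    blocked = True                       # writer waits while readers hold LOCK_SH
assert blocked

fcntl.flock(r1, fcntl.LOCK_UN)
fcntl.flock(r2, fcntl.LOCK_UN)
fcntl.flock(w, fcntl.LOCK_EX | fcntl.LOCK_NB)   # readers gone, writer proceeds
```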

> I encourage you to include an analysis of the number of disk seeks when
> you are analyzing the cost of read/write operations on the index.  This
> will have a strong effect on the time for working with the index when
> the disk cache is cold.  The current index requires O(1) seeks for
> reading and writing, which I believe is a big part of the reason that
> the current read-the-whole-index/write-the-whole-index design performs
> excellently despite the amount of data that it is touching.

Seeking took: 429430 (cold)
Reading took: 581 (hot)

Reading took: 1031357 (cold)
Reading took: 53850 (hot)

Writing took: 524909

This is just from a short test program I wrote. It just reads the index and
seeks over it (obviously not directly one after another, but in different
scripts). The writing time is for writing the index that was read back to disk.
And this is with a highly exaggerated number of seeks (1000).

Also to take into consideration: the disk cache will (nearly) never be
cold, since git always reads the index first.

> 
>> [...]
>> - Append-only data structure
>> An append-only data structure will allow for changes to the index in O(k) time, 
>> with k being the number of files that changed. To reach this goal, the first 
>> part of the index will be sorted and changes will always be written to the end, 
>> in the order in which the changes occured. This last part of the index will be 
>> merged with the rest using a heuristic, which will be the execution of git 
>> commit and git status.
>> 
>> To make sure the index isn't corrupted, without calculating the sha1 hash for
>> the whole index file every time something is changed, the hash is always
>> calculated for the whole index when merging, but when only a single entry is 
>> changed the sha-1 hash is only calculated for the last change. This will 
>> increase the cost for reading the index to log(n) + k * log(k) where n is the 
>> number of entries in the sorted part of the index and k is the number of entries
>> in the unsorted part of the index, which will have to be merged with the rest 
>> of the index.
> 
> I don't understand this analysis of the reading time.  I suppose you are
> assuming that you want to read the status of a single file.  But in that
> case, it is enough to find the entry in the old index (O(log(n))
> assuming some sort of tree structure) plus do a linear scan through the
> unsorted entries (i.e., O(k), not O(k log(k))).

The way git currently operates, it always reads the whole index, making it
necessary to merge the unsorted entries with the sorted part. Thinking about it,
it would even be O(k log(n)), because the appended part is unsorted.
O(log(n)) + O(k) would be the complexity for loading only a single entry from
the index.

> 
>> [...]
>> This [append-only] idea was dropped, because as mentioned in the problem description reads
>> are more common then writes and therefore trading write speed for read speed
>> is not a good tradeoff.
> 
> The amount of read speed that would have to be sacrificed depends on the
> size of k and n.  Under the assumption that k << n, the read speed of an
> append-only index file format (with periodic compaction) would be close
> to optimal.  For the append-only format, k is the number of files that
> have been changed since the index was last rewritten (including
> duplicates if a file has been changed more than once).  Supposing that
> the index is compacted on branch changes and "occasionally" when k grows
> too large, I have the feeling that k will typically be quite small in
> comparison to n, especially in the huge repositories that need this
> optimization.
> 
> Remember that a full index rewrite/compaction will only take as long as
> it currently takes for *any* change to the index (which is not *that*
> terrible).  It would also be easy to estimate the size of k and n on
> every operation.  Therefore, if an operation is expected to force a
> large fraction of index file entries to be invalidated, it is OK for it
> to force an index compaction to keep subsequent read operations fast.
> And even if, once in a blue moon, somebody changes *all* of the files in
> his huge repository while somehow evading compaction, the time to read
> the full index would still only be a factor of two slower than the
> current design.  (Granted, the time to read a single entry in this
> scenario would be much longer than the log(n) that would be possible
> given a pure-tree-based design.)

Let's take another scenario. Someone changes a lot of files and then executes
some commands which would usually only read the index. In that case you
would need the load time (O(n)) plus the time for merging (O(k*log(n))), as
described above, plus the time for writing the index (O(n)), which usually
wouldn't be necessary.

In addition to that it would also be more complicated and slower to realize
the partial loading, which would be beneficial to a lot of commands. Taking that
into account, the tree-based structure will be more future proof, and faster (if
there are enough changes in a repository) than the append-only structure.


* Re: [GSoC] Designing a faster index format
  2012-04-03 19:07                               ` Thomas Gummerer
@ 2012-04-03 20:15                                 ` david
  2012-04-04 20:05                                   ` Thomas Gummerer
  2012-04-05 10:43                                 ` Michael Haggerty
  1 sibling, 1 reply; 36+ messages in thread
From: david @ 2012-04-03 20:15 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Michael Haggerty, Nguyen Thai Ngoc Duy, Junio C Hamano, Thomas Rast, git

On Tue, 3 Apr 2012, Thomas Gummerer wrote:

> Lets take another scenario. Someone changes a lot of files and then executes
> some commands which would usually only read the index. In that case either you
> would need the load time (O(n)) plus the time for merging (O(k*log(n)), as
> described above plus the time for writing the index (O(n)) which usually wouldn't
> be necessary.

note that in a large repository you are not that likely to change a very 
large percentage of the files with one update. yes, there are some use 
cases where this happens, but in general, the number of files changed in a 
changeset grows at a much slower rate than the total number of files in a 
repository. As projects get big they tend to get fewer across-the-board 
changes.

David Lang


* Re: [GSoC] Designing a faster index format
  2012-04-03 20:15                                 ` david
@ 2012-04-04 20:05                                   ` Thomas Gummerer
  2012-04-05 14:39                                     ` Noel Grandin
  2012-04-05 21:49                                     ` Thomas Rast
  0 siblings, 2 replies; 36+ messages in thread
From: Thomas Gummerer @ 2012-04-04 20:05 UTC (permalink / raw)
  To: david
  Cc: Michael Haggerty, Nguyen Thai Ngoc Duy, Junio C Hamano, Thomas Rast, git

This is another small revision of my proposal draft, taking into consideration
Michael Haggerty's and David Lang's comments.

First I would again like to mention the people who helped me discuss the
ideas and gave me valuable input.

-- Credits --
Big thanks for discussing the ideas with me on IRC goes to charon, jast,
mhagger, shruggar, GitZilla, andrew_sayers and barrbrain. Hope I didn't forget 
anyone. Thanks also to those who discussed my ideas with me on the mailing list
(Nguyen Thai Ngoc Duy, Thomas Rast, Junio C Hamano, Michael Haggerty, 
David Lang)

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example, when adding a single file with git add, even though logically only one
single blob sha-1 changes in the index, git has to recompute the hash over
the whole index and rewrite the index file. In addition, any speed-up for
writing the index cannot come at the expense of the read operation, since
reading is executed far more often throughout a normal developer's day than
writing the index. The current index also doesn't allow simple partial reading,
which could help speed up some commands, like git status and git diff, that
often don't need the whole index.

-- Proposed solution --
The proposed solution is to redesign the index to a B-tree based format. This
allows changes to the index in O(log(n)) time, with n being the number of
entries in the index. 

The sha1 hash, used to verify the index isn't corrupted, will have to be
computed over O(log(s)) data in the worst case, where s is the size of the
index, which will be bigger than the number of entries because of the
structure of the b-tree.

The new index format will also save on the size of each entry, by storing each
directory path only once and sharing it among all files in that directory. This
will reduce the amount of data we have to calculate the hash over, both when
reading and when writing the index, giving faster operations especially on
repositories with deep directory structures.

The tree structure will also allow for fast partial loading of the index. Since
every bucket of the tree will have its own hash, the hash will only need to be
calculated for the parts we load. This reduces the amount of data for which the
hash will need to be calculated. Commands like git status and git diff will
benefit from this. The benefit could be fully explored by an implementation with
inotify, of which Thomas Rast created a proof of concept.
(https://github.com/trast/git/commit/6c9825fdca76d01fb5a83923558831743ae477bc)

In order to be able to only rewrite the parts of the index that really changed,
the current lock file structure will have to be changed. The new lock file
will keep a journal of the changes, to be able to recover in case of a crash.
Deleting the lockfile "commits" the changes. To ensure that readers never
read a partially mutated file, reading the index will respect the write lock.
Readers will also hold a "read lock", which is respected by operations that
write to the index, but can be ignored by operations that read the index, such
that simultaneous reads are still possible.

The in-memory structure will be modified in two steps. In the first step it
will only be modified to keep track of the changes, in order to be able to only
write the changes to disk and eliminate the need to rewrite the whole tree. In 
the second step the in-memory structure will be changed to reflect the tree 
format, getting the full benefits from the new format.

To ensure backward compatibility, git shall keep the ability to read version
2/3 of the index. The user shall also have the possibility to configure git to
write the index in either the new or the old format. While this will produce
some code overhead, it will make the life of git users who don't use core git
exclusively easier in the transition phase. If the user sets the write format to
the new format and the repository is an already existing version 2/3 repository,
the old index will be transformed to the new format. Transformations in the
other direction will also be possible, since the in-memory format is the same.

Once the in-memory structure is changed, making git backward compatible will
still be possible, even though it will come at the expense of the 
reading/writing time of the old index, since the tree will have to be 
constructed in memory from the on disk format, and transformed back to the flat
format when writing it back to disk.

To make the project feasible for Google Summer of Code, the in-memory structure
will only be modified to keep track of the changes, without making it
tree-structured yet, and partial loading will not be implemented.


-- Solutions that were also considered --
These solutions will not be described in as much detail as the tree structure,
but they were also considered in the process of choosing the best index format
and are described below.

- Append-only data structure
An append-only data structure will allow for changes to the index in O(k) time,
with k being the number of files that changed. To reach this goal, the first
part of the index will be sorted and changes will always be written to the end,
in the order in which they occurred. This last part of the index will be merged
with the rest based on a heuristic, triggered by the execution of git commit
or git status.
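
As a concrete illustration of the scheme described above, here is a toy
Python sketch (the class and all names are invented; this is not git code).
Updates append to the unsorted tail in O(1) per change, and the merge
heuristic later folds the tail back into the sorted part:

```python
import bisect

class AppendOnlyIndex:
    """Toy model of the proposed append-only index format."""

    def __init__(self, entries=None):
        # sorted part: (path, sha1) pairs, kept sorted by path
        self.sorted_part = sorted(entries or [])
        # unsorted tail: changes appended in the order they occurred
        self.tail = []

    def update(self, path, sha1):
        # O(1): just append the change, no rewrite of the sorted part
        self.tail.append((path, sha1))

    def merge(self):
        # heuristic merge point (e.g. run by git commit / git status):
        # fold the tail into the sorted part, later changes win
        merged = dict(self.sorted_part)
        merged.update(self.tail)
        self.sorted_part = sorted(merged.items())
        self.tail = []

    def lookup(self, path):
        # newest tail entry wins over the sorted part
        for p, sha1 in reversed(self.tail):
            if p == path:
                return sha1
        i = bisect.bisect_left(self.sorted_part, (path,))
        if i < len(self.sorted_part) and self.sorted_part[i][0] == path:
            return self.sorted_part[i][1]
        return None
```

Lookups consult the tail first so that the newest change wins, mirroring how
the appended part overrides the sorted part until the next merge.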

To make sure the index isn't corrupted, without calculating the SHA-1 hash for
the whole index file every time something is changed, the hash is only
calculated for the whole index when merging; when only a single entry is
changed, the SHA-1 hash is calculated just for the last change. This will
increase the cost of reading the index to log(n) + k * log(n), where n is the
number of entries in the sorted part of the index and k is the number of
entries in the unsorted part of the index, which will have to be merged with
the rest of the index.

The index format shall also reduce the file size by storing each path only
once, where currently the full path is stored for every file. This can make a
big difference especially for repositories with deep directory structures
(e.g. in the webkit repository this change can save about 10MB of index size).
In addition, it gives the index a structure which is easier to search through,
except for the appended part; that part should never grow large enough to make
a linear search through it costly.
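
A small sketch of where the saving comes from, with an invented two-level
layout and made-up paths (the exact byte counts depend on the real on-disk
encoding): the directory path is stored once per directory instead of once
per file.

```python
# Flat layout (current index): each entry repeats its full path.
flat = [
    ("src/deep/nested/dir/a.c", "sha-a"),
    ("src/deep/nested/dir/b.c", "sha-b"),
    ("src/deep/nested/dir/c.c", "sha-c"),
]

# Directory-grouped layout: the directory path is stored once,
# entries keep only their final component.
grouped = {
    "src/deep/nested/dir": [("a.c", "sha-a"), ("b.c", "sha-b"), ("c.c", "sha-c")],
}

def flat_path_bytes(entries):
    # total path bytes when every entry carries its full path
    return sum(len(path) for path, _ in entries)

def grouped_path_bytes(tree):
    # total path bytes when the directory prefix is stored only once
    return sum(len(d) + sum(len(name) for name, _ in files)
               for d, files in tree.items())

# The deeper the directories and the more files per directory,
# the bigger the saving (69 vs 28 bytes in this toy example).
assert grouped_path_bytes(grouped) < flat_path_bytes(flat)
```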

With this index format the lock file could be a simple empty file, since if an
operation fails, only the last entry would be corrupted, making it easy to
recover.

The current in-memory structure of git would only have to be slightly changed,
adding a flag or keeping a list of the changed files. The rest of it could be
changed in another step, to better represent the new structure of the index.

To ensure backward compatibility, git shall keep the ability to read version
2/3 of the index. The user shall also have the possibility to configure git to
write the index in either the new or the old format. While this will add some
code overhead, it will ease the transition for git users who don't use core
git exclusively. If the user sets the write format to the new format and the
repository is an already existing version 2/3 repository, the old index will
be converted to the new format.

This idea was dropped because, as mentioned in the problem description, reads
are more common than writes, so sacrificing read speed to gain write speed is
not a good tradeoff. The slowdown depends mostly on the size of the appended
(unsorted) part, which in most cases will be much smaller than the sorted
part, but in other cases might grow quite big. I think it is preferable to
have a structure that is usually faster on the read operation and whose read
speed does not depend on the way the user works; this is worth the slight
trade-off in write speed. In addition, the tree-structure will also allow
faster partial loading, which makes it more future-proof.

- Database format
A database format as index structure will allow for changes to the index in
O(log(n)) time for a single change, with n always being the number of entries
in the index, assuming a B-tree index on the path in the database.

The check for index corruption could be left to the database, which would
have about the same cost as the check in the tree-structure, for both reading
and writing. Partial loading with the right indices would also have the same
cost as the tree-structure, requiring only a simple select query.

The drawback of this solution, however, would be the introduction of a
database library (one could also be written from scratch, but that would
certainly be overkill). Another drawback is that such an index would be harder
to read for programs like libgit2 and jgit, and the codebase certainly
wouldn't get any cleaner.

- Padded structure
Another index format that was taken into consideration was a padded
structure. Basically this would be an append-only-like structure, but split
into sections, where every section leaves some space at the end for appending
changes. This leaves us with a complexity of O(k), where k is the number of
sections.

This structure would allow faster partial loading, if the sections are split
in the right way.

The structure was however only briefly taken into consideration, since it
would have close to no advantages (except for the partial loading) compared
to the append-only structure, would make the index file larger due to the
space reserved for appending, and would share the drawbacks of the
append-only structure.

-- Timeline --
24/04 - 01/05: Document the new index format.
02/05 - 21/05: Map the current internal structure to the new index format.
22/05 - 07/06: Change the current in-memory structure to keep track of the
changed files.
08/06 - 16/06: Write the index to disk in both the old and the new format
depending on the choice of the user and make sure only the changed parts are 
really written to disk in the new format.
17/06 - 21/07: Parse the index from disk to the current in-memory format.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 10/08: Read the new structure and map it to the current in-memory format.
Make sure the old format still gets read correctly.
11/08 - 13/08: Test the new index and profile the gains compared to the old
format.

-- Why git --
I've been using git for about 2-3 years and have wanted to contribute to it
for a while, but couldn't find the time to do it. I would also like to
continue contributing once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old
from Italy. I'm currently a 3rd year Bachelor student in Applied Computer 
Science at the Free University of Bolzano. I started programming in High School 
about 8 years ago with Pascal and then learned C and Java. For some of my
projects you can visit my homepage (http://tgummerer.com/projects), most of them
were for university and some personal projects I did in my free time. My blog is
also on the same homepage, but is not really active. Unfortunately I haven't
yet been able to participate in any bigger open source project, although I've
been interested in doing so basically since I started programming.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GSoC] Designing a faster index format
  2012-04-03 19:07                               ` Thomas Gummerer
  2012-04-03 20:15                                 ` david
@ 2012-04-05 10:43                                 ` Michael Haggerty
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Haggerty @ 2012-04-05 10:43 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: Nguyen Thai Ngoc Duy, Junio C Hamano, Thomas Rast, git

On 04/03/2012 09:07 PM, Thomas Gummerer wrote:
> On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote:
>> On 04/02/2012 11:02 PM, Thomas Gummerer wrote:
>>> - Append-only data structure
>>> [...]
>>> To make sure the index isn't corrupted, without calculating the sha1 hash for
>>> the whole index file every time something is changed, the hash is always
>>> calculated for the whole index when merging, but when only a single entry is 
>>> changed the sha-1 hash is only calculated for the last change. This will 
>>> increase the cost for reading the index to log(n) + k * log(k) where n is the 
>>> number of entries in the sorted part of the index and k is the number of entries
>>> in the unsorted part of the index, which will have to be merged with the rest 
>>> of the index.
>>
>> I don't understand this analysis of the reading time.  I suppose you are
>> assuming that you want to read the status of a single file.  But in that
>> case, it is enough to find the entry in the old index (O(log(n))
>> assuming some sort of tree structure) plus do a linear scan through the
>> unsorted entries (i.e., O(k), not O(k log(k))).
> 
> The current way git operates it always reads the whole index, making it necessary
> to merge the unsorted entries with the sorted part. Thinking about it it would even
> be O(k log(n)), because the appended part is unsorted.
> O(log(n)) + O(k) would be the complexity for loading only a single entry from the
> index.

I was confused because in your original mail you seemed to claim that
reading the whole sorted part of the index scales like O(log(n)), whereas
it certainly scales at least like O(n).

To read the whole index if using an append-only data structure, I would
do the following:

1. Read the file header to find where the addenda begin: one seek plus O(1).
2. Read the addenda in order (I assume each addendum to be sorted on
disk), and merge-sort the addenda together, discarding the earlier of
any duplicates: one seek plus O(k) I/O plus O(k log k) computation (this
is the worst case, if each addendum contains a single file).
3. Read the sorted part of the file in order, while merging it together
with the combined addenda: one seek plus O(n) I/O plus O(n + k) computation.

Total: 3 seeks plus O(n+k) I/O plus O(n + k log(k)) computation.
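
The three steps above can be sketched as follows (a Python illustration of
the merge logic only, not of the on-disk parsing; `heapq.merge` stands in
for the merge-sort of step 2, and all names are invented):

```python
import heapq

def _tagged(age, addendum):
    # bind the addendum's age eagerly so each generator keeps its own tag
    return ((path, age, data) for path, data in addendum)

def read_whole_index(sorted_part, addenda):
    """sorted_part: (path, data) pairs sorted by path.
    addenda: list of addendum lists, each sorted by path, oldest first."""
    # Step 2: merge-sort the addenda together (worst case O(k log k),
    # when each addendum contains a single file). Equal paths come out
    # oldest-first, so the dict below keeps the newest duplicate.
    combined = {}
    for path, _, data in heapq.merge(
            *(_tagged(i, a) for i, a in enumerate(addenda))):
        combined[path] = data
    # Step 3: merge the sorted base with the combined addenda, O(n + k);
    # here dict + sort stands in for the streaming merge over the file.
    result = dict(sorted_part)
    result.update(combined)
    return sorted(result.items())
```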

Whereas for a B-tree, it is hard to estimate the complexity because you
have provided very little detail about how you want to lay the data
structure out on disk.  But presumably the number of seeks will be
significantly larger.  And if you are not careful, the number of seeks
will approach the number of nodes in the index O(n) or perhaps the
number of added nodes (not files!) which could go something like O(k
log(k)).

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: [GSoC] Designing a faster index format
  2012-04-04 20:05                                   ` Thomas Gummerer
@ 2012-04-05 14:39                                     ` Noel Grandin
  2012-04-05 21:49                                     ` Thomas Rast
  1 sibling, 0 replies; 36+ messages in thread
From: Noel Grandin @ 2012-04-05 14:39 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: david, Michael Haggerty, Nguyen Thai Ngoc Duy, Junio C Hamano,
	Thomas Rast, git



On 2012-04-04 22:05, Thomas Gummerer wrote:
> -- Proposed solution --
> The proposed solution is to redesign the index to a B-tree based format. This
> allows changes to the index in O(log(n)) time, with n being the number of
> entries in the index.
>
Sounds like you're re-inventing one of these:
http://code.google.com/p/leveldb/
http://code.google.com/p/high-concurrency-btree/


* Re: [GSoC] Designing a faster index format
  2012-04-04 20:05                                   ` Thomas Gummerer
  2012-04-05 14:39                                     ` Noel Grandin
@ 2012-04-05 21:49                                     ` Thomas Rast
  2012-04-06  3:22                                       ` Shawn Pearce
       [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Rast @ 2012-04-05 21:49 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: david, Michael Haggerty, Nguyen Thai Ngoc Duy, Junio C Hamano,
	Thomas Rast, git

Thomas Gummerer <italyhockeyfeed@gmail.com> writes:

> -- Proposed solution --
> The proposed solution is to redesign the index to a B-tree based format. This
> allows changes to the index in O(log(n)) time, with n being the number of
> entries in the index. 

> -- Solutions that were also considered --
> - Append-only data structure
> - Database format
> - Padded structure

This is quite complete already, which I think is great, but it's still
missing one "obvious" approach: a directory-tree based layout that uses
"flat" storage.  That is, the entries grouped by directory and thus
arranged into the "natural" tree, so as to allow parsing only part of
it.  But not pulling any tricks to make it easy to change; a nontrivial
change would mean a rewrite.  How good do you think that could be?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-04-05 21:49                                     ` Thomas Rast
@ 2012-04-06  3:22                                       ` Shawn Pearce
  2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
       [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
  1 sibling, 1 reply; 36+ messages in thread
From: Shawn Pearce @ 2012-04-06  3:22 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Thomas Gummerer, david, Michael Haggerty, Nguyen Thai Ngoc Duy,
	Junio C Hamano, Thomas Rast, git

On Thu, Apr 5, 2012 at 14:49, Thomas Rast <trast@inf.ethz.ch> wrote:
> Thomas Gummerer <italyhockeyfeed@gmail.com> writes:
>
>> -- Proposed solution --
>> The proposed solution is to redesign the index to a B-tree based format. This
>> allows changes to the index in O(log(n)) time, with n being the number of
>> entries in the index.
>
>> -- Solutions that were also considered --
>> - Append-only data structure
>> - Database format
>> - Padded structure
>
> This is quite complete already, which I think is great, but it's still
> missing one "obvious" approach: a directory-tree based layout that uses
> "flat" storage.  That is, the entries grouped by directory and thus
> arranged into the "natural" tree, so as to allow parsing only part of
> it.  But not pulling any tricks to make it easy to change; a nontrivial
> change would mean a rewrite.  How good do you think that could be?

I have been wondering this myself. Aren't most updates to the index
just updating the stat information of an existing entry?

If so we could structure the index as flat lists for each directory
similar to a canonical tree, but with a wider field to hold not just
the SHA-1 but also the stat information of each file. If the entry is
just the component name ("foo.c" and not "src/foo.c") and the SHA-1
and stat data, you can probably protect the entire entry with a CRC-32
for each entry. Updates can be performed in place by taking the write
lock with index.lock as an empty file, then overwriting the SHA-1 and
stat field followed, by updating the CRC-32. Readers that see the
wrong CRC-32 for an entry can sleep for a short period, retry the
read, and fail after some number of attempts if they cannot get a
valid read of that entry.

Adding a new file or deleting an existing file from the index would
require a full rewrite.

Within a single tree/directory entry, it probably doesn't have to be
binary searchable. Canonical trees aren't. Linear scans through a
directory is OK, so long as the scans are broken up by the directory
tree structure just like they are in canonical trees.

Dealing with the conflict stages during merges (1, 2, 3) could be
handled by appending the conflict data at the end of the index file,
when conflicts are resolved this tail region could be truncated back
to the real end of the file. A bit could be set on the normal entry in
the trees to denote there is a conflict, and additional stage data is
expected in the tail region of the file.
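
A rough sketch of the per-entry CRC idea; the record layout here (20-byte
SHA-1, 8-byte mtime, CRC-32) is invented for illustration and much smaller
than a real index entry:

```python
import struct
import zlib

# Invented fixed-width record: 20-byte SHA-1, 8-byte mtime, 4-byte CRC-32.
ENTRY = struct.Struct("<20sQI")

def pack_entry(sha1, mtime):
    body = struct.pack("<20sQ", sha1, mtime)
    return body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)

def write_in_place(index, offset, sha1, mtime):
    # Overwrite the SHA-1 and stat field, then the CRC-32, in place.
    index[offset:offset + ENTRY.size] = pack_entry(sha1, mtime)

def read_entry(index, offset, retries=3):
    # Readers that see a wrong CRC retry; a writer may be mid-update.
    for _ in range(retries):
        # copy the record out first, so it cannot mutate between the
        # CRC check and the field access
        raw = bytes(index[offset:offset + ENTRY.size])
        sha1, mtime, crc = ENTRY.unpack(raw)
        if zlib.crc32(raw[:-4]) & 0xFFFFFFFF == crc:
            return sha1, mtime
        # in real code: sleep / back off here before retrying
    raise IOError("entry failed CRC check after retries")
```

A real entry would carry the full stat data; the retry loop is where the
backoff would go.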


* Re: [GSoC] Designing a faster index format
  2012-04-06  3:22                                       ` Shawn Pearce
@ 2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
  2012-04-06 15:24                                           ` Thomas Rast
  0 siblings, 1 reply; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-06 15:11 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Thomas Rast, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, Thomas Rast, git

On Fri, Apr 6, 2012 at 10:22 AM, Shawn Pearce <spearce@spearce.org> wrote:
> On Thu, Apr 5, 2012 at 14:49, Thomas Rast <trast@inf.ethz.ch> wrote:
>> This is quite complete already, which I think is great, but it's still
>> missing one "obvious" approach: a directory-tree based layout that uses
>> "flat" storage.  That is, the entries grouped by directory and thus
>> arranged into the "natural" tree, so as to allow parsing only part of
>> it.  But not pulling any tricks to make it easy to change; a nontrivial
>> change would mean a rewrite.  How good do you think that could be?
>
> I have been wondering this myself. Aren't most updates to the index
> just updating the stat information of an existing entry?
>
> If so we could structure the index as flat lists for each directory
> similar to a canonical tree, but with a wider field to hold not just
> the SHA-1 but also the stat information of each file. If the entry is
> just the component name ("foo.c" and not "src/foo.c") and the SHA-1
> and stat data, you can probably protect the entire entry with a CRC-32
> for each entry. Updates can be performed in place by taking the write
> lock with index.lock as an empty file, then overwriting the SHA-1 and
> stat field followed, by updating the CRC-32. Readers that see the
> wrong CRC-32 for an entry can sleep for a short period, retry the
> read, and fail after some number of attempts if they cannot get a
> valid read of that entry.

But does that mean the reader can have some old entries (because the
writer has not updated them yet by the time of reading) mixed with new
ones, i.e. inconsistent data?


> Adding a new file or deleting an existing file from the index would
> require a full rewrite.
>
> Within a single tree/directory entry, it probably doesn't have to be
> binary searchable. Canonical trees aren't. Linear scans through a
> directory is OK, so long as the scans are broken up by the directory
> tree structure just like they are in canonical trees.
>
> Dealing with the conflict stages during merges (1, 2, 3) could be
> handled by appending the conflict data at the end of the index file,
> when conflicts are resolved this tail region could be truncated back
> to the real end of the file. A bit could be set on the normal entry in
> the trees to denote there is a conflict, and additional stage data is
> expected in the tail region of the file.



-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
@ 2012-04-06 15:24                                           ` Thomas Rast
  2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Rast @ 2012-04-06 15:24 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Shawn Pearce, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, Thomas Rast, git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Fri, Apr 6, 2012 at 10:22 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Thu, Apr 5, 2012 at 14:49, Thomas Rast <trast@inf.ethz.ch> wrote:
>>> This is quite complete already, which I think is great, but it's still
>>> missing one "obvious" approach: a directory-tree based layout that uses
>>> "flat" storage.  That is, the entries grouped by directory and thus
>>> arranged into the "natural" tree, so as to allow parsing only part of
>>> it.  But not pulling any tricks to make it easy to change; a nontrivial
>>> change would mean a rewrite.  How good do you think that could be?
>>
>> I have been wondering this myself. Aren't most updates to the index
>> just updating the stat information of an existing entry?
>>
>> If so we could structure the index as flat lists for each directory
>> similar to a canonical tree, but with a wider field to hold not just
>> the SHA-1 but also the stat information of each file. If the entry is
>> just the component name ("foo.c" and not "src/foo.c") and the SHA-1
>> and stat data, you can probably protect the entire entry with a CRC-32
>> for each entry. Updates can be performed in place by taking the write
>> lock with index.lock as an empty file, then overwriting the SHA-1 and
>> stat field followed, by updating the CRC-32. Readers that see the
>> wrong CRC-32 for an entry can sleep for a short period, retry the
>> read, and fail after some number of attempts if they cannot get a
>> valid read of that entry.
>
> But does that mean the reader can have some old entries (because the
> writer has not updated them yet by the time of reading) mixed with new
> ones, i.e. inconsistent data?

That can be fixed with another CRC or hash covering the CRCs for the
entries, so that a reader can notice a partial update.

But even so: do we make any promises that (say) git-add is atomic in the
sense that a reader always gets the before-update results or the
after-update results?  Non-builtins (e.g. git add -p) may make small
incremental updates to the index, so they wouldn't be atomic anyway.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: [GSoC] Designing a faster index format
  2012-04-06 15:24                                           ` Thomas Rast
@ 2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
  2012-04-06 17:13                                               ` Shawn Pearce
  0 siblings, 1 reply; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-06 15:44 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Shawn Pearce, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, git

On Fri, Apr 6, 2012 at 10:24 PM, Thomas Rast <trast@student.ethz.ch> wrote:
> But even so: do we make any promises that (say) git-add is atomic in the
> sense that a reader always gets the before-update results or the
> after-update results?  Non-builtins (e.g. git add -p) may make small
> incremental updates to the index, so they wouldn't be atomic anyway.

Take git-checkout. I'm ok with it writing to worktree all old entries,
or all new ones, but please not a mix.
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
@ 2012-04-06 17:13                                               ` Shawn Pearce
  2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 36+ messages in thread
From: Shawn Pearce @ 2012-04-06 17:13 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, git

On Fri, Apr 6, 2012 at 08:44, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Fri, Apr 6, 2012 at 10:24 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>> But even so: do we make any promises that (say) git-add is atomic in the
>> sense that a reader always gets the before-update results or the
>> after-update results?  Non-builtins (e.g. git add -p) may make small
>> incremental updates to the index, so they wouldn't be atomic anyway.
>
> Take git-checkout. I'm ok with it writing to worktree all old entries,
> or all new ones, but please not a mix.

Why, what is the big deal? git-checkout has already written the file
to the local working tree. It's now just reflecting the updated stat
information in the index. If it does that after each file was touched,
and it aborts, you still have a partially updated working tree and the
index will show some updated files as stat clean, but staged relative
to HEAD. I don't think that is any better or worse than the current
situation where the working tree is shown as locally dirty but the
index has no staged files. Either way you have an aborted checkout to
recover from by retrying, or git reset --hard HEAD.

In the retry case, checkout actually has less to do because the files
it already cleanly updated match where its going, and thus it doesn't
have to touch them again.


* Re: [GSoC] Designing a faster index format
  2012-04-06 17:13                                               ` Shawn Pearce
@ 2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
  2012-04-06 17:56                                                   ` Shawn Pearce
  0 siblings, 1 reply; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-06 17:23 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Thomas Rast, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, git

On Sat, Apr 7, 2012 at 12:13 AM, Shawn Pearce <spearce@spearce.org> wrote:
> On Fri, Apr 6, 2012 at 08:44, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> On Fri, Apr 6, 2012 at 10:24 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>>> But even so: do we make any promises that (say) git-add is atomic in the
>>> sense that a reader always gets the before-update results or the
>>> after-update results?  Non-builtins (e.g. git add -p) may make small
>>> incremental updates to the index, so they wouldn't be atomic anyway.
>>
>> Take git-checkout. I'm ok with it writing to worktree all old entries,
>> or all new ones, but please not a mix.
>
> Why, what is the big deal? git-checkout has already written the file
> to the local working tree. Its now just reflecting the updated stat
> information in the index. If it does that after each file was touched,
> and it aborts, you still have a partially updated working tree and the
> index will show some updated files as stat clean, but staged relative
> to HEAD. I don't think that is any better or worse than the current
> situation where the working tree is shown as locally dirty but the
> index has no staged files. Either way you have an aborted checkout to
> recover from by retrying, or git reset --hard HEAD.
>
> In the retry case, checkout actually has less to do because the files
> it already cleanly updated match where its going, and thus it doesn't
> have to touch them again.

OK, what about git-commit? If I read your description correctly, you
can update an entry's SHA-1 in place too. Running cache-tree on a
half-old, half-new index definitely creates a broken commit.

A command can also read (which does not require lock), update its
internal index, then lock and write. At that time, it may accidentally
overwrite whatever another command wrote while it was still preparing
the index in memory.
-- 
Duy


* Re: [GSoC] Designing a faster index format
  2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
@ 2012-04-06 17:56                                                   ` Shawn Pearce
  0 siblings, 0 replies; 36+ messages in thread
From: Shawn Pearce @ 2012-04-06 17:56 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Thomas Rast, Thomas Gummerer, david, Michael Haggerty,
	Junio C Hamano, git

On Fri, Apr 6, 2012 at 10:23, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Sat, Apr 7, 2012 at 12:13 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Fri, Apr 6, 2012 at 08:44, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>>> On Fri, Apr 6, 2012 at 10:24 PM, Thomas Rast <trast@student.ethz.ch> wrote:
>>>> But even so: do we make any promises that (say) git-add is atomic in the
>>>> sense that a reader always gets the before-update results or the
>>>> after-update results?  Non-builtins (e.g. git add -p) may make small
>>>> incremental updates to the index, so they wouldn't be atomic anyway.
>>>
>>> Take git-checkout. I'm ok with it writing to worktree all old entries,
>>> or all new ones, but please not a mix.
>>
>> Why, what is the big deal? git-checkout has already written the file
>> to the local working tree. Its now just reflecting the updated stat
>> information in the index. If it does that after each file was touched,
>> and it aborts, you still have a partially updated working tree and the
>> index will show some updated files as stat clean, but staged relative
>> to HEAD. I don't think that is any better or worse than the current
>> situation where the working tree is shown as locally dirty but the
>> index has no staged files. Either way you have an aborted checkout to
>> recover from by retrying, or git reset --hard HEAD.
>>
>> In the retry case, checkout actually has less to do because the files
>> it already cleanly updated match where its going, and thus it doesn't
>> have to touch them again.
>
> OK, what about git-commit? If I read your description correctly, you
> can update entry sha-1 in place  too.

Yes.

> Running cache-tree on half old
> half new index definitely creates a broken commit.

How is that possible? Each tree also has its own SHA-1 field. A
process trying to update a tree's SHA-1 will have to snapshot the
tree's contents from the index by copying the data into its own memory
buffer so it can compute the canonical tree data buffer, write the
object to the repository, and get the tree's SHA-1. It writes that
tree's SHA-1 back to the index as of that snapshot. If there were
concurrent updates at the same time as git commit is running, it's the
same race condition that already exists. You don't know exactly where
in the execution of `git commit` it takes the snapshot of the index
that it uses to make the commit by opening the file. Allowing in place
updates means the snapshot time within git commit expands to be a
larger portion of its running time.

Basically I would argue it is already not safe to be modifying the
index while git commit is running. You don't know if git commit has
already opened the index file, or will open it after the edit. The
only way to be sure right now is to make your own copy of the index
and use GIT_INDEX_FILE environment variable to make sure git commit
uses the exact index you want.

> A command can also read (which does not require lock), update its
> internal index, then lock and write. At that time, it may accidentally
> overwrite whatever another command wrote while it was still preparing
> the index in memory.

This hypothetical command already has the bug you mention. It should
be fixed no matter what we do with regards to the index format.

The *only* safe way to update the index and prevent losing
modifications made by another process is to lock the index *then* read
it, update, write back out. If you read before you take the write
lock, you can discard edits made by another process. This is
precisely the reason why the JGit library always opens, reads, then
closes the index anytime the process wants to access an entry. We need
to make sure we are viewing the correct current version. It's even more
critical when the process wants to update the index, it *must* discard
any in-memory cached data it has and re-read the index *after* the
write lock has been successfully acquired.


IMHO the update-in-place approach carries a few risks, but none of
them is really a problem:

*  Readers *must* use the retry algorithm when looking at each record
anytime the CRC-32 on an individual entry doesn't match. Retry
requires using some form of backoff, because the concurrent writer
needs to be given time to finish the writes to the storage file. If a
reader doesn't correctly implement a retry, they could see corruption.

*  Readers *must* check the CRC-32 of any entry. In fact the best way
to read an entry is memcpy() the entry's stat/SHA-1/CRC-32 from the
index into another memory buffer, compute the checksum there, and
compare. This way the reader can be certain the entry isn't mutated
after it checked the CRC-32 but before it examined a particular stat
field. Again a buggy implementation reading from the index might not
implement this strategy and complain about corruption, or silently
process data with corruption.

*  A partial write will leave a corrupted index. E.g. a process
writing a record is killed before it has a chance to fully write out
the record's data. Nobody can read that record until it is repaired.
Repair should be possible with a combination of git reset --soft to
copy the SHA-1 from HEAD and recomputing the working tree's SHA-1 to
see if the file is really clean or not. It probably isn't, and the
stat data will reflect it as dirty after the repair. We may have to
put this sort of repair logic into `git status` and `git diff` as part
of the normal "fix clean stat" pass.

*  Appending conflicting stage information to the end of the file
during a merge can be risky, because the append might be partial. The
user can recover with `git reset --hard HEAD` to abort the merge. A
partial append is probably only likely when the git merge aborted
anyway and hasn't really left you with a sane state to resolve
conflicts in.

*  Truncating away the conflicting stage information at the end of the
file can be risky if the file system doesn't truncate correctly. But I
think we can detect this and repair: if every record has its
"conflict" bit set to 0, all records' CRC-32s are valid, and we hold
the write lock, we know any conflict data at the end is bogus and
should be truncated away, so we truncate again. If truncation isn't
working correctly on this filesystem, we rewrite the entire index
file.


* Re: [GSoC] Designing a faster index format
       [not found]                                           ` <8D2805A4-9C5F-43A9-B3ED-0DB77341A03C@gmail.com>
@ 2012-04-19 10:49                                             ` Nguyen Thai Ngoc Duy
       [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
  1 sibling, 0 replies; 36+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-19 10:49 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Thomas Rast, david, Michael Haggerty, Junio C Hamano, git,
	Shawn O.Pearce, David Barr

On Thu, Apr 19, 2012 at 5:25 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> As Thomas Rast pointed out on IRC today, it would be better to first
>  implement a simple converter of the old index to the new format,
> and then implement a faster reader for the midterm, so we have a
> "future-proof" index format at the midterm and a fast reader.

You can skip the old-index reading code in the converter by writing a
git test command and reusing git code. It's really simple: modify the
Makefile, add your command name, say test-index-converter, to
TEST_PROGRAMS_NEED_X, then create test-index-converter.c with this:

#include "cache.h"

int main(int ac, char **av)
{
    int i;

    setup_git_directory();
    read_cache();
    for (i = 0; i < cache_nr; i++) {
        struct cache_entry *ce = active_cache[i];
        /* process the entry and produce the new format here */
    }
    return 0;
}

"make test-index-converter" will give you the command.

Of course this also comes at a cost, because C may be slower to write
than Python for the new-index constructing code. Or just output
something easy to parse and write the converter from there in Python.
--
Duy - Python ignorant

>  The new
>  writing algorithm will then be implemented in the second part of the
> Summer of Code.
>
> I'll just post the update on the timeline and the midterm evaluation here.
>
> -- Timeline --
> 24/04 - 01/05: Document the new index format.
> 02/05 - 11/05: Create a converter of the old index format to the new format in
> Python.
> 12/05 - 18/06: Parse the index from disk to the current in-memory format. The
> old index format shall still be readable.
> 19/06 - 09/07: Implement the re-reading of a single record, if the crc32 doesn't
> match (Meaning the record has been changed under the reader).
> - Midterm evaluation -
> 10/07 - 21/07:  Map the current internal structure to the new index format.
> 22/07 - 31/07: Change the current in-memory structure to keep track of the
> changed files.
> 01/08 - 13/08: Write the index to disk in both the old and the new format
> depending on the choice of the user and make sure only the changed parts are
> really written to disk in the new format.
> 11/08 - 13/08: Test the new index and profile the gains compared to the old
> format.
> /* Development work will be a bit slower from 18/06 to 21/07 because at my
>  * University there are exams in this period. I probably will only be able to
>  * work half the hours. I'll be back up to full speed after that. */
>
> -- Midterm evaluation --
> At the midterm, there will be a python prototype for the conversion of the old
> index format to the new "future-proof" index format and a faster reader for the
> new format. The native write part will be completed in the second part of the
> Summer of Code.


* Re: [GSoC] Designing a faster index format
       [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
@ 2012-04-20 20:02                                               ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2012-04-20 20:02 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Thomas Gummerer, Thomas Rast, david, Michael Haggerty,
	Nguyen Thai Ngoc Duy, Junio C Hamano, git, Shawn O.Pearce,
	David Barr

On Thu, Apr 19, 2012 at 02:02:32PM +0200, Thomas Rast wrote:

> > At the midterm, there will be a python prototype for the conversion of the old
> > index format to the new "future-proof" index format and a faster reader for the
> > new format. The native write part will be completed in the second part of the
> > Summer of Code.
> 
> I think this is the most important bit, and I'm curious if there are any
> objections/concerns on this.
> 
> It basically splits the project into
> 
> - first half: design, reader, raciness/locking issues
> - second half: writer, changes to take advantage of partial writing

I am OK with prototyping, and I'm OK with doing a throw-away prototype
in a different language. But doing a throw-away prototype that lasts
through the whole first half of the project makes me a little nervous.
I'm worried that the conversion from a Python prototype to actual git
code in the second half is going to go badly and end up in either:

  1. A half-finished implementation, because the integration means we
     have to go back and re-visit assumptions made in the prototype that
     aren't valid in git itself.

  2. A conversion from Python to C that is rushed and ends up doing
     things in a way that makes the conversion easier, but long-term
     maintenance hard. Merge-recursive was originally written in Python
     and converted to C, and it shows in the code. It's brittle, buggy,
     and hard to maintain (though that is just one data point; maybe it
     was just poorly written in the first place).

So I'd be more comfortable with the prototype being just for the design
phase, and developing the real reader and writer simultaneously in C.
Those implementations will also start as prototypes, but they can be
iterated on, rather than being throw-aways.

-Peff


end of thread, other threads:[~2012-04-20 20:02 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-20 21:17 [GSoC] Designing a faster index format Thomas Gummerer
2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
2012-03-21  9:22   ` Thomas Gummerer
2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
     [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
2012-03-21 11:18         ` Nguyen Thai Ngoc Duy
2012-03-21 12:51       ` Thomas Rast
2012-03-21 15:43         ` Thomas Gummerer
2012-03-21 16:19           ` Thomas Rast
2012-03-22 22:51             ` Thomas Gummerer
2012-03-23 10:10               ` Thomas Rast
2012-03-25  1:28                 ` Thomas Gummerer
2012-03-26 20:35                 ` Thomas Gummerer
2012-03-26 21:14                   ` Junio C Hamano
2012-03-27 11:08                     ` Thomas Gummerer
2012-03-27 11:47                   ` Thomas Rast
2012-03-29 15:21                     ` Thomas Gummerer
2012-03-29 21:06                       ` Junio C Hamano
2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy
2012-04-02 21:02                           ` Thomas Gummerer
2012-04-03  8:51                             ` Michael Haggerty
2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
2012-04-03 19:07                               ` Thomas Gummerer
2012-04-03 20:15                                 ` david
2012-04-04 20:05                                   ` Thomas Gummerer
2012-04-05 14:39                                     ` Noel Grandin
2012-04-05 21:49                                     ` Thomas Rast
2012-04-06  3:22                                       ` Shawn Pearce
2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
2012-04-06 15:24                                           ` Thomas Rast
2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
2012-04-06 17:13                                               ` Shawn Pearce
2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
2012-04-06 17:56                                                   ` Shawn Pearce
     [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
     [not found]                                         ` <83571955-9256-4032-9182-FA9062D28B9D@gmail.com>
     [not found]                                           ` <8D2805A4-9C5F-43A9-B3ED-0DB77341A03C@gmail.com>
2012-04-19 10:49                                             ` Nguyen Thai Ngoc Duy
     [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
2012-04-20 20:02                                               ` Jeff King
2012-04-05 10:43                                 ` Michael Haggerty
