Re: GSoC - Designing a faster index format

From: Shawn Pearce <spearce@spearce.org>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: elton sky <eltonsky9404@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Thomas Rast <trast@inf.ethz.ch>
Subject: Re: GSoC - Designing a faster index format
Date: Mon, 2 Apr 2012 10:27:47 -0400	[thread overview]
Message-ID: <CAJo=hJuVZiik6J0nhO4jpzWYenerRQoREHLMmJoFY8W0bZR+5A@mail.gmail.com> (raw)
In-Reply-To: <20120402123146.GA24813@do>

On Mon, Apr 2, 2012 at 08:31, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> -- 8< --
> GIT index format
> ================
>
> This format replaces the old "DIRC" format. Compared to the old
> format, which is essentially a sorted list of pathnames, this one:
>
>  - is tree-based
>  - use crc32 as checksum
>  - only verify integrity on parts that git accesses, instead of whole
>   file
>  - append changes to the end
>  - allow index versioning
>
> Updates can be made directly to the index by appending to the end. The
> index traversed by locating the root tree block from the trailer. When
> a path is updated, all related tree blocks are updated and appended to
> the end, then a new trailer (with generation increased by one) is
> written to conclude the index.
>
> The index size will increase continuously. At some point, we will need
> to repack it. Let assume a tree block is 64k on average and a path
> generally consists of 3 path components.  That means an entry update
> adds 192k and we can do about 80 updates before index reaches 16M (in
> addition to initial index size).

Only 3 path components? Java sources can easily have 8-10 with a long
Maven and Java package implied prefix. This will increase the
frequency of rewrites of the index file.

> At 16M or when trailer generation hits a limit (the limit can be
> configurable), we rewrite the index to reduce its size. Some heavy
> operations can also be used to rewrite index, such as checkout or
> reset.
>
> The index integrity is verified by crc32. One crc32 covers header and
> trailer. Each block has its own crc32. When the index is found
> corrupt, we could try to roll back to latest good version by looking
> for trailers from bottom up. Even when the index is not corrupt, users
> can still look back this way for older index versions.

How do you deal with a partially written append to the index file?
E.g. if a prior update crashes or the filesystem doesn't write
everything out before power failure, you need to find the last good
trailer block in the file.

> = The git index file has the following format
>
>   - A 8-byte header consisting of
>
>     4-byte signature:
>       The signature is { 'T', 'R', 'E', 'E' }
>
>     4-byte version number:
>       The current supported versions are 1.

Why not DIRC version 4?

>   - A number of blocks of variable size
>
>      1-byte block type
>
>      3-byte content size in byte
>
>      block content

So you are limiting the size of a canonical tree now? Currently there
is no limit on the size a tree. But here the entire index structure
plus set of names must be under 16 MiB. Granted no project probably
hits that limit, but you are painting us into a corner with an upper
limit here that doesn't look like it will be easy to increase.

>      4-byte crc32 of all above
>
>   - A 18-byte trailer consisting of
>
>      4-byte trailer signature:
>        The signature is { 'R', 'O', 'O', 'T' }
>
>      2-byte generation:
>         The first trailer is 0, the second 1 and so on.
>
>      4-byte root block offset
>
>      4-byte extension table offset:
>        Zero means no extension
>
>      4-byte checksum:
>        CRC32 of the header and the trailer (excluding this field)

See above my question about how to find the last good trailer if the
last append attempt was incomplete.

> == Tree block
>
>  A tree block contains a (maybe invalid) tree object and extra
>  information of its companion in working directory. Tree block has
>  block type 'T'.
>
>  Tree block content is basically the list of non-recursive entries in
>  specified path, with all attributes we store in the index now. There
>  are a few changes though to intergrate cache-tree and allow
>  bsearch() on mmap'd block.
>
>  A tree block content consists of
>
>  - 4-byte tree object size
>
>  - 20-byte SHA-1 of the cached tree object
>
>  - a list attributes corresponding to tree object's item, in the same
>    order.  These attributes are the same as in DIRC entry format
>    except that entry name is removed, and a tree block offset is
>    added in case the item is a directory.
>
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
>    32-bit ctime nanosecond fractions
>      this is stat(2) data
>
>    32-bit mtime seconds, the last time a file's data changed
>      this is stat(2) data
>
>    32-bit mtime nanosecond fractions
>      this is stat(2) data
>
>    32-bit dev
>      this is stat(2) data
>
>    32-bit ino
>      this is stat(2) data
>
>    32-bit mode, split into (high to low bits)
>
>      4-bit object type
>        valid values in binary are 1000 (regular file), 1010 (symbolic link)
>        and 1110 (gitlink)
>
>      3-bit unused
>
>      9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>      Symbolic links and gitlinks have value 0 in this field.
>
>    32-bit uid
>      this is stat(2) data
>
>    32-bit gid
>      this is stat(2) data
>
>    32-bit file size
>      This is the on-disk size from stat(2), truncated to 32-bit.
>
>    160-bit SHA-1 for the represented object if blobs or the offset
>      to another tree block if trees
>
>    A 32-bit 'flags' field split into (high to low bits)
>
>      1-bit assume-valid flag
>
>      1-bit extended flag (must be zero in version 2)
>
>      2-bit stage (during merge)
>
>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>      is stored in this field.
>
>      1-bit skip-worktree flag (used by sparse checkout)
>
>      1-bit intent-to-add flag (used by "git add -N")
>
>      14-bit unused, must be zero
>
>    A 16-bit offset, relative to the beginning of this block, to the
>      pathname of this entry. FIXME: make it 32-bit, relative to the
>      beginning of the file, so that we can reuse pathnames from other
>      (old) blocks?

16 bit offset doesn't work well in a block that can be as large as 2^24.

If you reuse a path name list at the start of the file, how do you
handle new names?