All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Thomas Rast <trast@student.ethz.ch>,
	git@vger.kernel.org
Subject: Re: [GSoC] Designing a faster index format
Date: Fri, 30 Mar 2012 12:19:48 +0700	[thread overview]
Message-ID: <CACsJy8Ag9yvGwKE_oiW8T+hR2hN_fzXvGCdOJ_H44DCOm9RF0Q@mail.gmail.com> (raw)
In-Reply-To: <7vk423qfps.fsf@alter.siamese.dyndns.org>

On Fri, Mar 30, 2012 at 4:06 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Is it really the time consuming part in the overall picture?  Do you have
> vital statistics for various pieces to justify that you are solving the
> right problem?  E.g. (not exhaustive)
>
>  - read_index(): overhead to
>   - validate the checksum of the whole index;
>   - extract entries into the_index.cache[] array;
>
>  - write_index(): overhead to
>   - serialize entries into the on-disk format;
>   - compute the checksum over the entire index;
>   - write the whole index to disk;
>
>  - frequency of the above two operations.

Also maybe the frequency of entry updates vs additions/removals. I
suspect refresh operation in some case can update a lot of entries. If
that's the case (and happens often), we may need special treatment for
it because simply appending entries might be costly.

> Also, optimizing the write codepath by penalizing the read codepath is a
> wrong trade-off if the index is read far more often than it is written
> during a single day of a developer.

I suspect so too, but some measurement has to be done there. It'd be
good if you provide a patch to collect index operation statistics.
Some of us can try it on for a few weeks. That would give us a better
picture.

On Thu, Mar 29, 2012 at 10:21 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> - Tree structure
> The advantage of the tree structure over the append-only data structure that was
> proposed would be that it would not require any merge or other maintainance work
> work after a change of a file. The disadvantage however is that changing a file
> would always require log(n) changes to the index, where n is the number of
> entries in the index. Another problem might be the integrity check, which would
> need either to use a hash over the whole index file or a hash on the tree,
> which would however take more time for checking.

I'd say it takes less time for checksuming because we only verify the
trees we read. And tree-based structure allows us to read just a
subdirectory, an advantage for multi-project repositories where people
stay in a subdir most of the time. Shawn suspected crc32 checksum over
a large chunk of data may be insufficient. By hashing tree by tree,
the chunks are significantly shorter, making crc32 viable and cheap
candidate compared to sha-1 (also note that crc32 is used to verify
compressed objects in a pack)
-- 
Duy

  reply	other threads:[~2012-03-30  5:20 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-20 21:17 [GSoC] Designing a faster index format Thomas Gummerer
2012-03-21  1:29 ` Nguyen Thai Ngoc Duy
2012-03-21  9:22   ` Thomas Gummerer
2012-03-21 10:34     ` Nguyen Thai Ngoc Duy
     [not found]       ` <CALgYhfPOJpKbM__iU4KvChWeXWyyhWb2ocR-SLvrQQHNw5F5dQ@mail.gmail.com>
2012-03-21 11:18         ` Nguyen Thai Ngoc Duy
2012-03-21 12:51       ` Thomas Rast
2012-03-21 15:43         ` Thomas Gummerer
2012-03-21 16:19           ` Thomas Rast
2012-03-22 22:51             ` Thomas Gummerer
2012-03-23 10:10               ` Thomas Rast
2012-03-25  1:28                 ` Thomas Gummerer
2012-03-26 20:35                 ` Thomas Gummerer
2012-03-26 21:14                   ` Junio C Hamano
2012-03-27 11:08                     ` Thomas Gummerer
2012-03-27 11:47                   ` Thomas Rast
2012-03-29 15:21                     ` Thomas Gummerer
2012-03-29 21:06                       ` Junio C Hamano
2012-03-30  5:19                         ` Nguyen Thai Ngoc Duy [this message]
2012-04-02 21:02                           ` Thomas Gummerer
2012-04-03  8:51                             ` Michael Haggerty
2012-04-03 12:28                               ` Nguyen Thai Ngoc Duy
2012-04-03 19:07                               ` Thomas Gummerer
2012-04-03 20:15                                 ` david
2012-04-04 20:05                                   ` Thomas Gummerer
2012-04-05 14:39                                     ` Noel Grandin
2012-04-05 21:49                                     ` Thomas Rast
2012-04-06  3:22                                       ` Shawn Pearce
2012-04-06 15:11                                         ` Nguyen Thai Ngoc Duy
2012-04-06 15:24                                           ` Thomas Rast
2012-04-06 15:44                                             ` Nguyen Thai Ngoc Duy
2012-04-06 17:13                                               ` Shawn Pearce
2012-04-06 17:23                                                 ` Nguyen Thai Ngoc Duy
2012-04-06 17:56                                                   ` Shawn Pearce
     [not found]                                       ` <878vi18eqd.fsf@thomas.inf.ethz.ch>
     [not found]                                         ` <83571955-9256-4032-9182-FA9062D28B9D@gmail.com>
     [not found]                                           ` <8D2805A4-9C5F-43A9-B3ED-0DB77341A03C@gmail.com>
2012-04-19 10:49                                             ` Nguyen Thai Ngoc Duy
     [not found]                                             ` <877gxcoron.fsf@thomas.inf.ethz.ch>
2012-04-20 20:02                                               ` Jeff King
2012-04-05 10:43                                 ` Michael Haggerty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACsJy8Ag9yvGwKE_oiW8T+hR2hN_fzXvGCdOJ_H44DCOm9RF0Q@mail.gmail.com \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=t.gummerer@gmail.com \
    --cc=trast@student.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.