From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Haggerty Subject: Re: [GSoC] Designing a faster index format Date: Tue, 03 Apr 2012 10:51:37 +0200 Message-ID: <4F7ABA19.7040408@alum.mit.edu> References: <8D287169-1AD9-4586-BDBC-F820220328FC@gmail.com> <871uomrubl.fsf@thomas.inf.ethz.ch> <8901F6B5-7396-44E1-9687-20BF95114728@gmail.com> <871uomq64c.fsf@thomas.inf.ethz.ch> <878virfx11.fsf@thomas.inf.ethz.ch> <2A61C352-5C71-4EDF-9DBE-01CC09AE03A5@gmail.com> <87r4we9sfo.fsf@thomas.inf.ethz.ch> <5CE5AEC7-22C8-4911-A79E-11F2F3D902A2@gmail.com> <7vk423qfps.fsf@alter.siamese.dyndns.org> <1604FE70-8B77-4EC1-823A-DC1F0334CD3A@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Nguyen Thai Ngoc Duy , Junio C Hamano , Thomas Rast , git@vger.kernel.org To: Thomas Gummerer X-From: git-owner@vger.kernel.org Tue Apr 03 10:51:56 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SEzSz-0000ws-PU for gcvg-git-2@plane.gmane.org; Tue, 03 Apr 2012 10:51:54 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752225Ab2DCIvs (ORCPT ); Tue, 3 Apr 2012 04:51:48 -0400 Received: from einhorn.in-berlin.de ([192.109.42.8]:42572 "EHLO einhorn.in-berlin.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751287Ab2DCIvr (ORCPT ); Tue, 3 Apr 2012 04:51:47 -0400 X-Envelope-From: mhagger@alum.mit.edu Received: from [192.168.100.152] (ssh.berlin.jpk.com [212.222.128.135]) (authenticated bits=0) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id q338pbgR008988 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Tue, 3 Apr 2012 10:51:38 +0200 User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.28) Gecko/20120313 Lightning/1.0b2 Thunderbird/3.1.20 In-Reply-To: <1604FE70-8B77-4EC1-823A-DC1F0334CD3A@gmail.com> X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 04/02/2012 11:02 PM, Thomas Gummerer wrote: > And here is the proposal: > Designing a faster index format > > -- Problem -- > The current git index is pretty slow when working with large git repositories, > because the whole index has to be rewritten for nearly every operation. For > example for adding a single file with git add, even though logically only one > single blob sha-1 is changed in the index, git has to recompute the hash over > the whole index and rewrite the index file. In addition to that the speed up > for writing the index can not come to the expense of the read operation, since > it is executed more often throughout a normal developers day compared to > writing the index. The current index also doesn't allow simple partial reading, > which could help speed up some commands, like git status and git diff, which > often don't need the whole index. > > -- Proposed solution -- > The proposed solution is to redesign the index to a B-tree based format. This > allows changes to the index in O(log(n)) time, with n being the number of > entries in the index. I thought that the index lock currently only blocks writers, not readers (am I wrong?). So given that you want to be able to mutate the index file without rewriting the whole file, it seems to me that you have to pick from one of these alternatives: 1. Change the locking semantics so that readers also block when the file is locked. This choice would have some important consequences: (a) readers will also have to obtain a lock before starting to read. (b) to avoid deadlock, it will become crucial that the lock is never held across the execution of any other git (sub-)commands that might want to read the index. 2. Implement a file format that can be read even while it is being mutated. If so, please explain the data file format in more detail; in particular, how do you plan to mutate the file in a way that does not disturb readers? How do you plan to read the whole index efficiently (I imagine that reading the whole index will remain a frequent operation)? I encourage you to include an analysis of the number of disk seeks when you are analyzing the cost of read/write operations on the index. This will have a strong effect on the time for working with the index when the disk cache is cold. The current index requires O(1) seeks for reading and writing, which I believe is a big part of the reason that the current read-the-whole-index/write-the-whole-index design performs excellently despite the amount of data that it is touching. > [...] > - Append-only data structure > An append-only data structure will allow for changes to the index in O(k) time, > with k being the number of files that changed. To reach this goal, the first > part of the index will be sorted and changes will always be written to the end, > in the order in which the changes occured. This last part of the index will be > merged with the rest using a heuristic, which will be the execution of git > commit and git status. > > To make sure the index isn't corrupted, without calculating the sha1 hash for > the whole index file every time something is changed, the hash is always > calculated for the whole index when merging, but when only a single entry is > changed the sha-1 hash is only calculated for the last change. This will > increase the cost for reading the index to log(n) + k * log(k) where n is the > number of entries in the sorted part of the index and k is the number of entries > in the unsorted part of the index, which will have to be merged with the rest > of the index. I don't understand this analysis of the reading time. I suppose you are assuming that you want to read the status of a single file. But in that case, it is enough to find the entry in the old index (O(log(n)) assuming some sort of tree structure) plus do a linear scan through the unsorted entries (i.e., O(k), not O(k log(k))). > [...] > This [append-only] idea was dropped, because as mentioned in the problem description reads > are more common then writes and therefore trading write speed for read speed > is not a good tradeoff. The amount of read speed that would have to be sacrificed depends on the size of k and n. Under the assumption that k << n, the read speed of an append-only index file format (with periodic compaction) would be close to optimal. For the append-only format, k is the number of files that have been changed since the index was last rewritten (including duplicates if a file has been changed more than once). Supposing that the index is compacted on branch changes and "occasionally" when k grows too large, I have the feeling that k will typically be quite small in comparison to n, especially in the huge repositories that need this optimization. Remember that a full index rewrite/compaction will only take as long as it currently takes for *any* change to the index (which is not *that* terrible). It would also be easy to estimate the size of k and n on every operation. Therefore, if an operation is expected to force a large fraction of index file entries to be invalidated, it is OK for it to force an index compaction to keep subsequent read operations fast. And even if, once in a blue moon, somebody changes *all* of the files in his huge repository while somehow evading compaction, the time to read the full index would still only be a factor of two slower than the current design. (Granted, the time to read a single entry in this scenario would be much longer than the log(n) that would be possible given a pure-tree-based design.) Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/