From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Haggerty Subject: Re: [GSoC] Designing a faster index format Date: Thu, 05 Apr 2012 12:43:37 +0200 Message-ID: <4F7D7759.6020603@alum.mit.edu> References: <8D287169-1AD9-4586-BDBC-F820220328FC@gmail.com> <871uomrubl.fsf@thomas.inf.ethz.ch> <8901F6B5-7396-44E1-9687-20BF95114728@gmail.com> <871uomq64c.fsf@thomas.inf.ethz.ch> <878virfx11.fsf@thomas.inf.ethz.ch> <2A61C352-5C71-4EDF-9DBE-01CC09AE03A5@gmail.com> <87r4we9sfo.fsf@thomas.inf.ethz.ch> <5CE5AEC7-22C8-4911-A79E-11F2F3D902A2@gmail.com> <7vk423qfps.fsf@alter.siamese.dyndns.org> <1604FE70-8B77-4EC1-823A-DC1F0334CD3A@gmail.com> <4F7ABA19.7040408@alum.mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Nguyen Thai Ngoc Duy , Junio C Hamano , Thomas Rast , git@vger.kernel.org To: Thomas Gummerer X-From: git-owner@vger.kernel.org Thu Apr 05 12:43:57 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SFkAW-00005K-V1 for gcvg-git-2@plane.gmane.org; Thu, 05 Apr 2012 12:43:57 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753926Ab2DEKnv (ORCPT ); Thu, 5 Apr 2012 06:43:51 -0400 Received: from einhorn.in-berlin.de ([192.109.42.8]:41012 "EHLO einhorn.in-berlin.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751905Ab2DEKnv (ORCPT ); Thu, 5 Apr 2012 06:43:51 -0400 X-Envelope-From: mhagger@alum.mit.edu Received: from [192.168.100.152] (ssh.berlin.jpk.com [212.222.128.135]) (authenticated bits=0) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id q35AhbRV028150 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 5 Apr 2012 12:43:38 +0200 User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.28) Gecko/20120313 Lightning/1.0b2 Thunderbird/3.1.20 In-Reply-To: X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 04/03/2012 09:07 PM, Thomas Gummerer wrote: > On Apr 3, 2012, at 10:51 AM, Michael Haggerty wrote: >> On 04/02/2012 11:02 PM, Thomas Gummerer wrote: >>> - Append-only data structure >>> [...] >>> To make sure the index isn't corrupted, without calculating the sha1 hash for >>> the whole index file every time something is changed, the hash is always >>> calculated for the whole index when merging, but when only a single entry is >>> changed the sha-1 hash is only calculated for the last change. This will >>> increase the cost for reading the index to log(n) + k * log(k) where n is the >>> number of entries in the sorted part of the index and k is the number of entries >>> in the unsorted part of the index, which will have to be merged with the rest >>> of the index. >> >> I don't understand this analysis of the reading time. I suppose you are >> assuming that you want to read the status of a single file. But in that >> case, it is enough to find the entry in the old index (O(log(n)) >> assuming some sort of tree structure) plus do a linear scan through the >> unsorted entries (i.e., O(k), not O(k log(k))). > > The current way git operates it always reads the whole index, making it necessary > to merge the unsorted entries with the sorted part. Thinking about it it would even > be O(k log(n)), because the appended part is unsorted. > O(log(n)) + O(k) would be the complexity for loading only a single entry from the > index. I was confused because in your original mail you seemed to claim that reading the whole sorted part of the index scales like O(log(n)), where it certainly scales at least like O(n). To read the whole index if using an append-only data structure, I would do the following: 1. Read the file header to find where the addenda begin: one seek plus O(1). 2. Read the addenda in order (I assume each addendum to be sorted on disk), and merge-sort the addenda together, discarding the earlier of any duplicates: one seek plus O(k) I/O plus O(k log k) computation (this is the worst case, if each addendum contains a single file). 3. Read the sorted part of the file in order, while merging it together with the combined addenda: one seek plus O(n) I/O plus O(n + k) computation. Total: 3 seeks plus O(n+k) I/O plus O(n + k log(k)) computation. Whereas for a B-tree, it is hard to estimate the complexity because you have provided very little detail about how you want to lay the data structure out on disk. But presumably the number of seeks will be significantly larger. And if you are not careful, the number of seeks will approach the number of nodes in the index O(n) or perhaps the number of added nodes (not files!) which could go something like O(k log(k)). Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/