From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-9.3 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 8CD171F744 for ; Thu, 30 Jun 2016 10:31:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932086AbcF3Kbf (ORCPT ); Thu, 30 Jun 2016 06:31:35 -0400 Received: from mail-wm0-f53.google.com ([74.125.82.53]:37066 "EHLO mail-wm0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751721AbcF3Kbc (ORCPT ); Thu, 30 Jun 2016 06:31:32 -0400 Received: by mail-wm0-f53.google.com with SMTP id a66so111816237wme.0 for ; Thu, 30 Jun 2016 03:30:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:cc:newsgroups:from:message-id:date:user-agent :mime-version:in-reply-to:content-transfer-encoding; bh=TUgIz2zRaV4ETNNU7fo2I0JiWQQgloycMyxd2kk6svU=; b=VKxVSF5suLoMGJOJATd7YAVJFuatV1/WxlWQGOKa9Jl29JOG/Aa5OOODwl6riendxE dA6YB97aTGTtlwAMXuASk2Zlz5Fo9759qZ5i+0YWGmUsb0Y2XS0d+Y1VzE/85sf1w4mF aTjXfR4e3OSYxCu3lloY9jbFW3qIPkBAKFf6Z6ehbWTr6KvjEfbAAE3u5/ruKQ/TB3UB MoclH/Fj7C8HkEl5C61pKX/mPN39A4QlizsWjqGMeqPTBLMebXEGbUEo+Cb6IIcImK5V jn4bz7+Cg+dhCQPLSyu6l6lJzZ6zs4JhthSAg1xTr1s6mbiPBJc/KK1jJuOK1XO/9/AN PfrA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:newsgroups:from :message-id:date:user-agent:mime-version:in-reply-to :content-transfer-encoding; bh=TUgIz2zRaV4ETNNU7fo2I0JiWQQgloycMyxd2kk6svU=; b=SfFgKNOMQeOagLG/9JImcKDFbjNoLPzXaM534git3bokCx7L8FhApmnqAliXalvY3Y 8qWarKBRBY2QXaDShgndDWczEonlb3QsnhlBbhpR8m+Cn4zPqghEaxRknEKVYN2+BvvO sLT4PB336L62RgBnsinMfICsjsUMaZfHjOGj6OEoethWJqk3mCzBowXi+HH3n4Y6ApcO Y7rFFSU0pYJOkGP2fj7S2eOSkVSE/VCy7m+PgWAIZYeBjGNlzdE2mK8KEIYcpHfm7Jc8 FoDcUPCrztXTBwzakMhBqseBZvg0jc+7z4Lv2Kw3U647q10YnzFidc0PRFfhkoin/qjn WDIg== X-Gm-Message-State: ALyK8tIK52aLYxsyr/WxsWHZGnnFsYa/kZ/1GoU6aW+zDX8jg25nxPlEX2Fat5zNv2Ks9Q== X-Received: by 10.194.158.225 with SMTP id wx1mr12884547wjb.177.1467282649381; Thu, 30 Jun 2016 03:30:49 -0700 (PDT) Received: from [192.168.1.34] (daw105.neoplus.adsl.tpnet.pl. [83.23.22.105]) by smtp.googlemail.com with ESMTPSA id bb4sm2284305wjb.32.2016.06.30.03.30.47 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jun 2016 03:30:48 -0700 (PDT) Subject: Re: topological index field for commit objects To: Jeff King References: <20160629205647.GA25987@sigill.intra.peff.net> <5774426F.3090000@gmail.com> <20160629220049.GA4416@sigill.intra.peff.net> Cc: Junio C Hamano , Stefan Beller , Linus Torvalds , Marc Strapetz , Git Mailing List Newsgroups: gmane.comp.version-control.git From: =?UTF-8?Q?Jakub_Nar=c4=99bski?= Message-ID: <5774F4C7.805@gmail.com> Date: Thu, 30 Jun 2016 12:30:31 +0200 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 MIME-Version: 1.0 In-Reply-To: <20160629220049.GA4416@sigill.intra.peff.net> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org W dniu 2016-06-30 o 00:00, Jeff King pisze: > On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote: > >>> So this is the ideal case for generation numbers (the worst cases are >>> when the things you are looking for are in branchy, close history where >>> the generation numbers don't tell you much; but in such cases the >>> walking is usually not too bad). >> >> There are other approaches (special indices) that help reachability >> queries beside "generation number". > > Yes, though generation numbers can help with more questions (e.g., "what > is the merge base"). Well, those special indices usually need generation number anyway. For example FELINE indices^1 (from "Reachability Queries in Very Large Graphs", http://openproceedings.org/EDBT/2014/paper_166.pdf, that I found when trying to find more about compressed bitmap indices used by Git) also utilize generation numbers to speed up reachability analysis. The idea behind FELINE is to associate every vertex (commit) with a unique ordered pair of natural integers (i.e. two more numbers, in addition to the level aka generation number), so that partial order over N^2 is superset of partial order of graph (DAG of revisions). It can answer in O(1) that nodes are not connected, and cut search space if they are. BTW. some references in this paper ("Related work" section) use PWAH compression scheme - perhaps it would work with EWAH that Git uses? 1. FELINE = Fast rEfined onLINE search ... or fun with acronyms >> By the way, what should happen if you add a replacement (in the git-replace >> meaning) that creates a shortcut, therefore invalidating generation numbers, >> at least in strict sense - committerdate as generation number would be still >> good, I think? > > This is one of the open questions. My older patches turned them off when > replacements and grafts are in effect. Well, if you store the cache of generation numbers in the packfile, or in the index of the packfile, or in the bitmap file, or in separate bitmap-like file, generating them on repack, then of course any grafts or replacements invalidate them... though for low level commands (like object counting) replacements are transparent -- or rather they are (and can be) treated as any other ref for reachability analysis. Well, if there are no grafts, you could still use them for doing "git --no-replace-objects log ...", isn't it? SIDENOTE: should grafts be deprecated and later removed, now that Git has superior alternative in the form of git-replace, and a simple way to convert grafts to replacements? Anyway, if file with mapping from revisions to its generation numbers were stored outside of packfiles, it could simply be updated / rewritten when adding new replacement. You could use one cache for no-replace, and one for replace. Though I do wonder if it would be possible to do such rewrite fast... well, fast enough assuming that adding replacements is rare. Answering my own question: >> committerdate as generation number would be still good, I think? No, it wouldn't: pre replace: 1 <-- 2 <-- 3 <-- 5 <-- \ \----- 4 <-- 6 <-- 7 <-- 8 post replace 1 <-- 2 \-- 3' <-- 5 \ \ \------ 4 <-- 6 <-- 7 <-- 8 <--\ Either the replacement commit would have generation number correct, but its child wouldn't, or its child would have correct generation number but the replacement wouldn't. >>> I have patches that generate and store the numbers at pack time, similar >>> to the way we do the reachability bitmaps. Ah, so those cached generation numbers are generated and stored at pack time. Where you store them: is it a separate file? Bitmap file? Packfile? >>> They're not production ready, >>> but they could probably be made so without too much effort. You wouldn't >>> have ready-made generation numbers for commits since the last full >>> repack, but you can compute them incrementally based on what you do have >>> at a cost linear to the unpacked commits (this is the same for bitmaps). >> >> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still >> limited to object counting? > > At GitHub we are using them for --contains analysis, along with mass > ahead/behind (e.g., as in https://github.com/gitster/git/branches). My > plan is to send patches upstream, but they need some cleanup first. That would be nice to have, please. Er, is mass ahead/behind something that can be plugged into Git (e.g. for "git branch -v -v"), or is it something GitHub-specific? P.S. Having Git ensure that committerdate (as an epoch) is greater than committerdates of its parents at the commit creation time (with providing warning about time skew, perhaps not doing it if skew is too large) would be not costly. No need to add anything, and it would help future queries to be faster, I think. -- Jakub Narębski