Date: Mon, 26 Oct 2020 12:46:35 -0700
In-Reply-To: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>
Message-Id: <20201026194635.2119420-1-jonathantanmy@google.com>
References: <2f04c074-3eee-766c-bedb-2e3cc0a91528@syntevo.com>
Subject: Re: Questions about partial clone with '--filter=tree:0'
From: Jonathan Tan
To: alexandr.miloslavskiy@syntevo.com
Cc: jonathantanmy@google.com, git@vger.kernel.org,
    christian.couder@gmail.com, marc.strapetz@syntevo.com, me@ttaylorr.com
X-Mailing-List: git@vger.kernel.org

> > Having such an option (and teaching "blame" to use it to prefetch) would
> > indeed speed up "blame". But if we implement this, what would happen if
> > the user ran "blame" on the same file twice?
> > I can't think of a way of
> > preventing the same fetch from happening twice except by checking the
> > existence of, say, the last 10 OIDs corresponding to that path. But if
> > we have the list of those 10 OIDs, we could just prefetch those 10 OIDs
> > without needing a new filter.
>
> I must admit that I didn't notice this problem. Still, it seems easy
> enough to solve with this approach:
>
> 1) Estimate number of missing things
> 2) If "many", just download everything for as described before
>    and consider it done.
> 3) If "not so many", assemble a list of OIDs on the boundary of unknown
>    (for example, all root tree OIDs for commits that are missing any
>    trees) and use the usual fetch to download all OIDs in one go.
> 4) Repeat step 3 multiple times. Only N= requests
>    are needed, regardless of the number of commits.

My point was that if you can estimate it ("have the list of those 10
OIDs"), then you can just fetch it. This does send "quite a bit of
OIDs", as you said below - I'll address it below.

> > Another possible solution that has been discussed before (but a much
> > more involved one) is to teach Git to be able to serve results of
> > computations, and then have "blame" be able to stitch that with local
> > data. (For example, "blame" could check the history of a certain path to
> > find the commit(s) that the remote has information of, query the remote
> > for those commits, and then stitch the results together with local
> > history.) This scheme would work not only for "blame" but for things
> > like "grep" (with history) and "log -S", whereas
> > "--filter=sparse:parthlist" would only work with "blame". But
> > admittedly, this solution is more involved.
>
> I understand that you're basically talking about implementing
> prefetching in git itself?

No - I did talk about prefetching earlier, but here I mean having Git on
the server perform the "blame" computation itself. For example, let's
say I want to run "blame" on foo.txt at HEAD.
HEAD and HEAD^ are commits that only the local client has, whereas
HEAD^^ was fetched from the remote. By comparing HEAD, HEAD^, and
HEAD^^, Git knows which lines come from HEAD and HEAD^. For the rest,
Git would make a request to the server, passing the commit ID and the
path, and would get back a list of line numbers and commits.

> To my understanding, this will still need
> either the command I suggested, or implement graph walking with massive
> OID requests as described above in 1)2)3)4). The latter will not require
> protocol changes, but will involve sending quite a bit of OIDs around.

Yes, prefetching will require graph walking with large OID requests but
will not require protocol changes, as you say. I'm not too worried about
the large numbers of OIDs - Git servers already have to support
relatively large numbers of OIDs to support the bulk prefetch we do
during things like checkout and diff.
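
To make the "graph walking with large OID requests" idea concrete, here
is a rough Python sketch, not Git code. It models objects as a plain
dict from OID to the OIDs it references, and fetches every missing OID
on the current boundary in one request per round. The names
prefetch_in_rounds and fetch_batch, and the dict-based object model, are
made up for illustration; a real implementation would presumably sit on
top of the existing fetch machinery.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical object model (an assumption, not Git's): each OID maps to
# the OIDs it references, e.g. a commit to its root tree, a tree to its
# subtrees and blobs, a blob to nothing.
ObjectGraph = Dict[str, List[str]]


def prefetch_in_rounds(
    roots: Iterable[str],
    local: ObjectGraph,
    fetch_batch: Callable[[List[str]], ObjectGraph],
) -> int:
    """Walk level by level from roots. In each round, request all missing
    OIDs on the boundary in one batch. Returns the number of requests."""
    requests = 0
    frontier: Set[str] = set(roots)
    seen: Set[str] = set()
    while frontier:
        missing = sorted(oid for oid in frontier if oid not in local)
        if missing:
            # One request per round, however many OIDs are missing.
            # Assumes the remote returns every OID that was asked for.
            local.update(fetch_batch(missing))
            requests += 1
        seen |= frontier
        frontier = {child for oid in frontier for child in local[oid]} - seen
    return requests


# Toy remote: two commits whose trees share one subtree.
remote: ObjectGraph = {
    "c1": ["t1"], "c2": ["t2"],
    "t1": ["t1a", "b1"], "t2": ["t1a", "b2"],
    "t1a": ["b3"],
    "b1": [], "b2": [], "b3": [],
}
# The client has the commits but none of the trees or blobs.
local: ObjectGraph = {"c1": ["t1"], "c2": ["t2"]}
batches: List[List[str]] = []


def fetch_batch(oids: List[str]) -> ObjectGraph:
    batches.append(oids)
    return {oid: remote[oid] for oid in oids}


first_run = prefetch_in_rounds(["c1", "c2"], local, fetch_batch)
second_run = prefetch_in_rounds(["c1", "c2"], local, fetch_batch)
print(first_run, second_run, batches)
```

In this toy model the number of requests is bounded by the depth of the
graph rather than by the number of commits, and the second run makes no
requests at all because every OID is already present locally.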