From: Tao Klerks
Date: Tue, 1 Jun 2021 15:16:51 +0200
Subject: Re: Removing Partial Clone / Filtered Clone on a repo
To: Derrick Stolee
Cc: git@vger.kernel.org

On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee wrote:
> Could you describe more about your scenario and why you want to
> get all objects?
A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
you want to be able to get something useful to the user as fast as
possible (read: in less than the 4 hours it would take to download the
whole thing over a mediocre VPN, with the corresponding risk of errors
partway), but where a user might later (eg overnight) want to get the
rest of the repo, to avoid history-inconsistency issues.

In our current mode of operation (shallow clones to 15 months' depth by
default), the initial clone can complete in well under an hour, but the
problem with the resulting clone is that normal git tooling will see
the shallow grafted commit as the "initial commit" of all older files,
and that causes no end of confusion on the part of users, eg in
"git blame". This is the main reason why we would like to consider
moving to full-history but filtered-blob clones.

(There are other reasons around manageability, eg the git server's
behavior around --shallow-since when some branches in refspec scope are
older than that date: it sends them with all their history, effectively
downloading the whole repo. Similarly, if a refspec is expanded and the
next fetch is run without an explicit --shallow-since and finds new
branches not already shallow-grafted, it will download those in their
entirety, because the shallow-since date is not persisted anywhere
beyond the shallow grafts themselves.)

With a (full-history, all-trees, no-blobs-except-HEAD) filtered clone,
the initial download can be quite a bit smaller than in the shallow
clone scenario above (eg 1.5GB vs 2.2GB), and most of the disadvantages
of shallow clones are addressed: the just-in-time fetching typically
works quite naturally, there are no "lies" in the history, and there
are no scenarios where you suddenly fetch an extra 10GB of history
without wanting/expecting to.

With the filtered clone there are still a few edge cases that might
motivate a user to "bite the bullet" and unfilter their clone, however.
The most obvious one I've found so far is "git blame": it loops fetch
requests serially until it bottoms out, which on an older,
poorly-factored file (hundreds or thousands of commits, each touching
different bits of the file) will effectively never complete, at
10s/fetch. And depending on the UI tooling the user is using, they may
have almost no visibility into why this "git blame" (or "annotate", or
whatever the given UI calls it) seems to hang forever.

You can work around this "git blame" issue for *most* situations, in
the case of our repo, by using a different initial filter spec, eg
"--filter=blob:limit=200k", which only costs you an extra 1GB or so...
But then you still have outliers - and in fact, the most "blameable"
files will tend to be the larger ones... :)
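For concreteness, the two initial-clone flavors I'm comparing would be
roughly the following (just sketches, with <repo-url> standing in for
our real repository):

    git clone --filter=blob:none <repo-url>
    git clone --filter=blob:limit=200k <repo-url>

The first gives full history and all trees but only the blobs needed to
check out HEAD; the second additionally includes all blobs smaller than
200k up front.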
My working theory is that we should explain all of the following to
users:

* Your initial download is a nice compromise between functionality and
download delay.
* You have almost all the useful history, and you have it within less
than an hour.
* If you try to use "git blame" (or some other as-yet-undiscovered
scenario) on a larger file, it may hang. In that case, cancel, run a
magic command we provide which fetches all the blobs in that specific
file's history, and try again. (The magic command is a path-filtered
rev-list looking for missing objects, passed into fetch - see the P.S.
below for a rough sketch.)
* If you ever get tired of the rare weird hangs, you have the option of
running *some process* that "unfilters" the repo, paying down that
initial compromise (and taking up a bit more disk space), eg overnight.

This explanation is a little awkward, but less awkward than the
previous one - "'git blame' lies to you: it blames completely the wrong
person for the bulk of the history, on the bulk of the files; unshallow
overnight if this bothers you" - which is the current story with
shallow clones.

Of course, if unfiltering a large repo is impractical (and if it will
remain so), then we will probably need to err on the side of generosity
in the original clone - eg 1M instead of 200k as the blob filter, 3GB
vs 2.5GB as the initial download - and remove the last line of the
explanation! If unfiltering, or refiltering, were practical, then we
would probably err on the side of fewer-blobs-by-default to optimize
the first download.

Over time, as we refactor the project itself to reduce the incidence of
megafiles, I would expect to be able to drop the standard/recommended
blob-size-limit too.

Sorry about the wall of text - hopefully I've answered the question!

Thanks,
Tao
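P.S. To make the "magic command" concrete, I mean something roughly
along these lines - just a sketch, with <path> as a placeholder, and
assuming the promisor remote accepts fetch requests for individual
OIDs (which it already has to, for the on-demand fetching to work at
all):

    git rev-list --objects --missing=print HEAD -- <path> |
        grep '^?' | cut -c2- |
        xargs git fetch origin

i.e. list all objects in the history of <path> that are missing
locally, strip the "?" markers that --missing=print adds, and fetch
those OIDs from the remote in one batch instead of one at a time.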