* Removing Partial Clone / Filtered Clone on a repo
@ 2021-06-01 10:24 Tao Klerks
  2021-06-01 10:39 ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread

From: Tao Klerks @ 2021-06-01 10:24 UTC (permalink / raw)
To: git

Hi folks,

I'm trying to deepen my understanding of the Partial Clone functionality for
a possible deployment at scale (with a large-ish 13GB project where we are
using date-based shallow clones for the time being), and one thing that I
can't get my head around yet is how you "unfilter" an existing filtered
clone.

The gitlab intro document
(https://docs.gitlab.com/ee/topics/git/partial_clone.html#remove-partial-clone-filtering)
suggests that you need to get the full list of missing blobs, and pass that
into a fetch:

    git fetch origin $(git rev-list --objects --all --missing=print | grep -oP '^\?\K\w+')

In my project's case, that would be millions of blob IDs! I tested this with
a path-based filter to rev-list, to see what getting 30,000 blobs might look
like, and it took a long while... I don't understand much about the
negotiation process, but I have to assume there is a fixed per-blob cost in
this scenario which is *much* higher than in a "regular" fetch or clone.

Obviously one answer is to throw away the repo and start again with a clean
unfiltered clone... But between repo-local config, project settings in IDEs
/ external tools, and unpushed local branches, this is an awkward thing to
ask people to do.

I initially thought it might be possible to add an extra remote (without
filter / promisor settings), mess with the negotiation settings to make the
new remote not know anything about what's local, and then get a full set of
refs and their blobs from that remote... but I must have misunderstood how
the negotiation-tip stuff works, because I can't get that to do anything (it
always "sees" my existing refs and I just get the new remote's refs "for
free" without object transfer).
The official doc at https://git-scm.com/docs/partial-clone makes no mention
of plans or goals (or non-goals) related to this "unfiltering" - is it
something that we should expect a story to emerge around?

Thanks,
Tao Klerks
* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 10:24 Removing Partial Clone / Filtered Clone on a repo Tao Klerks
@ 2021-06-01 10:39 ` Derrick Stolee
  2021-06-01 13:16   ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread

From: Derrick Stolee @ 2021-06-01 10:39 UTC (permalink / raw)
To: Tao Klerks, git

On 6/1/21 6:24 AM, Tao Klerks wrote:
> Hi folks,
>
> I'm trying to deepen my understanding of the Partial Clone
> functionality for a possible deployment at scale (with a large-ish
> 13GB project where we are using date-based shallow clones for the time
> being), and one thing that I can't get my head around yet is how you
> "unfilter" an existing filtered clone.
>
> The gitlab intro document
> (https://docs.gitlab.com/ee/topics/git/partial_clone.html#remove-partial-clone-filtering)
> suggests that you need to get the full list of missing blobs, and pass
> that into a fetch...:
>
> git fetch origin $(git rev-list --objects --all --missing=print | grep
> -oP '^\?\K\w+')

I think the short answer is to split your "git rev-list" call into batches
by limiting the count. Perhaps pipe that command to a file and then split it
into batches of "reasonable" size. Your definition of "reasonable" may vary,
so try a few numbers.

> The official doc at https://git-scm.com/docs/partial-clone makes no
> mention of plans or goals (or non-goals) related to this "unfiltering"
> - is it something that we should expect a story to emerge around?

The design is not intended for this kind of "unfiltering". The feature is
built for repositories where doing so would be too expensive (both network
time and disk space) to be valuable.

Also, asking for the objects one-by-one like this is very inefficient on the
server side. A fresh clone can make use of existing delta compression in a
way that this type of request cannot (at least, not easily).
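The batch-splitting approach described above might be sketched roughly as
follows (the batch size of 1000 and the remote name "origin" are
illustrative assumptions, not part of the original suggestion):

```shell
# Sketch: fetch missing blobs in batches instead of one huge command line.
# Assumes a partial clone whose promisor remote is named "origin".

# List all missing object ids ("--missing=print" prefixes them with "?").
git rev-list --objects --all --missing=print \
  | sed -n 's/^?//p' > /tmp/missing-blobs.txt

# Split into batches of 1000 ids, then fetch each batch in one round trip.
split -l 1000 /tmp/missing-blobs.txt /tmp/blob-batch.
for batch in /tmp/blob-batch.*; do
  xargs git fetch origin < "$batch"
done
```

Each `git fetch` invocation still pays the per-request negotiation overhead
discussed in this thread, but batching at least bounds the command-line
length and lets the server pack many blobs per round trip.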
You _would_ be better off making a fresh clone and then adding that
pack-file to the .git/objects/pack directory of the repository you want.

Could you describe more about your scenario and why you want to get all
objects?

Thanks,
-Stolee
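The fresh-clone-plus-pack idea above might look something like this (a
sketch only: the clone URL and paths are illustrative, it assumes the bare
clone's objects arrive as pack files, and the filtered repo's promisor
configuration would still need attention afterwards):

```shell
# Sketch: graft a full clone's pack into an existing filtered clone.
# Assumes the fresh bare clone stores its objects in pack files.
git clone --bare https://example.com/big-repo.git /tmp/full.git

# Copy the pack(s) and their indexes into the existing object store.
cp /tmp/full.git/objects/pack/pack-*.pack \
   /tmp/full.git/objects/pack/pack-*.idx \
   .git/objects/pack/
```

This preserves the server-side delta compression of a normal clone, which is
exactly what per-blob fetching loses.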
* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 10:39 ` Derrick Stolee
@ 2021-06-01 13:16   ` Tao Klerks
  2021-06-01 13:40     ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread

From: Tao Klerks @ 2021-06-01 13:16 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git

On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@gmail.com> wrote:

> Could you describe more about your scenario and why you want to
> get all objects?

A 13GB (with 1.2GB shallow head) repo is in that in-between spot where you
want to be able to get something useful to the user as fast as possible
(read: in less than the 4 hours it would take to download the whole thing
over a mediocre VPN, with corresponding risk of errors partway), but where a
user might later (eg overnight) want to get the rest of the repo, to avoid
history inconsistency issues.

In our current mode of operation (shallow clones to 15 months' depth by
default), the initial clone can complete in well under an hour, but the
problem with the resulting clone is that normal git tooling will see the
shallow grafted commit as the "initial commit" of all older files, and that
causes no end of confusion on the part of users, eg on "git blame". This is
the main reason why we would like to consider moving to full-history but
filtered-blob clones.

(There are other reasons around manageability, eg the git server's behavior
around --shallow-since when some branches in refspec scope are older than
that date: it sends them with all their history, effectively downloading the
whole repo. Similarly, if a refspec is expanded and the next fetch is run
without an explicit --shallow-since and finds new branches not already
shallow-grafted, it will download those in their entirety, because the
shallow-since date is not persisted beyond the shallow grafts themselves.)
With a (full-history all-trees no-blobs-except-HEAD) filtered clone, the
initial download can be quite a bit smaller than in the shallow clone
scenario above (eg 1.5GB vs 2.2GB), and most of the disadvantages of shallow
clones are addressed: the just-in-time fetching can typically work quite
naturally, there are no "lies" in the history, nor are there scenarios where
you suddenly fetch an extra 10GB of history without wanting/expecting to.

With the filtered clone there are still little edge-cases that might
motivate a user to "bite the bullet" and unfilter their clone, however. The
most obvious one I've found so far is "git blame" - it loops fetch requests
serially until it bottoms out, which on an older poorly-factored file
(hundreds or thousands of commits, each touching different bits of a file)
will effectively never complete, at 10s/fetch. And depending on the UI
tooling the user is using, they may have almost no visibility into why this
"git blame" (or "annotate", or whatever the given UI calls it) seems to hang
forever.

You can work around this "git blame" issue for *most* situations, in the
case of our repo, by using a different initial filter spec, eg
"--filter=blob:limit=200k", which only costs you an extra 1GB or so... But
then you still have outliers - and in fact, the most "blameable" files will
tend to be the larger ones... :)

My working theory is that we should explain all the following to users:

* Your initial download is a nice compromise between functionality and
  download delay
* You have almost all the useful history, and you have it within less than
  an hour
* If you try to use "git blame" (or some other as-yet-undiscovered
  scenarios) on a larger file, it may hang. In that case cancel, run a magic
  command we provide which fetches all the blobs in that specific file's
  history, and try again.
  (The magic command is a path-filtered rev-list looking for missing
  objects, passed into fetch.)
* If you ever get tired of the rare weird hangs, you have the option of
  running *some process* that "unfilters" the repo, paying down that initial
  compromise (and taking up a bit more HD space), eg overnight

This explanation is a little awkward, but less awkward than the previous
"'git blame' lies to you - it blames completely the wrong person for the
bulk of the history for the bulk of the files; unshallow overnight if this
bothers you", which is the current story with shallow clone.

Of course, if unfiltering a large repo is impractical (and if it will remain
so), then we will probably need to err on the side of generosity in the
original clone - eg 1M instead of 200k as the blob filter, 3GB vs 2.5GB as
the initial download - and remove the last line of the explanation! If
unfiltering, or refiltering, were practical, then we would probably err on
the side of fewer blobs by default, to optimize the first download.

Over time, as we refactor the project itself to reduce the incidence of
megafiles, I would expect to be able to drop the standard/recommended
blob-size-limit too.

Sorry about the wall-of-text, hopefully I've answered the question!

Thanks,
Tao
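The "magic command" described in this message - a path-filtered rev-list
looking for missing objects, passed into fetch - might be sketched as
follows (the path and remote name are illustrative assumptions):

```shell
# Sketch: backfill every historical blob of one file in a blobless clone,
# so a subsequent "git blame" on it needs no further round trips.
# Assumes the promisor remote is named "origin"; the path is an example.
path="src/some/LargeOldFile.java"
git rev-list --objects --all --missing=print -- "$path" \
  | sed -n 's/^?//p' \
  | xargs -r git fetch origin
```

The `sed` step keeps only the lines that `--missing=print` prefixes with
"?", i.e. the object ids that are not present locally; `xargs -r` (GNU
xargs) skips the fetch entirely when nothing is missing.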
* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 13:16 ` Tao Klerks
@ 2021-06-01 13:40   ` Derrick Stolee
  2021-06-01 16:54     ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread

From: Derrick Stolee @ 2021-06-01 13:40 UTC (permalink / raw)
To: Tao Klerks; +Cc: git

On 6/1/2021 9:16 AM, Tao Klerks wrote:
> On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@gmail.com> wrote:
>
>> Could you describe more about your scenario and why you want to
>> get all objects?
>
> A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
> you want to be able to get something useful to the user as fast as
> possible (read: in less than the 4 hours it would take to download the
> whole thing over a mediocre VPN, with corresponding risk of errors
> partway), but where a user might later (eg overnight) want to get the
> rest of the repo, to avoid history inconsistency issues.

As you describe below, the inconsistency is in terms of performance, not
correctness. I thought it was worth a clarification.

...

> With the filtered clone there are still little edge-cases that might
> motivate a user to "bite the bullet" and unfilter their clone,
> however: The most obvious one I've found so far is "git blame" - it
> loops fetch requests serially until it bottoms out, which on an older
> poorly-factored file (hundreds or thousands of commits, each touching
> different bits of a file) will effectively never complete, at
> 10s/fetch. And depending on the UI tooling the user is using, they may
> have almost no visibility into why this "git blame" (or "annotate", or
> whatever the given UI calls it) seems to hang forever.

I'm aware that the first 'git blame' on a file is a bit slow in the partial
clone case. It's been on my list for improvement whenever I get the "spare"
time to do it.
However, if someone else wants to work on it, I will briefly outline the
approach I was going to investigate:

During the history walk for 'git blame', it might be helpful to collect a
batch of blobs to download in a single round trip. This requires refactoring
the search to walk the commit history and collect a list of (commit id, blob
id) pairs as if we were doing a simplified history walk. We can then ask for
the list of blob ids in a single request and then perform the line-by-line
blaming logic on that list. [If we ever hit a point where we would do a
rename check, pause the walk, request all blobs so far, and flush the
line-by-line diff before continuing.]

This basic idea is likely difficult to implement, but would likely
dramatically improve the first 'git blame' in a blobless clone. A similar
approach could maybe be used by the line-log logic (git log -L).

> You can work around this "git blame" issue for *most* situations, in
> the case of our repo, by using a different initial filter spec, eg
> "--filter=blob:limit=200k", which only costs you an extra 1GB or so...
> But then you still have outliers - and in fact, the most "blameable"
> files will tend to be the larger ones... :)

I'm interested in this claim that 'the most "blameable" files will tend to
be the larger ones.' I typically expect blame to be used on human-readable
text files, and my initial reaction is that larger files are harder to use
with 'git blame'.

However, your 200k limit isn't so large that we can't expect _some_ files to
reach that size. Looking at the root of git.git, I see a few files above
100k, and files like diff.c reaching very close to 200k (uncompressed). I
tend to find that the files in git.git are smaller than those in the typical
large project.
> My working theory is that we should explain all the following to users:
> * Your initial download is a nice compromise between functionality and
> download delay
> * You have almost all the useful history, and you have it within less
> than an hour
> * If you try to use "git blame" (or some other as-yet-undiscovered
> scenarios) on a larger file, it may hang. In that case cancel, run a
> magic command we provide which fetches all the blobs in that specific
> file's history, and try again. (the magic command is a path-filtered
> rev-list looking for missing objects, passed into fetch)
> * If you ever get tired of the rare weird hangs, you have the option
> of running *some process* that "unfilters" the repo, paying down that
> initial compromise (and taking up a bit more HD space), eg overnight

Partial clone is all about tradeoffs: you get faster clones that download
missing objects as needed. The user behavior dictates how many objects are
needed, so the user has the capability to adjust that need. The fewer
objects needed locally, the faster the repo will be. Your concern about slow
commands is noted, but blindly downloading every file in history will also
slow the repo, due to the full size of the objects on disk.

I think there is merit to your back-filling history idea. There are likely
benefits to the "download everything missing" concept, but it would also be
good to design such a feature to have other custom knobs, such as:

* Get only "recent" history, perhaps with a "--since=<date>" kind of flag.
  This would walk commits only to a certain date, then find all missing
  blobs reachable from their root trees.
* Get only a "cone" of history. This could work especially well with
  sparse-checkout, but other pathspecs could be used to limit the walk.

...
> Of course, if unfiltering a large repo is impractical (and if it will
> remain so), then we will probably need to err on the side of
> generosity in the original clone - eg 1M instead of 200k as the blob
> filter, 3GB vs 2.5GB as the initial download - and remove the last
> line of the explanation! If unfiltering, or refiltering, were
> practical, then we would probably err on the side of
> fewer-blobs-by-default to optimize first download.

I'm glad that you have self-discovered a workaround to handle these cases.
If we had a refiltering feature, then you could even start with a blobless
clone to have an extremely fast initial clone, followed by a background job
that downloads the remaining objects.

> Over time, as we refactor the project itself to reduce the incidence
> of megafiles, I would expect to be able to drop the
> standard/recommended blob-size-limit too.

My experience working with large repos and partial clone is similar: the new
pain points introduced by these features make users aware of "repository
smells" in their organization, and they tend to self-correct by refactoring
the repository. This is a never-ending process as repos grow, especially
with many contributors.

Thank you for sharing your experience!
-Stolee
* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 13:40 ` Derrick Stolee
@ 2021-06-01 16:54   ` Tao Klerks
  2021-06-02  5:04     ` Tao Klerks
  0 siblings, 1 reply; 6+ messages in thread

From: Tao Klerks @ 2021-06-01 16:54 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git

On Tue, Jun 1, 2021 at 3:40 PM Derrick Stolee <stolee@gmail.com> wrote:

> > you want to be able to get something useful to the user as fast as
> > possible [...] but where a user might later (eg overnight) want to get the
> > rest of the repo, to avoid history inconsistency issues.
>
> As you describe below, the inconsistency is in terms of performance,
> not correctness. I thought it was worth a clarification.

Sorry, I was not clear here - I did not mean formal correctness nor
performance when referring to the incentive to get the rest of the repo. I
was referring to the fact that a medium-shallow clone (eg 15 months of a
20-year project) provides an inconsistent perspective on the code history:

* On the one hand, most of the time you have everything you need, and when
  you bump up against *available* history limits from a file or branch
  history view, it's reasonably clear that's what's happening (in some UI
  tools this is more explicit than in others).
* On the other hand, when you happen to look at something older, it is easy
  for the history to seem to "lie", showing changes made in a file by a
  person that really *didn't* make those changes. Their commit just happened
  to be selected as the shallow graft, and so seems to have "added" all the
  files in the project. This is reasonably intelligible when looking at file
  history, but extremely non-obvious when looking at git blame (in a
  medium-shallow clone).

> I'm aware that the first 'git blame' on a file is a bit slow in the
> partial clone case.
Without wanting to harp on about it, it can easily be pathologically slow:
in my case a random well-trafficked file has 300 in-scope commits, at 10
seconds per independent blob fetch - and so ends up taking an hour to git
blame (the first time for such a file, as you noted).

> It's been on my list for improvement whenever I
> get the "spare" time to do it. However, if someone else wants to work
> on it I will briefly outline the approach I was going to investigate:

One reason I wasn't asking about / angling for this, particularly, is that I
expect there will be other tools doing their own versions of this. I haven't
tested "tig" on this, for example, but I suspect it doesn't do a plain git
blame, given what I've seen of its instantly showing the file contents and
"gradually" filling in the authorship data. I for one rarely use plain git
blame; I don't know much about the usage patterns of other users. Most of
"my" users will be using IntelliJ IDEA, which seems to have a surprisingly
solid/scalable git integration (but I have not yet tested this case there).

There are also other related reasons to go for a "get most of the relevant
blobs across history" approach, specifically around tooling: there are lots
of tools & integrations that use git libraries (or even homebrew
implementations) rather than the git binaries / IPC, and many of those tend
to lag *far* behind in support for things like shallow clone, partial clone,
mailmap, core.splitIndex, replace refs, etc. My current beef is with Sublime
Merge, which is snappy as one could wish for, really lovely to use within
its scope, but doesn't have any idea what a promisor is, and simply says
"nah, no content here" when you look at a missing blob. (For the moment.)

> > the most "blameable"
> > files will tend to be the larger ones... :)
>
> I'm interested in this claim that 'the most "blameable" files will
> tend to be the larger ones.'
> I typically expect blame to be used on
> human-readable text files, and my initial reaction is that larger
> files are harder to use with 'git blame'.

Absolutely - I meant "the larger text/code files", not including other stuff
that tends to accumulate in the higher filesize brackets. I meant that I,
for one, in this project at least, often find myself using git blame (or
equivalent) to "spelunk" into who touched a specific line, in cases where
looking at the plain history is useless because there have been many
hundreds or thousands of changes - and in my limited experience, files with
that many reasons to change tend to be large.

> Your concern about slow commands is noted, but also blindly
> downloading every file in history will slow the repo due to the
> full size of the objects on disk.

I have in the past claimed that a "larger repo" (specifically, a deeper
clone that gets many larger blobs) is slower, but haven't actually found any
significant evidence to back my claim. Obviously something like "git gc"
will be slower, but is there anything in the practical day-to-day that cares
whether the commit depth is 10,000 commits or 200,000 commits for a given
branch, or whether you only have the blobs at the "tip" of the
branch/project, or all the blobs in history? (Besides GC, specifically.)

> it would be good to design such a feature to have other
> custom knobs, such as:
> * Get only "recent" history, perhaps with a "--since=<date>"
> kind of flag. This would walk commits only to a certain date,
> then find all missing blobs reachable from their root trees.

As long as you know at initial clone time that this is what you want,
combining shallow clone with partial clone already enables this today
(shallow clone, set up the filter, unshallow, and potentially remove the
filter). You can even do more complicated things like unshallowing with
different increasingly-aggressive filters in multiple steps/fetches over
different time periods.
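The shallow-then-unshallow-with-a-filter sequence described above might look
something like the following (a sketch under assumptions: the URL, the
15-month window, and the 200k blob limit are illustrative values from this
thread, and the exact config-before-unshallow dance may vary by git
version):

```shell
# Sketch: fast shallow clone first, then unshallow under a blob filter so
# the full commit/tree history arrives without most of the old blob content.
git clone --shallow-since="15 months ago" https://example.com/big-repo.git
cd big-repo

# Configure origin as a promisor remote with a blob filter before unshallowing.
git config remote.origin.promisor true
git config remote.origin.partialCloneFilter blob:limit=200k

# Fetch the remaining history; old blobs over 200k stay missing / on-demand.
git fetch --unshallow origin
```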
The main challenge that I perceive at the moment is that you're effectively
locked into "one shot". As soon as you've retrieved the commits with blobs
missing, "filling them in" at scale seems to be orders of magnitude more
expensive than an equivalent clone would have been.

> If we had a refiltering feature, then you could even
> start with a blobless clone to have an extremely fast initial
> clone, followed by a background job that downloads the remaining
> objects.

Yes please!

I think one thing that I'm not clearly understanding yet in this
conversation is whether the tax on explicit and specialized blob-list
fetching could be made much lower. As far as I can tell, in a blobless clone
with full trees we have most of the data one could want to decide what blobs
to request - paths, filetypes, and commit dates. This leaves three pain
points that I am aware of:

* Filesizes are not (afaik) available in a blobless clone. This sounds like
  a pretty deep limitation, which I'll gloss over.
* Blob paths are available in trees, but not trivially exposed by git
  rev-list - could a new "--missing" option value make sense? Or does it
  make just as much sense to expect the caller/scripter to iterate ls-tree
  outputs? (I assume doing so would be much slower, but have not tested.)
* Something about the "git fetch <remote> blob-hash ..." pattern seems to
  scale very poorly - is that something that might see change in future, or
  is it a fundamental issue?

Thanks again for the detailed feedback!
Tao
* Re: Removing Partial Clone / Filtered Clone on a repo
  2021-06-01 16:54 ` Tao Klerks
@ 2021-06-02  5:04   ` Tao Klerks
  0 siblings, 0 replies; 6+ messages in thread

From: Tao Klerks @ 2021-06-02 5:04 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git

I understand replying to myself is bad form, but I need to add a
correction/clarification to a statement I made below:

On Tue, Jun 1, 2021 at 6:54 PM Tao Klerks <tao@klerks.biz> wrote:

> > it would be good to design such a feature to have other
> > custom knobs, such as:
> > * Get only "recent" history, perhaps with a "--since=<date>"
> > kind of flag. This would walk commits only to a certain date,
> > then find all missing blobs reachable from their root trees.
>
> As long as you know at initial clone time that this is what you want,
> combining shallow clone with partial clone already enables this today
> (shallow clone, set up filter, unshallow, and potentially remove
> filter). You can even do more complicated things like unshallowing
> with different increasingly-aggressive filters in multiple
> steps/fetches over different time periods. The main challenge that I
> perceive at the moment is that you're effectively locked into "one
> shot". As soon as you've retrieved the commits with blobs missing,
> "filling them in" at scale seems to be orders of magnitude more
> expensive than an equivalent clone would have been.

As I just noted in another thread, there seems to be one extra step needed
to pull this off: you need to add a *.promisor file for the initial shallow
clone's packfile, because otherwise (at least with the 2.31 client that I am
using) later "git fetch" calls take forever doing something with rev-list
that I don't understand, presumably due to the relationship between promisor
packfiles and non-promisor packfiles...
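The extra `*.promisor` step described above might be sketched as follows (an
assumption-laden sketch: it simply creates an empty marker file next to
every existing pack, which is how git identifies promisor packs on disk):

```shell
# Sketch: mark the shallow clone's existing packfile(s) as promisor packs,
# so later partial-clone fetches don't try to walk into missing objects.
for pack in .git/objects/pack/pack-*.pack; do
  touch "${pack%.pack}.promisor"
done
```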