Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions

From: ZheNing Hu <adlternative@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Derrick Stolee <derrickstolee@github.com>,
	Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Victoria Dye <vdye@github.com>,
	Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>,
	Matheus Tavares <matheus.bernardino@usp.br>
Subject: Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
Date: Sat, 15 Oct 2022 10:17:37 +0800
Message-ID: <CAOLTT8R0MxEWErrw80-F+b1higbuWuQjvkEGi2c4ARzuRzeNWw@mail.gmail.com>
In-Reply-To: <CABPp-BFwiMrgm+_sO6TsLUj77r_krgzYEWZanbyx2Fnn4rM8tg@mail.gmail.com>

On Thu, Oct 6, 2022 at 15:53, Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > I am not sure if these ideas are feasible.
> >
> > On Wed, Sep 28, 2022 at 13:38, Elijah Newren <newren@gmail.com> wrote:
> > >
> [...]
> > > > There's nothing Git can do to help those engineers that do cross-tree
> > > > work.
> > >
> > > I'm going to partially disagree with this, in part because of our
> > > experience with many inter-module dependencies that evolve over time.
> > > Folks can start on a certain module and begin refactoring.  Being
> > > > aware that their changes will affect other areas of the code, they can
> > > do a search (e.g. "git grep --cached ..." to find cases outside their
> > > current sparse checkout), and then selectively unsparsify to get the
> > > relevant few dozen (or maybe even few hundred) modules added.  They
> > > aren't switching to a dense checkout, just a less sparse one.  When
> > > they are done, they may narrow their sparse specification again.  We
> > > have a number of users doing cross-tree work who are using
> > > sparse-checkouts, and who find it productive and say it still speeds
> > > up their local build/test cycles.
> > >
> > > So, I'd say that ensuring Git supports behavior B well in
> > > sparse-checkouts, is something Git can do to help out both some of the
> > > engineers doing cross-tree work, and some of the engineers that are
> > > doing cross-tree testing.
> > >
> > > (For full disclosure, we also have users doing cross-tree work using
> > > regular dense checkouts and I agree there's not a lot we can do to
> > > help them.)
> > >
> >
> > Let me guess where the cross tree users using sparse-checkout are
> > getting their revenue from:
>
> Is "revenue" perhaps a case of auto-correct choosing the wrong word?
>

s/revenue/benefits

> > 1. they don't have to download the entire repository of blobs at once
> > 2. their working tree can be easily resized.
> > 3. they could have something like sparse-index to optimize the performance
> > of git commands.
>
> These correspond to partial clone, sparse-checkout, and sparse-index.
> I think these 3 features and the various work done to support them,
> plus submodule (which is a different kind of solution) are the
> features Git provides to work with repository subsets.  Some
> repositories (especially the big monorepos like the Microsoft ones)
> will benefit from using all three of these features.  Others might
> only want to use one or two of them.
>

Here I am just amazed that cross-tree users can shorten the
test/build cycle when using only sparse-checkout. So these benefits
don't come from the three conjectures above: not partial clone, not
sparse-index, not frequently resizing the working tree.

> As an example, the repository where we first applied sparse-checkouts
> to (and which had the complicated dependencies) does not use partial
> clones or a sparse-index.   While partial clone and sparse-index might
> help a little, the .git directory for a full clone is merely 2G, and
> there are less than 100K entries in the index.  However,
> sparse-checkout helps out a lot.
>

Yes, you explain well here that we don't necessarily need to apply
all of these features. But I am still a little confused: where do
the time savings come from? Is it from git checkout taking less time?
Or from avoiding some unnecessary working tree scans during
test/build?

> > But it's still worth worrying about the size of the git repository blobs,
> > even if it's just only blobs in mono-repo's HEAD, that may also be too big
> > for the user's local area to handle.
> >
> > Perhaps it would make more sense to place this integration testing work on
> > a remote server.
> >
> > I am not sure if these ideas are feasible:
> >
> > 1. mount the large git repo on the server to local.
> > 2. just ssh to a remote server to run integration tests.
> > 3. use an external tool to run integration tests on the remote server.
>
> Are you suggesting #1 as a way for just handling the git history, or
> also for handling the worktree with some kind of virtual file system
> where not all files are actually written locally?  If you're only
> talking about the history, then you're kind of going on a tangent
> unrelated to this document.  If you're talking about worktrees and
> virtual file systems, then Git proper doesn't have anything of the
> sort currently.  There are at least two solutions in this space --
> Microsoft's Git-VFS (which I think they are phasing out) and Google's
> similar virtual file system -- but I'm not currently particularly
> interested in either one.
>

Here I mean git nfs, or some kind of git virtual file system, or some
git workspace. I don't really understand why they are being phased out
now.

> #3 is precisely what we did first (except "*a* remote server" rather
> than "*the* remote server").  I think I called it out in the email
> you're responding to; it's often good enough for many people.
> However, sometimes those tests fail and people want to run locally so
> it's easier to inspect.  Or they just want to be able to run locally
> anyway.  So, while #3 helped, it wasn't good enough.
>

Agreed, testing locally is sometimes necessary.

> #2 is also something we did.  Using tools like Coder or GitHub
> codespaces or other offerings in that area, you can provide developers
> a nice beefy box with good network connectivity to the main Git
> repository, on which they can do development and running of tests.
> Then developers can connect to such machines from a variety of
> different external locations.  Works great for some people...but build
> times and ability of IDEs to handle the code base are still an issue,
> so doing smarter things with sparse-checkouts is still important.
> And, even if #2 works for some people, others still want to develop
> and run integration tests on their (beefy) laptops.
>

Agreed here too.

> All three of these, as far as I can tell, are just things that
> individual teams set up and aren't anything that would affect Git's
> development one way or another.
>
>
> However, I'll note that while we internally definitely did two of the
> three things you suggested here, it wasn't a complete enough solution
> for us and sparse-checkout adoption was still pretty minimal at that
> point.  So, we went back to our sparse-checkouts and asked how we
> could modify the build system to allow us to not check out the in-tree
> dependencies of the things we are tweaking, but still get a correct
> build and allow us to run tests.  Once we got that working, we finally
> really unlocked the value of sparse checkouts for us (both improving
> things for developers on laptops, and for developers on the
> development box in the cloud).  It went from very few folks using
> sparse checkouts with that repository, to being the default and
> recommended usage at that point.
>

Yeah, I'm a big believer in sparse-checkout and partial clone. They are
good features, but not many people realize that they can use them.

> While the build changes were internal things we did, I think that the
> underlying usage scenario matters to Git development because it helps
> inform how sparse-checkout can be used.  In particular, it suggests
> why some sparse-checkout users may be interested in finding results
> for files that do not match their sparse-checkout patterns -- in-tree
> dependencies may not necessarily be checked out, but those are related
> enough to the code that developers are working on, that developers are
> still potentially interested in using e.g. "git grep" or "git log -p"
> to find out information about code or changes in those other areas.
> (And, of course, developers are also potentially interested in finding
> out what other code depends on what they are changing, but I suspect
> folks were already aware of that usecase.)  It's certainly not the
> only usecase, but it's an additional one that I didn't think was quite
> reflected in Stolee's description of why users would want searches to
> turn up results for files not found in their working tree.
>

Some users may really want to focus only on their subprojects, so I think
"git log -p" shouldn't show files that don't match the sparse-checkout
patterns, and neither should "git grep". But some users may need to search
globally. I think those people are in the minority, so maybe there should
be a "git log -p --scope=all" or "git grep --scope=all" for them.

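To illustrate the idea, here is a rough Python sketch of such a scope
filter, assuming a cone-mode sparse-checkout. No such scope option exists
in Git today, and the helper names below (in_cone, filter_hits) are
hypothetical. The cone matching is simplified from how I understand cone
mode: top-level files and files directly inside a parent directory of a
cone are included, and everything below a cone directory is included.

```python
from pathlib import PurePosixPath


def in_cone(path, cone_dirs):
    """Return True if `path` falls inside a simplified cone-mode spec."""
    parent = PurePosixPath(path).parent
    if str(parent) == ".":
        # Files at the top level of the repository are always included.
        return True
    for cone in map(PurePosixPath, cone_dirs):
        # The file lives in the cone directory or anywhere below it.
        if parent == cone or cone in parent.parents:
            return True
        # The file sits directly inside an ancestor directory of the cone.
        if parent in cone.parents:
            return True
    return False


def filter_hits(hits, cone_dirs, scope="sparse"):
    """Keep only hits inside the sparse spec, unless scope is "all"."""
    if scope == "all":
        return list(hits)
    if scope != "sparse":
        raise ValueError(f"unknown scope: {scope!r}")
    return [hit for hit in hits if in_cone(hit, cone_dirs)]


# Hypothetical search results and sparse specification.
hits = [
    "README.md",
    "modA/src/main.c",
    "modB/util.c",
    "libs/common/log.c",
    "libs/README",
]
cone_dirs = ["modA", "libs/common"]

sparse_hits = filter_hits(hits, cone_dirs)          # default: restricted
all_hits = filter_hits(hits, cone_dirs, scope="all")  # opt-in: everything
```

With the default scope, only "modB/util.c" is dropped here; the "all"
scope returns every hit unchanged.
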
> > > > The only thing I can think about is that the diffstat might want to show
> > > > the stats for the conflicted files, in which case that's an important
> > > > perspective on the distinction from --restrict.
> > >
> > > We only show the diffstat on a successful merge, so there's no
> > > diffstat to show if there are any conflicted files.
> > >
> >
> > Sorry, I have some questions here: how does git merge know there are
> > no conflicts without downloading the blobs?
>
> Not sure how that's related to the above, but to answer your question:
>

Ah, this relates to my earlier question in [1]. At first I thought it
was git merge itself that caused the extra blob downloads. In the end,
it turned out to be caused by the diffstat shown at the end of the merge...

> Sometimes merge has to download blobs to know if there are conflicts
> or not.  But only sometimes.  Since tree objects have the hashes of
> the blobs, having the tree objects is sufficient to determine which
> side(s) of history modified each path.
>
> If both sides of history modified the same file, then you *might* have
> conflicts, and you indeed need the blobs to verify.  But if only one
> side of history modified a file and the other left it alone, then
> there is no conflict.

I think I get it. For example, the tree of user1's HEAD has an entry
"a4e1fc out/file1" whose hash is the same as in the merge base, because
the file is outside the sparse-checkout specification and user1 never
touched it. User1 then fetches a commit from user2 whose tree has an
entry "13f91e out/file1". git merge doesn't need to look at the contents
of the file here, because only one side changed it.
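
To check my understanding, here is a minimal Python sketch of that
decision, assuming each tree is flattened into a dict from path to blob
hash. The function name and data layout are my own illustration, not how
Git's merge machinery is implemented, and the sketch ignores renames,
file modes and directory/file conflicts. The hashes are made up.

```python
def classify_paths(base, ours, theirs):
    """Classify each path using only the blob hashes of three trees.

    Each argument maps path -> blob hash; a missing path means the file
    does not exist in that tree. Returns path -> (action, may_need_blobs).
    """
    result = {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            # Both sides agree: unchanged, or changed identically.
            result[path] = ("take either", False)
        elif o == b:
            # Only their side changed the path: no contents needed.
            result[path] = ("take theirs", False)
        elif t == b:
            # Only our side changed the path: no contents needed.
            result[path] = ("take ours", False)
        else:
            # Both sides changed it differently: contents may be needed.
            result[path] = ("both changed", True)
    return result


base = {"in/file0": "77aa01", "out/file1": "a4e1fc", "in/file2": "111111"}
ours = {"in/file0": "c0ffee", "out/file1": "a4e1fc", "in/file2": "222222"}
theirs = {"in/file0": "77aa01", "out/file1": "13f91e", "in/file2": "333333"}

merged = classify_paths(base, ours, theirs)
```

Here "out/file1" resolves to "take theirs" from the hashes alone, and
only "in/file2" is flagged as possibly needing blob contents.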

Thanks for your answers!

[1]: https://lore.kernel.org/git/CABPp-BEBB1oqdVcXrWwMAdtb0TwHZvr-6KDa210j5ncw54Di_g@mail.gmail.com/