From: ZheNing Hu <adlternative@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Derrick Stolee <derrickstolee@github.com>,
	Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Victoria Dye <vdye@github.com>,
	Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>,
	Matheus Tavares <matheus.bernardino@usp.br>
Subject: Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
Date: Sat, 15 Oct 2022 10:17:37 +0800
Message-ID: <CAOLTT8R0MxEWErrw80-F+b1higbuWuQjvkEGi2c4ARzuRzeNWw@mail.gmail.com>
In-Reply-To: <CABPp-BFwiMrgm+_sO6TsLUj77r_krgzYEWZanbyx2Fnn4rM8tg@mail.gmail.com>

Elijah Newren <newren@gmail.com> wrote on Thu, Oct 6, 2022 at 15:53:
>
> On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > I am not sure if these ideas are feasible.
> >
> > Elijah Newren <newren@gmail.com> wrote on Wed, Sep 28, 2022 at 13:38:
> > >
> [...]
> > > > There's nothing Git can do to help those engineers that do cross-tree
> > > > work.
> > >
> > > I'm going to partially disagree with this, in part because of our
> > > experience with many inter-module dependencies that evolve over time.
> > > Folks can start on a certain module and begin refactoring.  Being
> > > aware that their changes will affect other areas of the code, they can
> > > do a search (e.g. "git grep --cached ..." to find cases outside their
> > > current sparse checkout), and then selectively unsparsify to get the
> > > relevant few dozen (or maybe even few hundred) modules added.  They
> > > aren't switching to a dense checkout, just a less sparse one.  When
> > > they are done, they may narrow their sparse specification again.  We
> > > have a number of users doing cross-tree work who are using
> > > sparse-checkouts, and who find it productive and say it still speeds
> > > up their local build/test cycles.
> > >
> > > So, I'd say that ensuring Git supports behavior B well in
> > > sparse-checkouts, is something Git can do to help out both some of the
> > > engineers doing cross-tree work, and some of the engineers that are
> > > doing cross-tree testing.
> > >
> > > (For full disclosure, we also have users doing cross-tree work using
> > > regular dense checkouts and I agree there's not a lot we can do to
> > > help them.)
> > >
> >
> > Let me guess where the cross tree users using sparse-checkout are
> > getting their revenue from:
>
> Is "revenue" perhaps a case of auto-correct choosing the wrong word?
>

s/revenue/benefits

> > 1. they don't have to download the entire repository of blobs at once
> > 2. their working tree can be easily resized.
> > 3. they could have something like sparse-index to optimize the performance
> > of git commands.
>
> These correspond to partial clone, sparse-checkout, and sparse-index.
> I think these 3 features and the various work done to support them,
> plus submodule (which is a different kind of solution) are the
> features Git provides to work with repository subsets.  Some
> repositories (especially the big monorepos like the Microsoft ones)
> will benefit from using all three of these features.  Others might
> only want to use one or two of them.
>

Here I am just amazed that cross-tree users can shorten their
test/build cycle using only sparse-checkout. So these benefits
don't come from my three conjectures above: not partial clone, not
sparse-index, and not frequently resizing the working tree.

> As an example, the repository where we first applied sparse-checkouts
> to (and which had the complicated dependencies) does not use partial
> clones or a sparse-index.   While partial clone and sparse-index might
> help a little, the .git directory for a full clone is merely 2G, and
> there are less than 100K entries in the index.  However,
> sparse-checkout helps out a lot.
>

Yes, that's a good explanation: we don't necessarily need to apply all
of these features. But I'm still a little confused: where do the time
savings come from? Is it from faster git checkout? Or from avoiding
unnecessary working tree scans at test/build time?

> > But it's still worth worrying about the size of the git repository blobs,
> > even if it's just only blobs in mono-repo's HEAD, that may also be too big
> > for the user's local area to handle.
> >
> > Perhaps it would make more sense to place this integration testing work on
> > a remote server.
> >
> > I am not sure if these ideas are feasible:
> >
> > 1. mount the large git repo on the server to local.
> > 2. just ssh to a remote server to run integration tests.
> > 3. use an external tool to run integration tests on the remote server.
>
> Are you suggesting #1 as a way for just handling the git history, or
> also for handling the worktree with some kind of virtual file system
> where not all files are actually written locally?  If you're only
> talking about the history, then you're kind of going on a tangent
> unrelated to this document.  If you're talking about worktrees and
> virtual file systems, then Git proper doesn't have anything of the
> sort currently.  There are at least two solutions in this space --
> Microsoft's Git-VFS (which I think they are phasing out) and Google's
> similar virtual file system -- but I'm not currently particularly
> interested in either one.
>

Here I mean git nfs, some kind of git virtual file system, or some
git workspace. I don't really understand why they are being phased
out now?

> #3 is precisely what we did first (except "*a* remote server" rather
> than "*the* remote server").  I think I called it out in the email
> you're responding to; it's often good enough for many people.
> However, sometimes those tests fail and people want to run locally so
> it's easier to inspect.  Or they just want to be able to run locally
> anyway.  So, while #3 helped, it wasn't good enough.
>

Agreed, testing locally is sometimes necessary.

> #2 is also something we did.  Using tools like Coder or GitHub
> codespaces or other offerings in that area, you can provide developers
> a nice beefy box with good network connectivity to the main Git
> repository, on which they can do development and running of tests.
> Then developers can connect to such machines from a variety of
> different external locations.  Works great for some people...but build
> times and ability of IDEs to handle the code base are still an issue,
> so doing smarter things with sparse-checkouts is still important.
> And, even if #2 works for some people, others still want to develop
> and run integration tests on their (beefy) laptops.
>

Agreed here too.

> All three of these, as far as I can tell, are just things that
> individual teams set up and aren't anything that would affect Git's
> development one way or another.
>
>
> However, I'll note that while we internally definitely did two of the
> three things you suggested here, it wasn't a complete enough solution
> for us and sparse-checkout adoption was still pretty minimal at that
> point.  So, we went back to our sparse-checkouts and asked how we
> could modify the build system to allow us to not check out the in-tree
> dependencies of the things we are tweaking, but still get a correct
> build and allow us to run tests.  Once we got that working, we finally
> really unlocked the value of sparse checkouts for us (both improving
> things for developers on laptops, and for developers on the
> development box in the cloud).  It went from very few folks using
> sparse checkouts with that repository, to being the default and
> recommended usage at that point.
>

Yeah, I'm a big believer in sparse-checkout and partial clone. They are
good features, but not many people realize that they can use them.

> While the build changes were internal things we did, I think that the
> underlying usage scenario matters to Git development because it helps
> inform how sparse-checkout can be used.  In particular, it suggests
> why some sparse-checkout users may be interested in finding results
> for files that do not match their sparse-checkout patterns -- in-tree
> dependencies may not necessarily be checked out, but those are related
> enough to the code that developers are working on, that developers are
> still potentially interested in using e.g. "git grep" or "git log -p"
> to find out information about code or changes in those other areas.
> (And, of course, developers are also potentially interested in finding
> out what other code depends on what they are changing, but I suspect
> folks were already aware of that usecase.)  It's certainly not the
> only usecase, but it's an additional one that I didn't think was quite
> reflected in Stolee's description of why users would want searches to
> turn up results for files not found in their working tree.
>

Some users may really want to focus only on their subprojects, so I think
"git log -p" shouldn't show files that don't match the sparse-checkout
patterns, and neither should "git grep". But some users may need to search
globally. I think those people are in the minority, so maybe there should
be a "git log -p --scope=all" or "git grep --scope=all" for them.
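
To make the idea concrete, here is a rough Python sketch of the filtering
I have in mind. It is not how Git implements anything: the `scope`
parameter only mirrors the option I am proposing (no such option exists
in Git as far as I know), the function names are hypothetical, and the
directory matching is a simplification of cone mode, which also keeps
files directly in the root and in parent directories.

```python
from typing import Iterable, List


def in_sparse_dirs(path: str, sparse_dirs: Iterable[str]) -> bool:
    """Return True if `path` lies under one of the listed directories.

    Simplified: only recursive directory matches are considered.
    """
    for d in sparse_dirs:
        prefix = d.rstrip("/") + "/"
        if path.startswith(prefix):
            return True
    return False


def filter_results(paths: Iterable[str],
                   sparse_dirs: Iterable[str],
                   scope: str = "sparse") -> List[str]:
    """Restrict result paths to the sparse directories unless scope is "all".

    `scope` is a hypothetical knob mirroring the proposed command-line option.
    """
    if scope not in ("sparse", "all"):
        raise ValueError(f"unknown scope: {scope!r}")
    dirs = list(sparse_dirs)
    if scope == "all":
        return list(paths)
    return [p for p in paths if in_sparse_dirs(p, dirs)]


# Hypothetical search hits from three areas of a monorepo.
hits = ["sub/a.c", "other/b.c", "subproject/c.c"]
print(filter_results(hits, ["sub"]))               # default: restricted
print(filter_results(hits, ["sub"], scope="all"))  # opt-in: everything
```

With the default scope only "sub/a.c" is reported; note that
"subproject/c.c" is not matched by the directory "sub", because the
comparison is done on whole path components.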

> > > > The only thing I can think about is that the diffstat might want to show
> > > > the stats for the conflicted files, in which case that's an important
> > > > perspective on the distinction from --restrict.
> > >
> > > We only show the diffstat on a successful merge, so there's no
> > > diffstat to show if there are any conflicted files.
> > >
> >
> > Sorry, I have some questions here: how does git merge know there are
> > no conflicts without downloading the blobs?
>
> Not sure how that's related to the above, but to answer your question:
>

Ah, this question relates to my previous question in [1]. At first I
thought it was git merge itself that caused the extra blob downloads.
In the end, it turned out to be caused by the diffstat shown at the end
of the merge...

> Sometimes merge has to download blobs to know if there are conflicts
> or not.  But only sometimes.  Since tree objects have the hashes of
> the blobs, having the tree objects is sufficient to determine which
> side(s) of history modified each path.
>
> If both sides of history modified the same file, then you *might* have
> conflicts, and you indeed need the blobs to verify.  But if only one
> side of history modified a file and the other left it alone, then
> there is no conflict.

I think I probably get it. For example, the tree of user1's HEAD has a
tree entry "a4e1fc out/file1", which has the same SHA-1 as the blob in
the merge base, because the file is outside the sparse-checkout
specification and user1 never touched it. Then user1 fetches a commit
from user2 whose tree has the entry "13f91e out/file1". Here git merge
doesn't need to look at the contents of the file, because only one side
changed it.
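
To check my understanding, here is a minimal sketch in Python. It is not
Git's actual merge code: it assumes each tree is flattened into a
hypothetical mapping from path to blob hash, and it ignores renames and
file/directory conflicts. The name classify_paths, the entry "in/file2"
and its hashes are made up for illustration.

```python
from typing import Dict

# Hypothetical stand-in for a flattened tree object: path -> blob hash.
Tree = Dict[str, str]


def classify_paths(base: Tree, ours: Tree, theirs: Tree) -> Dict[str, str]:
    """Decide per path what a three-way merge needs, using hashes only."""
    result = {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            # Both sides agree; either nobody changed it or both made
            # the identical change. No blob contents are needed.
            result[path] = "unchanged" if o == b else "same change on both sides"
        elif o == b:
            # Only their side modified the path: take their version.
            result[path] = "take theirs"
        elif t == b:
            # Only our side modified the path: keep our version.
            result[path] = "keep ours"
        else:
            # Both sides changed it differently: contents are required
            # to attempt a merge and find out whether it conflicts.
            result[path] = "needs blobs"
    return result


base = {"out/file1": "a4e1fc", "in/file2": "111111"}
ours = {"out/file1": "a4e1fc", "in/file2": "222222"}    # user1
theirs = {"out/file1": "13f91e", "in/file2": "333333"}  # user2

for path, action in classify_paths(base, ours, theirs).items():
    print(path, "->", action)
```

In this sketch "out/file1" is resolved as "take theirs" from the hashes
alone, while "in/file2" is the only path whose blobs would be needed.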

Thanks for your answers!

[1]: https://lore.kernel.org/git/CABPp-BEBB1oqdVcXrWwMAdtb0TwHZvr-6KDa210j5ncw54Di_g@mail.gmail.com/
