* Saving space/network on common repos
@ 2014-12-17  6:58 Craig Silverstein
  2014-12-17 22:01 ` Stefan Beller
  2014-12-17 22:32 ` Jonathan Nieder
  0 siblings, 2 replies; 9+ messages in thread
From: Craig Silverstein @ 2014-12-17  6:58 UTC (permalink / raw)
  To: git

At Khan Academy, we are running a Jenkins installation as our build
server.  By design, our Jenkins machine has several different
directories that each hold a copy of the same git repository.  (For
instance, Jenkins may be running tests on our repo at several
different commits at the same time.)  When Jenkins decides to run a
test -- I'm simplifying a bit -- it will pick one of the copies of the
repo, do a 'git fetch origin && git checkout <some commit>' and then
run the tests.

Our repo has a lot of churn and some big files, and this git fetch can
take a long time. I'd like to reduce both the time to fetch and the
disk space used by sharing objects between the repo copies.

My research has turned up three techniques that try to address this use case:
* git clone --reference
* git clone --shared
* git clone <local repo>, which creates hard links

I can probably use any of these approaches, but git clone --reference
would be the easiest to set up.  I would do so by creating a 'cache'
repo that exists only to serve as a reference and is not used in any
other way, so I wouldn't have to worry about the dangers with pruning,
accidentally deleting the repo, etc.

My big concern is that all these methods seem to just affect clone.  So:

Question 1) If I do 'git clone --reference', will the reference repo be
used for subsequent fetches as well?  What about 'git clone --shared'?
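
One way to see what 'git clone --reference' leaves behind is to look at
the clone's objects/info/alternates file.  The sketch below is only an
illustration: the repository names (cache, copy1) and the demo identity
are made up, and it uses a throwaway directory.  It shows that the
reference is recorded on disk rather than used once and forgotten, so
later object lookups in the clone can still consult it; how much a given
fetch then saves on the network is a separate question that this does
not settle.

```shell
work=$(mktemp -d)

# A throwaway "cache" repository with one commit (stand-in for the real repo).
git init -q "$work/cache"
git -C "$work/cache" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

# Clone it, borrowing objects from the cache instead of copying them.
git clone -q --reference "$work/cache" "$work/cache" "$work/copy1"

# The borrowing is recorded here and stays in place after the clone.
cat "$work/copy1/.git/objects/info/alternates"
```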

Question 2) If I git clone a local repo, will subsequent fetches also
create hard links?

Question 3) If the answer to any of the above is yes, how does this
work with packing?  Say I pack the reference repo (being careful not
to prune anything).  Will subsequent fetches still be able to get the
objects they need from the reference repo?

An added complication is submodules.  We have a submodule that is as
big and slow to fetch as our main repository.

Question 4) Is there a practical way to set up submodules so they can
use the same object-sharing framework that the main repo does?

I'm not keen on rewriting .gitmodules in each of my repos, so probably
something that uses info/alternates is the most workable.  I have a
scheme for setting that up that maybe will work, but it's a moot point
if info/alternates doesn't work for fetching.
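
For reference, the info/alternates mechanism itself can be set up by
hand on an existing repository: the file lists one object directory per
line.  A minimal sketch, with made-up repository names (cache, consumer)
in a throwaway directory; for a submodule the same file would presumably
live under .git/modules/<submodule>/objects/info/, which is an
assumption and not something tested here.

```shell
work=$(mktemp -d)

# The shared cache, with one commit in it.
git init -q "$work/cache"
git -C "$work/cache" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
commit=$(git -C "$work/cache" rev-parse HEAD)

# An unrelated, empty repository that will borrow from the cache.
git init -q "$work/consumer"
echo "$work/cache/.git/objects" >> "$work/consumer/.git/objects/info/alternates"

# The consumer can now read the cache's commit without having fetched it.
git -C "$work/consumer" cat-file -t "$commit"
```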

I'm wondering if the best approach for us might be to use
GIT_OBJECT_DIRECTORY: set GIT_OBJECT_DIRECTORY to the shared cached
directory for each of our repos, so they all fetch to the same place.

Question 5) I haven't seen this mentioned anywhere else, so I'm
guessing it won't work.  Am I missing a big problem?

Question 6) Will git be sad if two different repos that share an
object directory, both do 'git fetch' at the same time?  I could maybe
protect fetches with an flock, but jenkins can do git operations
behind my back so it would be easier if I didn't have to worry about
locking.
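
The flock idea could look roughly like the sketch below.  It assumes
flock(1) from util-linux is installed, and the lock file path is a
made-up example.  As noted above, it only protects fetches that go
through the wrapper, not git operations Jenkins runs on its own.

```shell
# Hypothetical lock file location; any path all jobs can see would do.
lock="${TMPDIR:-/tmp}/shared-objects-fetch.lock"

# Run a command while holding an exclusive lock on $lock.
locked() {
    flock "$lock" "$@"
}

# Usage inside one of the repo copies:
#   locked git fetch origin
```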

Question 7) Is GIT_OBJECT_DIRECTORY supposed to work with submodules?
In my experimentation, it looks like it doesn't: when I run
'GIT_OBJECT_DIRECTORY=../obj git submodule update --init' it still
puts the objects in .git/modules/<submodule>/objects/.  Is this a bug?
Is there any way to work around it?

Any suggestions would be appreciated!  It feels to me like this is
something that git should support pretty easily given its
architecture, but I just don't see a way to do it.

Thanks,
craig


* Re: Saving space/network on common repos
  2014-12-17  6:58 Saving space/network on common repos Craig Silverstein
@ 2014-12-17 22:01 ` Stefan Beller
  2014-12-17 22:32 ` Jonathan Nieder
  1 sibling, 0 replies; 9+ messages in thread
From: Stefan Beller @ 2014-12-17 22:01 UTC (permalink / raw)
  To: csilvers; +Cc: git

I am not sure if there was any improvement since then, but Junio
wrote about alternates 2 years ago
http://git-blame.blogspot.com/2012/08/bringing-bit-more-sanity-to-alternates.html


* Re: Saving space/network on common repos
  2014-12-17  6:58 Saving space/network on common repos Craig Silverstein
  2014-12-17 22:01 ` Stefan Beller
@ 2014-12-17 22:32 ` Jonathan Nieder
  2014-12-17 23:57   ` Craig Silverstein
  1 sibling, 1 reply; 9+ messages in thread
From: Jonathan Nieder @ 2014-12-17 22:32 UTC (permalink / raw)
  To: Craig Silverstein; +Cc: git, Nguyễn Thái Ngọc Duy

(+cc: Duy who wrote the recent 'checkout --to' patch series)
Hi Craig,

Craig Silverstein wrote:

>          By design, our Jenkins machine has several different
> directories that each hold a copy of the same git repository.  (For
> instance, Jenkins may be running tests on our repo at several
> different commits at the same time.)  When Jenkins decides to run a
> test -- I'm simplifying a bit -- it will pick one of the copies of the
> repo, do a 'git fetch origin && git checkout <some commit>' and then
> run the tests.

You might find 'git new-workdir' from contrib/workdir to be helpful.
It lets you attach multiple working copies to a single set of objects
and refs.

There's a patch series, currently in the pu ("proposed updates")
branch, that moves that functionality into core git through an
option "git checkout --to=<directory>", which creates a new workdir
for an existing repository.

[...]
> An added complication is submodules.  We have a submodule that is as
> big and slow to fetch as our main repository.
>
> Question 4) Is there a practical way to set up submodules so they can
> use the same object-sharing framework that the main repo does?

It's possible to do, but we haven't written a nice UI for it yet.
(In other words, you can do this by cloning with --no-recurse-submodules
and manually creating the submodule workdir in the appropriate place.
Later calls to "git submodule update" will do the right thing.)

Thanks for a useful example,
Jonathan


* Re: Saving space/network on common repos
  2014-12-17 22:32 ` Jonathan Nieder
@ 2014-12-17 23:57   ` Craig Silverstein
  2014-12-18  0:07     ` Jonathan Nieder
  0 siblings, 1 reply; 9+ messages in thread
From: Craig Silverstein @ 2014-12-17 23:57 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Nguyễn Thái Ngọc

On Wed, Dec 17, 2014 at 2:32 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> You might find 'git new-workdir' from contrib/workdir to be helpful.
> It lets you attach multiple working copies to a single set of objects
> and refs.

Thanks!  That does indeed sound promising -- like a more principled
version of my GIT_OBJECT_DIRECTORY suggestion.

>> Question 4) Is there a practical way to set up submodules so they can
>> use the same object-sharing framework that the main repo does?
>
> It's possible to do, but we haven't written a nice UI for it yet.
> (In other words, you can do this by cloning with --no-recurse-submodules
> and manually creating the submodule workdir in the appropriate place.

Hmm, let me see if I understand you right -- you're suggesting that
when cloning my reference repo, I do
    git clone --no-recurse-submodules <my repo>
    for (path, url) in `parse-.gitmodules`: git clone url path
# this is pseudocode, obviously :-)

and then when I want to create a new workdir, I do something like:
    cd reference_repo
    git new-workdir /var/workspace1
    for (path, url) in `parse-.gitmodules`: cd path && git new-workdir /var/workspace1/path

?  Basically, I'm going back to the old git way of having each
submodule have its own .git directory, rather than having it have a
.git file with a 'gitdir' entry.  Am I understanding this right?

Also, it seems to me there's the possibility, with git-new-workdir, that
if several of the workspaces try to fetch at the same time they could
step on each other's toes.  Is that a problem?  I know there's a push
lock but I don't believe there's a fetch lock, and I could imagine git
getting unhappy if two fetches happened in the same repo at the same
time.

craig


* Re: Saving space/network on common repos
  2014-12-17 23:57   ` Craig Silverstein
@ 2014-12-18  0:07     ` Jonathan Nieder
  2014-12-23  1:00       ` Craig Silverstein
  0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Nieder @ 2014-12-18  0:07 UTC (permalink / raw)
  To: Craig Silverstein; +Cc: git, Nguyễn Thái Ngọc Duy

Craig Silverstein wrote:
> On Wed, Dec 17, 2014 at 2:32 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Craig Silverstein wrote:

>>> Question 4) Is there a practical way to set up submodules so they can
>>> use the same object-sharing framework that the main repo does?
>>
>> It's possible to do, but we haven't written a nice UI for it yet.
>> (In other words, you can do this by cloning with --no-recurse-submodules
>> and manually creating the submodule workdir in the appropriate place.
>
> Hmm, let me see if I understand you right -- you're suggesting that
> when cloning my reference repo, I do
>     git clone --no-recurse-submodules <my repo>
>     for (path, url) in `parse-.gitmodules`: git clone url path
> # this is pseudocode, obviously :-)
>
> and then when I want to create a new workdir, I do something like:
>     cd reference_repo
>     git new-workdir /var/workspace1
>     for (path, url) in `parse-.gitmodules`: cd path && git new-workdir /var/workspace1/path
>
> ?  Basically, I'm going back to the old git way of having each
> submodule have its own .git directory, rather than having it have a
> .git file with a 'gitdir' entry.  Am I understanding this right?

Basically.  The initial clone can still use --recurse-submodules.
When you create a new workdir you'd create new workdirs for the
submodules by hand.

A 'git submodule foreach' command in the initial repo can take
care of the `parse-.gitmodules` part.
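
As a different route to the same information, one that does not need the
submodules to be checked out, the paths can be read straight out of
.gitmodules with 'git config --file'.  This is a sketch of that
alternative, not of 'git submodule foreach' itself; the submodule name
and URL in the sample file are made up.

```shell
work=$(mktemp -d)

# Illustrative .gitmodules with a made-up submodule name and URL.
cat > "$work/.gitmodules" <<'EOF'
[submodule "mysubmodule"]
	path = mysubmodule
	url = https://example.com/mysubmodule.git
EOF

# List every submodule path recorded in the file, one per line.
# Each output line is "<key> <value>"; cut keeps only the value.
paths=$(git config --file "$work/.gitmodules" \
            --get-regexp '^submodule\..*\.path$' | cut -d' ' -f2-)
echo "$paths"

# With submodules checked out, the foreach form would be along the lines of:
#   git submodule foreach --quiet 'echo "$sm_path"'
```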

[...]
> Also, it seems to me there's the possibility, with git-new-workdir, that if
> several of the workspaces try to fetch at the same time they could
> step on each others' toes.  Is that a problem?  I know there's a push
> lock but I don't believe there's a fetch lock, and I could imagine git
> getting unhappy if two fetches happened in the same repo at the same
> time.

That's a good question.  If concurrent fetches cause trouble then I'd
consider it a bug (it's not too different from multiple concurrent
pushes to the same repository, which is a very common thing to do),
but I haven't looked carefully into whether such bugs exist.

Hopefully someone else can chime in.

Thanks,
Jonathan


* Re: Saving space/network on common repos
  2014-12-18  0:07     ` Jonathan Nieder
@ 2014-12-23  1:00       ` Craig Silverstein
  2014-12-23  1:33         ` Jonathan Nieder
  2014-12-23  3:12         ` Jonathan Nieder
  0 siblings, 2 replies; 9+ messages in thread
From: Craig Silverstein @ 2014-12-23  1:00 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Nguyễn Thái Ngọc

btw, just FYI, the scheme you lay out here doesn't actually work
as-is.  The problem is the config file, which has an entry like:
           worktree = ../../../mysubmodule
This depends on the config file living in
.git/modules/mysubmodule/config.  But the proposed scheme moves the
config file to mysubmodule/.git/config, and the relative path is
broken.
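
To make the breakage concrete: a relative core.worktree is resolved
against the git directory that holds the config.  The sketch below only
mimics the two directory layouts with empty directories (no real
repositories are involved, and "super" is a made-up name for the
superproject) and resolves the same relative path from each.

```shell
top=$(mktemp -d)
mkdir -p "$top/super/.git/modules/mysubmodule" "$top/super/mysubmodule/.git"
rel=../../../mysubmodule

# From the original git dir, the path lands on the submodule's work tree.
from_modules=$(cd "$top/super/.git/modules/mysubmodule" && cd "$rel" && pwd -P)
echo "from .git/modules/mysubmodule: $from_modules"

# From mysubmodule/.git, the same path climbs out of the superproject
# and points at a directory that does not exist.
if (cd "$top/super/mysubmodule/.git" && cd "$rel" 2>/dev/null); then
    relocated=resolves
else
    relocated=broken
fi
echo "from mysubmodule/.git: $relocated"
```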

I'm not sure what the best solution is; the cleanest one requires a
pretty substantial rewrite of git-new-workdir (not that it's such a
giant piece of code), and separating out new_workdir from new_gitdir.
The smallest one involves having some way to suppress the final 'git
checkout -f' (which is the only thing in this script that needs the
worktree entry to resolve somewhere) to allow for post-script cleanup.

craig


* Re: Saving space/network on common repos
  2014-12-23  1:00       ` Craig Silverstein
@ 2014-12-23  1:33         ` Jonathan Nieder
  2014-12-23  3:12         ` Jonathan Nieder
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Nieder @ 2014-12-23  1:33 UTC (permalink / raw)
  To: Craig Silverstein; +Cc: git, Nguyễn Thái Ngọc

Craig Silverstein wrote:

> btw, just FYI, the scheme you lay out here doesn't actually work
> as-is.  The problem is the config file, which has an entry like:
>            worktree = ../../../mysubmodule
> This depends on the config file living in
> .git/modules/mysubmodule/config.  But the proposed scheme moves the
> config file to mysubmodule/.git/config, and the relative path is
> broken.

*puzzled* Can you give a transcript illustrating this happening?

Submodules with .git directory within their worktree or under
.git/modules/ are both supposed to work.  And in either case, having
refs/ and objects/ as symlinks should work fine.  When git new-workdir
creates a new workdir, it has its own new and separate config file, so
I don't think that is the source of trouble.

Thanks,
Jonathan


* Re: Saving space/network on common repos
  2014-12-23  1:00       ` Craig Silverstein
  2014-12-23  1:33         ` Jonathan Nieder
@ 2014-12-23  3:12         ` Jonathan Nieder
  2014-12-23  5:36           ` Craig Silverstein
  1 sibling, 1 reply; 9+ messages in thread
From: Jonathan Nieder @ 2014-12-23  3:12 UTC (permalink / raw)
  To: Craig Silverstein; +Cc: git, Nguyễn Thái Ngọc Duy

Craig Silverstein wrote:

> btw, just FYI, the scheme you lay out here doesn't actually work
> as-is.  The problem is the config file, which has an entry like:
>            worktree = ../../../mysubmodule
> This depends on the config file living in
> .git/modules/mysubmodule/config.  But the proposed scheme moves the
> config file to mysubmodule/.git/config, and the relative path is
> broken.

As was pointed out to me privately, the behavior is exactly as you
described and I had confused myself by looking at a directory that
wasn't even made with git-new-workdir.  Sorry for the nonsense.

Workdirs share a single config file because information associated with
branches, set by "git branch --set-upstream-to", "git branch
--edit-description", "git remote", and so on, is stored in the config
file.

The 'git checkout --to' series in "pu" avoids this problem by ignoring
core.bare and core.worktree in worktrees created with 'git checkout --to'.
To try it:

	git clone https://kernel.googlesource.com/pub/scm/git/git
	cd git
	git merge 'origin/pu^{/nd/multiple-work-trees}^2'
	make
	PATH=$(pwd)/bin-wrappers:$PATH

	git checkout --to=../experiment next

This seems like good motivation to try to get that series in good
shape and release it soon.
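
A note for readers on a newer Git: as far as I know, this series was
later released as the 'git worktree' command (Git 2.5), so
'git checkout --to=<dir> <branch>' would now be spelled
'git worktree add <dir> <branch>'.  A minimal sketch on a throwaway
repository, with made-up directory names and a demo identity:

```shell
work=$(mktemp -d)

# A small repository with one commit and a 'next' branch.
git init -q "$work/main"
git -C "$work/main" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$work/main" branch next

# A second work tree sharing the same objects and refs, with 'next' checked out.
git -C "$work/main" worktree add "$work/experiment" next >/dev/null 2>&1

# Confirm which branch the new work tree is on.
git -C "$work/experiment" rev-parse --abbrev-ref HEAD
```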

Thanks again,
Jonathan


* Re: Saving space/network on common repos
  2014-12-23  3:12         ` Jonathan Nieder
@ 2014-12-23  5:36           ` Craig Silverstein
  0 siblings, 0 replies; 9+ messages in thread
From: Craig Silverstein @ 2014-12-23  5:36 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Nguyễn Thái Ngọc

> This seems like good motivation to try to get that series in good
> shape and release it soon.

I was going to spend some time tomorrow (if I can find any :-) )
trying to fix up the contrib script to work with submodules, or at
least the kind that we use.  Is that something that's worth the time
to do, or would we be better off just waiting for the work-tree stuff
to get released?  If I do end up doing it, would you be interested in
a pull request (or however patches are submitted in the git world)?

craig

