* Avery Pennarun's git-subtree?
@ 2010-07-21 17:15 Bryan Larsen
  2010-07-21 19:43 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 58+ messages in thread
From: Bryan Larsen @ 2010-07-21 17:15 UTC (permalink / raw)
  To: git

I've been using Avery Pennarun's git-subtree 
(http://github.com/apenwarr/git-subtree) for a while now and have been 
finding it very useful and problem-free.

Git submodules have been particularly problematic for me on a project 
which contains submodules which contain submodules.  git-subtree "just 
works", without any futzing.

We've also had problems with less git-savvy users dropping patches 
because the changes were made inside of a submodule.

It would be really nice if git-subtree became a part of git.  Avery 
has submitted git-subtree in the past and has indicated a willingness to 
do so again if there was a good chance of acceptance.

Avery's announcement of v0.3 is also informative: 
http://kerneltrap.org/mailarchive/git/2010/2/4/22366

thank you,
Bryan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-21 17:15 Avery Pennarun's git-subtree? Bryan Larsen
@ 2010-07-21 19:43 ` Ævar Arnfjörð Bjarmason
  2010-07-21 19:56   ` Avery Pennarun
  0 siblings, 1 reply; 58+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-21 19:43 UTC (permalink / raw)
  To: Bryan Larsen; +Cc: git

On Wed, Jul 21, 2010 at 17:15, Bryan Larsen <bryan.larsen@gmail.com> wrote:
> I've been using Avery Pennarun's git-subtree
> (http://github.com/apenwarr/git-subtree) for a while now and have been
> finding it very useful and problem-free.
>
> Git submodules have been particularly problematic for me on a project which
> contains submodules which contain submodules.  git-subtree "just works",
> without any futzing.
>
> We've also had problems with less git savvy users dropping patches because
> they've occurred inside of a module.

What sort of workflows do you find bad with git-submodule that are
better with git-subtree?

The submodule concept is simple, but a lot of the implementation is
bad IMO. It doesn't integrate well, e.g. you have to remember to do
git clone --recursive, or git clone and git submodule update --init
after that, submodules don't remember what branch you wanted, so git
submodule foreach 'git pull' doesn't DWYM (although I have a hack for
that) etc.
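
The extra steps described above can be reproduced in a scratch directory. This is only a sketch under stated assumptions: the repositories "lib" and "app" are hypothetical, everything lives under a temporary directory, and `protocol.file.allow=always` is passed only because the demo uses local paths (recent git versions refuse file-based submodule clones by default).

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
work=$(mktemp -d)

# Hypothetical subproject.
git init -q "$work/lib"
echo 'lib code' > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m 'lib: initial commit'

# Hypothetical superproject that uses lib as a submodule.
git init -q "$work/app"
cd "$work/app"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
git commit -q -m 'app: add lib as a submodule'

# A plain clone leaves lib/ as an empty directory...
git clone -q "$work/app" "$work/plain"
ls -A "$work/plain/lib"

# ...until the extra step is run.
git -C "$work/plain" -c protocol.file.allow=always submodule --quiet update --init
ls -A "$work/plain/lib"

# The one-step alternative that has to be remembered at clone time.
git -c protocol.file.allow=always clone -q --recursive "$work/app" "$work/recursive"

# Either way the submodule checkout is on a detached HEAD, not a branch.
if git -C "$work/recursive/lib" symbolic-ref -q HEAD >/dev/null; then
    echo 'lib is on a branch'
else
    echo 'lib is on a detached HEAD'
fi
```

The detached HEAD at the end is why a blanket `git submodule foreach 'git pull'` has no branch to work with unless something else records one.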

I've also wondered if we couldn't just store all the heads .gitmodules
point to inside the main .git repository, and just git gc them when
submodules are removed.

I'd planned to maybe submit patches to fix some of these UI issues,
knowing about more of them would help. I also haven't tried
git-subtree.

* Re: Avery Pennarun's git-subtree?
  2010-07-21 19:43 ` Ævar Arnfjörð Bjarmason
@ 2010-07-21 19:56   ` Avery Pennarun
  2010-07-21 20:36     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 58+ messages in thread
From: Avery Pennarun @ 2010-07-21 19:56 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Bryan Larsen, git, Junio C Hamano

On Wed, Jul 21, 2010 at 3:43 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> What sort of workflows do you find bad with git-submodule that are
> better with git-subtree?
>
> The submodule concept is simple, but a lot of the implementation is
> bad IMO. It doesn't integrate well, e.g. you have to remember to do
> git clone --recursive, or git clone and git submodule update --init
> after that, submodules don't remember what branch you wanted, so git
> submodule foreach 'git pull' doesn't DWYM (although I have a hack for
> that) etc.

In my experience, there is exactly one killer problem with submodules
that people are looking to solve with git-subtree:

Branching.

If you have a random developer in your office and they need to make a
patch to one of your subprojects in the course of making their main
project work, with submodules this requires incredibly error-prone
contortions involving branching both projects, making sure you have
push access to both projects, learning how to use git-submodule, etc.
And then merging that branch into someone else's branch is
complicated, particularly if they've also applied their own changes to
the subproject.

With git-subtree, that developer just commits the changes to the
merged project - and that's it.  Then you or someone else, who knows
how git-subtree works, at any point in the future, can submit the
subproject changes upstream, or not, as appropriate.
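
As a rough illustration of that workflow, the sketch below assumes the subproject's files already live under a hypothetical `vendor/lib/` directory of the merged project. It does not run git-subtree itself (which may not be installed); it only shows that the developer's side is an ordinary commit, and that the commits touching the subproject can be found later.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
work=$(mktemp -d)

git init -q "$work/app"
cd "$work/app"
mkdir -p vendor/lib
echo 'app v1' > app.txt
echo 'lib v1' > vendor/lib/lib.txt
git add .
git commit -q -m 'app with lib merged in under vendor/lib'

# A developer fixes the app and the subproject in one ordinary commit.
# No second repository, second branch, or extra push access is involved.
echo 'app v2' > app.txt
echo 'lib v2' > vendor/lib/lib.txt
git commit -q -a -m 'Fix app, patching lib along the way'

# Later, whoever handles upstreaming can list the commits that touched lib.
git log --oneline -- vendor/lib
```

Extracting those commits into a history that the subproject's upstream can accept is the job of git-subtree's split command.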

No amount of bugfixing in git submodule can fix this workflow, because
it's not a result of bugs.  (The bugs, particularly the
disconnected-by-default HEADs on submodule checkouts, do make it a bit
worse :( )  It would require a fundamental redesign to make this work
nicely with submodules.

git-subtree is certainly a fundamental redesign.  Arguably there might
be even better ways to design it, of course.  And submodules are good
for certain other situations that git-subtree isn't, so it's obviously
not a one-for-one replacement.

If we can get some kind of consensus in principle that git-subtree is
a good idea to merge into git core, I can prepare some patches and we
can talk about the details.

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-21 19:56   ` Avery Pennarun
@ 2010-07-21 20:36     ` Ævar Arnfjörð Bjarmason
  2010-07-21 21:09       ` Avery Pennarun
  0 siblings, 1 reply; 58+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-21 20:36 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Bryan Larsen, git, Junio C Hamano

On Wed, Jul 21, 2010 at 19:56, Avery Pennarun <apenwarr@gmail.com> wrote:

> No amount of bugfixing in git submodule can fix this workflow, because
> it's not a result of bugs.  (The bugs, particularly the
> disconnected-by-default HEADs on submodule checkouts, do make it a bit
> worse :( )  It would require a fundamental redesign to make this work
> nicely with submodules.

I think most of those can be fixed, actually. The only requirement
that the git plumbing imposes on git-submodules is that a "commit"
entry exist in your tree, the rest is just (ugly plumbing).

Thus, we could:

   * Hack git-submodule (or its replacement) to import the tree
     that contains that "commit" into one central .git

   * Fix git status / git commit so that you could commit into
     submodules, i.e.:

     for each submodule in this-commit:
         chdir $submodule && commit
     done && cd $root && commit -m"bumping submodules"

   * Make git-push push the submodule contents and the
     superprojects. You'd just need to have commit access to the url
     listed in .gitmodules.

What's missing from that (which would be nice) is the ability to check
out a subdirectory from another repository. That could (I think) be
done by just adding a normal "tree" entry, and then specifying that
that tree can be found in git://... instead of the main tree.

> If we can get some kind of consensus in principle that git-subtree is
> a good idea to merge into git core, I can prepare some patches and we
> can talk about the details.

From having looked at it briefly it looks very nice. But it looks to
me as if the main differences between git-submodule and git-subtree
are in the porcelain, not the plumbing.

It would be a lot less confusing to users of Git in the long term if
we would at least try to unify these two approaches instead of having
two mutually incompatible ways of doing essentially the same thing.

* Re: Avery Pennarun's git-subtree?
  2010-07-21 20:36     ` Ævar Arnfjörð Bjarmason
@ 2010-07-21 21:09       ` Avery Pennarun
  2010-07-21 21:20         ` Avery Pennarun
                           ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-21 21:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Bryan Larsen, git, Junio C Hamano

On Wed, Jul 21, 2010 at 4:36 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 19:56, Avery Pennarun <apenwarr@gmail.com> wrote:
>> No amount of bugfixing in git submodule can fix this workflow, because
>> it's not a result of bugs.  (The bugs, particularly the
>> disconnected-by-default HEADs on submodule checkouts, do make it a bit
>> worse :( )  It would require a fundamental redesign to make this work
>> nicely with submodules.
> [...]
> I think most of those can be fixed, actually. The only requirement
> that the git plumbing imposes on git-submodules is that a "commit"
> entry exist in your tree, the rest is just (ugly plumbing).

Sure.  But this commit object (and the objects it points to) are never
automatically pushed, fetched, or fsck'd.  They're second class
citizens.  As it turns out, this was a major design mistake in
implementing the submodule commit objects.
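
To make the "second class citizen" point concrete, here is a small sketch with hypothetical repositories "app" and "sub". It records a nested repository as a 'commit' entry without going through git-submodule at all (so there is no .gitmodules), then checks whether the parent actually holds the commit it refers to.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
work=$(mktemp -d)

git init -q "$work/app"
cd "$work/app"
echo 'app' > app.txt
git add app.txt

# An independent repository nested at sub/.
git init -q sub
echo 'sub' > sub/sub.txt
git -C sub add sub.txt
git -C sub commit -q -m 'sub: initial commit'
sub_commit=$(git -C sub rev-parse HEAD)

# Adding a directory that is itself a repository records a 'commit' entry
# (mode 160000), not the files inside it. The embedded-repo warning is hidden.
git add sub 2>/dev/null
git commit -q -m 'app: record sub as a commit entry'
git ls-tree HEAD

# The commit that entry names is not in app's object store...
if git cat-file -e "$sub_commit" 2>/dev/null; then
    echo 'sub commit found in app'
else
    echo 'sub commit is absent from app'
fi

# ...and fsck does not treat the missing object as a problem.
git fsck
```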

All the behaviour people *currently* get from submodules could have
been obtained without using a new 'commit' object type at all.  Just
add a commitid to the horrible junk (including repo URLs, argh) that
already needs to get pasted into .gitmodules, and have git-commit at
the top level update .gitmodules automatically (as it currently
updates the 'commit' tree entries).  Problem solved (at least, solved
to exactly the extent that it is today).

What we *really* want is a way to have git actually recurse through
commit objects when doing *any* operation, as if they were tree
objects.  If we had that, submodules could be beautiful (because you'd
push them to the same repo, etc and users would see none of the
complexity).  But this doesn't exist.  And for backward compatibility
at this point, we'd probably need to introduce an entirely new kind of
tree entry to support such a thing.

> Thus, we could:
>
>   * Hack git-submodule (or its replacement) to check import the tree
>     that contains that "commit" into one central .git

This part is relatively easy, I think - at least in concept, although
I bet there would be widespread implementation tweaks - and would
clean up a lot of the mess.  However it would require a change to the
.git/index file format to remember when a subdir is a commit and not a
"normal" tree so that it doesn't silently commit the next thing as a
tree instead.

>   * Fix git status / git commit so that you could commit into
>     submodules, i.e.:
>
>     for each submodule in this-commit:
>         chdir $submodule && commit
>     done && cd $root && commit -m"bumping submodules"

After making the earlier change to get rid of the extra .git subdirs,
this next requirement would actually be considerably more work,
because 'git commit' would need to know how to update a subcommit
without changing HEAD.  You certainly couldn't just code it up as a
recursive "git commit" as you imply (and as you could do right now).

>   * Make git-push push the submodule contents and the
>     superprojects. You'd just need to have commit access to the url
>     listed in .gitmodules.

This is really a *killer* problem, and you're making it sound easy.
Let's imagine that my app has 25 different submodules - not
unreasonable at all in a world with dozens of ever-changing ruby gems
and suchlike.

Now, if I want to branch my project, I might have to branch 25
projects just so I can push my changes?  It's totally awful.  And the
awfulness is multiplied many times over if .gitmodules has hard-coded
repo paths, because then I have to update the repo path in my branch
but not the other branch, and merging will have conflicts.  You might
think that my .git/config could just override .gitmodules, but then
some guy trying to fetch my branch will fail to fetch the submodules
from my branch and get errors and have no idea what's going on.

And you might think that using relative repo paths in .gitmodules
would work, but that's only if I branched all 25 submodules in the
*first* place.  In real life, most subprojects point at the original
project's home repo by default (because nobody thinks they'll be
patching 25 subprojects when they start, and they're probably right),
but then you have to individually change the URLs when you decide you
need to patch them, and life gets complicated and ugly, especially
when the next guy goes to fork your project and now needs to fork some
subprojects but not others.

There is no good solution to the submodule problem if each submodule
has to go in its own repo.  I've been thinking about this for years
now, and watching lots of discussions about it on the git mailing
list, and I just can't see any other option.  All the submodules have
to get pushed to and fetched from the same repo by default.  Anything
else is insane.

One option might be to store the submodule commit refs as refs in your
superproject.  That wouldn't actually be so bad, except for the
aforementioned problem that fetch/push/clone/etc don't actually trace
through commit objects when deciding what objects to send you, so
fetching the ref of the superproject wouldn't autofetch the subproject
refs.  Also, you could accidentally delete one of the subproject refs
and lose tons of history without ever realizing it.  That's error
prone and confusing... and clutters up your repo refs list with
administrative stuff you didn't actually want in the first place.

> What's missing from that (which would be nice) is the ability to check
> out a subdirectory from another repository. That could (I think) be
> done by just adding a normal "tree" entry, and then specifying that
> that tree can be found in git://... instead of the main tree.

Actually that's already easy with submodules (and git-subtree makes it
easy too, though slightly different).  Just fetch the commit from the
other repo, and do:

   git checkout FETCH_HEAD -- subdirname

>> If we can get some kind of consensus in principle that git-subtree is
>> a good idea to merge into git core, I can prepare some patches and we
>> can talk about the details.
>
> From having looked at it briefly it looks very nice. But it looks to
> me as if the main differences between git-submodule and git-subtree
> are in the porcelain, not the plumbing.

No.  The fundamental difference is exactly one: git-subtree uses
normal 'tree' entries (rather than commits) in its trees, so that all
the git tools recurse through them like any other tree.  Thus you
don't need any extra refs, extra .git dirs, etc.  That allows you to
bypass all the useless behaviour git has around 'commit' entries.
This is very much a plumbing difference.
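
A sketch of what that means in practice, using hypothetical repositories "lib" and "app". It uses plain plumbing rather than git-subtree (which, as far as I understand it, also joins the subproject's history; that part is skipped here), and only shows that a normal 'tree' entry travels with an ordinary clone.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
work=$(mktemp -d)

git init -q "$work/lib"
echo 'lib code' > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m 'lib: initial commit'

git init -q "$work/app"
cd "$work/app"
echo 'app' > app.txt
git add app.txt
git commit -q -m 'app: initial commit'

# Bring lib's tree in as an ordinary subdirectory of app.
git fetch -q "$work/lib" HEAD
git read-tree --prefix=lib/ -u FETCH_HEAD
git commit -q -m 'app: import lib as a plain tree'

# lib is listed with mode 040000 and type "tree", like any other directory.
git ls-tree HEAD

# Any clone gets the files with no extra refs, .git dirs, or commands.
git clone -q "$work/app" "$work/copy"
cat "$work/copy/lib/lib.txt"
```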

The git-submodule porcelain happens to independently be kind of
annoying and inconvenient, but that would be much easier to fix if it
weren't for the plumbing-related problems.

> It would be a lot less confusing to users of Git in the long term if
> we would at least try to unify these two approaches instead of having
> two mutually incompatible ways of doing essentially the same thing.

True.  But I don't have the time, and implementing the new 'commit'
entry semantics sounds like a lot of work (as opposed to arguing about
them, which I guess I'm good at but which seems unproductive).

In productive terms: git-subtree is solving problems for real users
right now.  It might solve more problems for more users if it were
integrated into the core and thus made "official."  Nothing precludes
making submodules better later.

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-21 21:09       ` Avery Pennarun
@ 2010-07-21 21:20         ` Avery Pennarun
  2010-07-21 22:46         ` Jens Lehmann
  2010-07-21 23:46         ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-21 21:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Bryan Larsen, git, Junio C Hamano

On Wed, Jul 21, 2010 at 5:09 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> All the submodules have
> to get pushed to and fetched from the same repo by default.  Anything
> else is insane.

...and just to clarify, by far the least insane option here is to have
the whole thing all under a single ref, which is currently impossible
with submodules.

>> What's missing from that (which would be nice) is the ability to check
>> out a subdirectory from another repository. That could (I think) be
>> done by just adding a normal "tree" entry, and then specifying that
>> that tree can be found in git://... instead of the main tree.
>
> Actually that's already easy with submodules (and git-subtree makes it
> easy too, though slightly different).  Just fetch the commit from the
> other repo, and do:
>
>   git checkout FETCH_HEAD -- subdirname

Sorry, that's not right.  You can use this instead for roughly the
effect you want:

    git read-tree --prefix subdirname FETCH_HEAD: && git checkout subdirname
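
For anyone who wants to try the corrected command, here is a self-contained sketch. The repositories "other" and "app" are hypothetical. It spells the option as `--prefix=subdirname/` with a trailing slash and adds `--` to the checkout, which are conservative choices on my part rather than something the command above requires.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid
work=$(mktemp -d)

# The "other repo" whose content should appear under subdirname/.
git init -q "$work/other"
echo 'other content' > "$work/other/file.txt"
git -C "$work/other" add file.txt
git -C "$work/other" commit -q -m 'other: initial commit'

git init -q "$work/app"
cd "$work/app"
echo 'app' > app.txt
git add app.txt
git commit -q -m 'app: initial commit'

# Fetch the commit from the other repo; this sets FETCH_HEAD.
git fetch -q "$work/other" HEAD

# Stage the fetched commit's root tree ("FETCH_HEAD:") under subdirname/ ...
git read-tree --prefix=subdirname/ FETCH_HEAD:

# ... then materialize the staged files in the working tree.
git checkout -- subdirname

git status --short
```

The first attempt (`git checkout FETCH_HEAD -- subdirname`) looks for a path called subdirname inside the fetched commit, which is why it does not do the job here: the fetched repo has no such path.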

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-21 21:09       ` Avery Pennarun
  2010-07-21 21:20         ` Avery Pennarun
@ 2010-07-21 22:46         ` Jens Lehmann
  2010-07-22  1:09           ` Avery Pennarun
  2010-07-21 23:46         ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 58+ messages in thread
From: Jens Lehmann @ 2010-07-21 22:46 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano

Am 21.07.2010 23:09, schrieb Avery Pennarun:
> What we *really* want is a way to have git actually recurse through
> commit objects when doing *any* operation, as if they were tree
> objects.

This would not be useful for every work flow (or to put it in other
words: this is not what I *really* want ;-). And as you pointed
out, that only works when you have a single repo you are working
against (like you do in your subtree model).

But unless I got something wrong (which might very well be the
case, as I never have used subtree myself), all changes to the
subtree will only show up in that single repo, unless you actively
push them somewhere else. And that, it seems to me, is as easy to
forget as it is right now to forget to push a submodule commit (one you
already recorded and pushed in the superproject). So am I wrong
assuming that subtree is more focused on a single repo containing
all commits which /might/ then be shared, while submodules are
about /always/ sharing code via their own repo?


> There is no good solution to the submodule problem if each submodule
> has to go in its own repo.  I've been thinking about this for years
> now, and watching lots of discussions about it on the git mailing
> list, and I just can't see any other option.  All the submodules have
> to get pushed to and fetched from the same repo by default.  Anything
> else is insane.

I have to object here. Your insanity is someone else's work flow ;-)
And I would be the last to deny that there are some severe
usability warts still to be fixed for submodules (I put up a - not
necessarily complete - list at
http://wiki.github.com/jlehmann/git-submod-enhancements/ ). And
myself and others are actively working on them (the next bigger
thing after a new config option about when to consider a submodule
modified are recursive checkouts, so that "git submodule update"
will hopefully be almost obsolete in the near future).

* Re: Avery Pennarun's git-subtree?
  2010-07-21 21:09       ` Avery Pennarun
  2010-07-21 21:20         ` Avery Pennarun
  2010-07-21 22:46         ` Jens Lehmann
@ 2010-07-21 23:46         ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 58+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-21 23:46 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Bryan Larsen, git, Junio C Hamano

On Wed, Jul 21, 2010 at 21:09, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Wed, Jul 21, 2010 at 4:36 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> On Wed, Jul 21, 2010 at 19:56, Avery Pennarun <apenwarr@gmail.com> wrote:
>>> No amount of bugfixing in git submodule can fix this workflow, because
>>> it's not a result of bugs.  (The bugs, particularly the
>>> disconnected-by-default HEADs on submodule checkouts, do make it a bit
>>> worse :( )  It would require a fundamental redesign to make this work
>>> nicely with submodules.
>> [...]
>> I think most of those can be fixed, actually. The only requirement
>> that the git plumbing imposes on git-submodules is that a "commit"
>> entry exist in your tree, the rest is just (ugly plumbing).
>
> Sure.  But this commit object (and the objects it points to) are never
> automatically pushed, fetched, or fsck'd.  They're second class
> citizens.  As it turns out, this was a major design mistake in
> implementing the submodule commit objects.
>
> All the behaviour people *currently* get from submodules could have
> been obtained without using a new 'commit' object type at all.  Just
> add a commitid to the horrible junk (including repo URLs, argh) that
> already needs to get pasted into .gitmodules, and have git-commit at
> the top level update .gitmodules automatically (as it currently
> updates the 'commit' tree entries).  Problem solved (at least, solved
> to exactly the extent that it is today).

Yeah, that does sound better than the current mess.

> What we *really* want is a way to have git actually recurse through
> commit objects when doing *any* operation, as if they were tree
> objects.  If we had that, submodules could be beautiful (because you'd
> push them to the same repo, etc and users would see none of the
> complexity).  But this doesn't exist.  And for backward compatibility
> at this point, we'd probably need to introduce an entirely new kind of
> tree entry to support such a thing.
>
>> Thus, we could:
>>
>>   * Hack git-submodule (or its replacement) to check import the tree
>>     that contains that "commit" into one central .git
>
> This part is relatively easy, I think - at least in concept, although
> I bet there would be widespread implementation tweaks - and would
> clean up a lot of the mess.  However it would require a change to the
> .git/index file format to remember when a subdir is a commit and not a
> "normal" tree so that it doesn't silently commit the next thing as a
> tree instead.
>
>>   * Fix git status / git commit so that you could commit into
>>     submodules, i.e.:
>>
>>     for each submodule in this-commit:
>>         chdir $submodule && commit
>>     done && cd $root && commit -m"bumping submodules"
>
> After making the earlier change to get rid of the extra .git subdirs,
> this next requirement would actually be considerably more work,
> because 'git commit' would need to know how to update a subcommit
> without changing HEAD.  You certainly couldn't just code it up as a
> recursive "git commit" as you imply (and as you could do right now).
>
>>   * Make git-push push the submodule contents and the
>>     superprojects. You'd just need to have commit access to the url
>>     listed in .gitmodules.
>
> This is really a *killer* problem, and you're making it sound easy.
> Let's imagine that my app has 25 different submodules - not
> unreasonable at all in a world with dozens of ever-changing ruby gems
> and suchlike.
>
> Now, if I want to branch my project, I might have to branch 25
> projects just so I can push my changes?  It's totally awful.  And the
> awfulness is multiplied many times over if .gitmodules has hard-coded
> repo paths, because then I have to update the repo path in my branch
> but not the other branch, and merging will have conflicts.  You might
> think that my .git/config could just override .gitmodules, but then
> some guy trying to fetch my branch will fail to fetch the submodules
> from my branch and get errors and have no idea what's going on.
>
> And you might think that using relative repo paths in .gitmodules
> would work, but that's only if I branched all 25 submodules in the
> *first* place.  In real life, most subprojects point at the original
> project's home repo by default (because nobody thinks they'll be
> patching 25 subprojects when they start, and they're probably right),
> but then you have to individually change the URLs when you decide you
> need to patch them, and life gets complicated and ugly, especially
> when the next guy goes to fork your project and now needs to fork some
> subprojects but not others.
>
> There is no good solution to the submodule problem if each submodule
> has to go in its own repo.  I've been thinking about this for years
> now, and watching lots of discussions about it on the git mailing
> list, and I just can't see any other option.  All the submodules have
> to get pushed to and fetched from the same repo by default.  Anything
> else is insane.

Yeah, bundling the submodules in the upstream repo so only one person
ever has to worry about gathering them up and pushing them to the
central repo sounds better for most uses than the current submodule
implementation.

OTOH, I have some submodules that I track on GitHub that would really
inflate the size of the repo that's tracking them. So there are
definitely use cases for having the tree somewhere remotely as well,
especially for large submodules like game art, which some people have
reported using submodules for.

> One option might be to store the submodule commit refs as refs in your
> superproject.  That wouldn't actually be so bad, except for the
> aforementioned problem that fetch/push/clone/etc don't actually trace
> through commit objects when deciding what objects to send you, so
> fetching the ref of the superproject wouldn't autofetch the subproject
> refs.  Also, you could accidentally delete one of the subproject refs
> and lose tons of history without ever realizing it.  That's error
> prone and confusing... and clutters up your repo refs list with
> administrative stuff you didn't actually want in the first place.
>
>> What's missing from that (which would be nice) is the ability to check
>> out a subdirectory from another repository. That could (I think) be
>> done by just adding a normal "tree" entry, and then specifying that
>> that tree can be found in git://... instead of the main tree.
>
> Actually that's already easy with submodules (and git-subtree makes it
> easy too, though slightly different).  Just fetch the commit from the
> other repo, and do:
>
>   git checkout FETCH_HEAD -- subdirname
>
>>> If we can get some kind of consensus in principle that git-subtree is
>>> a good idea to merge into git core, I can prepare some patches and we
>>> can talk about the details.
>>
>> From having looked at it briefly it looks very nice. But it looks to
>> me as if the main differences between git-submodule and git-subtree
>> are in the porcelain, not the plumbing.
>
> No.  The fundamental difference is exactly one: git-subtree uses
> normal 'tree' entries (rather than commits) in its trees, so that all
> the git tools recurse through them like any other tree.  Thus you
> don't need any extra refs, extra .git dirs, etc.  That allows you to
> bypass all the useless behaviour git has around 'commit' entries.
> This is very much a plumbing difference.
>
> The git-submodule porcelain happens to independently be kind of
> annoying and inconvenient, but that would be much easier to fix if it
> weren't for the plumbing-related problems.
>
>> It would be a lot less confusing to users of Git in the long term if
>> we would at least try to unify these two approaches instead of having
>> two mutually incompatible ways of doing essentially the same thing.
>
> True.  But I don't have the time, and implementing the new 'commit'
> entry semantics sounds like a lot of work (as opposed to arguing about
> them, which I guess I'm good at but which seems unproductive).
>
> In productive terms: git-subtree is solving problems for real users
> right now.  It might solve more problems for more users if it were
> integrated into the core and thus made "official."  Nothing precludes
> making submodules better later.

Sure, don't get me wrong. git-subtree looks very useful, and I have no
objection to having it in git.git. Even if it's not optimal for
everything, good working software now shouldn't be held up by some
theoretical pie-in-the-sky system.

* Re: Avery Pennarun's git-subtree?
  2010-07-21 22:46         ` Jens Lehmann
@ 2010-07-22  1:09           ` Avery Pennarun
       [not found]             ` <m31vavn8la.fsf@localhost.localdomain>
  0 siblings, 1 reply; 58+ messages in thread
From: Avery Pennarun @ 2010-07-22  1:09 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano

On Wed, Jul 21, 2010 at 6:46 PM, Jens Lehmann <Jens.Lehmann@web.de> wrote:
> Am 21.07.2010 23:09, schrieb Avery Pennarun:
>> What we *really* want is a way to have git actually recurse through
>> commit objects when doing *any* operation, as if they were tree
>> objects.
>
> This would not be useful for every work flow (or to put it in other
> words: this is not what I *really* want ;-). And as you pointed
> out, that only works when you have a single repo you are working
> against (like you do in your subtree model).

But you see, the utter failure of the way git-submodule works is that
it required a change to the git repository format, but that repository
format change resulted in absolutely *zero* improvement.

The tree object of the parent points at 'commit xxxx'.  But everything
in git has been *specially modified* to *just ignore* that 'commit
xxxx'.  It would have given exactly the same functionality - and much
less confusingly - if .gitmodules would just include the desired
commitid of the child project.  You could still have the same 'git
submodule' command with the same syntax and semantics.  And it
wouldn't have bastardized the git repo format.

It would have been just as good to just dump something into your
Makefile to go 'git clone' the subprojects from somewhere before
building.  Seriously, it would be one or two lines of code; all of
git-submodule replaces about one or two lines of code in your
Makefile.  And you know what?  If I just used that one or two lines of
code, I'd have all sorts of flexibility in where the subprojects get
cloned from, which I currently don't have, and which is the insanity
that drove me to write git-subtree in the first place.

HOWEVER

I'm not saying we can change that now.  I'm not suggesting that this
feature can be safely removed or changed at all.  Furthermore, I
totally agree that having large subprojects *not* be in your repo is
sometimes a good idea.  I just think it was actually a bad idea to
intrusively add support to git to implement this when it could have
been done without modifying git at all.

I also believe that the vast majority of people who use git-submodules
would rather have it work differently.  (Again, this is not to
subtract functionality.  The existing functionality is useful
sometimes.)

> But unless I got something wrong (which might very well be the
> case, as I never have used subtree myself), all changes to the
> subtree will only show up in that single repo, unless you actively
> push them somewhere else. And that, it seems to me, is as easy to
> forget as you can right now forget to push a submodules commit you
> already recorded and pushed in the superproject). So am I wrong
> assuming that subtree is more focused on a single repo containing
> all commits which /might/ then be shared, while submodules are
> about /always/ sharing code via their own repo?

Yes, this is absolutely intentional.  It also matches exactly with
everything else in the git repo philosophy!

I make my own clone.  I mess with it, I fiddle with it, I make 17
clones on my local machine, I throw away what I don't like, I pull,
I merge, I rebase, and then *eventually* I submit *some* of my patches
upstream.  git-subtree lets you do all those things.  git-submodule
stomps on you repeatedly if you try.

To wit:

- cloning a local supermodule on my local machine to another copy:
every call to 'git submodule update' re-downloads submodule repos from
the remote machine, because the submodule path is hardcoded to point
at a remote machine.  Better still, if I've modified any of my
subprojects without pushing changes upstream, the clone will fail,
because the new copy of the superproject will have no access to my
subproject's patches.  (If .gitmodules supplies a relative path, it's
even worse, because my 'origin' in the new copy is now pointing to a
local folder, not a remote one, and all the submodules don't exist
there.)

- branching a local supermodule on my local machine: fails to branch
the submodule automatically and makes it super easy to lose patches
altogether (since by default, they're committed to a detached HEAD).

- pulling/merging: always causes a conflict if local and remote have
modified the same submodule.

- rebasing: always causes a conflict if local and remote have modified
the same submodule.  Also requires you to rebase submodules separately
from the supermodule.  (Yes, this happens often in real life.)

- submitting upstream: requires me to have a separate repo that's a
copy of the upstream repo, and to manage at least one subrepo branch
for every superproject branch, just to track my submissions.  With
git-subtree, no extra repos are necessary.
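
The detached-HEAD point in the list above can be reproduced with plain git, no submodule required. This is a sketch in a throwaway repository; the detached state is created by hand here to imitate what a submodule checkout looks like:

```shell
#!/bin/sh
# Sketch only: a commit made on a detached HEAD is not on any branch.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "base"
branch=$(git symbolic-ref --short HEAD)

# Imitate the state a submodule checkout leaves you in: a detached HEAD.
git checkout -q --detach
git commit -q --allow-empty -m "local fix made inside the submodule"
lost=$(git rev-parse HEAD)

# Switching back to a branch leaves the fix unreferenced by any branch.
git checkout -q "$branch"
echo "branches containing $lost: [$(git branch --contains "$lost")]"
```

The commit object still exists and can be recovered through the reflog, but nothing that gets pushed or merged refers to it.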

It's very clear that git-submodule's current behaviour totally
mismatches the entire git philosophy.  That's why it's so impossible
to make the git-submodule command usable.

Another mental exercise: try to think of any other part of git where
it would be considered remotely acceptable to put the absolute or
relative URL of one repo inside another repo.  git URLs are an
implementation detail of clone/fetch/push/pull.  The *content* that
git manages should not have to deal with that stuff.  With
git-submodule, it has to.  With git-subtree, it doesn't.

>> There is no good solution to the submodule problem if each submodule
>> has to go in its own repo.  I've been thinking about this for years
>> now, and watching lots of discussions about it on the git mailing
>> list, and I just can't see any other option.  All the submodules have
>> to get pushed to and fetched from the same repo by default.  Anything
>> else is insane.
>
> I have to object here. Your insanity is someone else's work flow ;-)

Sorry.  I was being a little hyperbolic.  Some people might want to
use multiple repos for certain things - but I believe those people are
much more rare than the kind who want to do it my way.  And
furthermore, even those people would probably actually like it better
if *most* of their subprojects - the smallish ones - could be all in
one repo.

Even if you like multiple repos, I'm sure you don't like being
*forced* to manually fork multiple repos just to fork a single
superproject.  I'm sure you don't like updating .gitmodules to change
the absolute URL of a submodule, and then getting merge conflicts when
someone else had to do the same thing.  There's no way you like that.
If you like that, then you really are insane. :)

> And I am the last one not to admit that there are some severe
> usability warts still to be fixed for submodules (I put up a - not
> necessarily complete - list at
> http://wiki.github.com/jlehmann/git-submod-enhancements/ ). And
> myself and others are actively working on them (the next bigger
> thing after a new config option about when to consider a submodule
> modified are recursive checkouts, so that "git submodule update"
> will hopefully be almost obsolete in the near future).

I don't believe you can fix git-submodule by fixing surface warts.
It's fundamentally broken.  Since we're stuck with supporting the
current behaviour until the end of time, fixing the surface warts might
be necessary and even mildly helpful.  It will also be soul sucking
since no matter how hard you try, people will still hate the result.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
       [not found]             ` <m31vavn8la.fsf@localhost.localdomain>
@ 2010-07-22 18:23               ` Bryan Larsen
  2010-07-24 22:36                 ` Jakub Narebski
  2010-07-22 19:41               ` Avery Pennarun
  1 sibling, 1 reply; 58+ messages in thread
From: Bryan Larsen @ 2010-07-22 18:23 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Avery Pennarun, Jens Lehmann,
	Ævar Arnfjörð Bjarmason,
	git, Junio C Hamano, Linus Torvalds

>
> Using git-subtree has its warts too: I don't think for example that there is
> a way to get a log _automatically excluding_ the history of subtree-merged
> subprojects.  Or is it there?
>

It works exactly right for me when I used git-subtree in "squashed" 
mode.  Changes which were done in tree show up separately in the log, 
changes which were pulled in via git-subtree pull show up as a single 
summary entry in the log.

This discussion has been about how to improve git submodules, which is
sorely needed.  However, it's quite clear that git submodules will
never work as well as git subtrees in certain quite common situations.
If fixed, git submodules will be more appropriate in other situations.
However, I'm not asking to remove git submodules or prevent anybody
from fixing them; I'm just asking that git subtree be merged.

Does anybody actually oppose the merger of git-subtree, which has (at 
least) hundreds of users despite its out-of-tree status?

thanks,
Bryan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
       [not found]             ` <m31vavn8la.fsf@localhost.localdomain>
  2010-07-22 18:23               ` Bryan Larsen
@ 2010-07-22 19:41               ` Avery Pennarun
  2010-07-22 19:56                 ` Jonathan Nieder
                                   ` (3 more replies)
  1 sibling, 4 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-22 19:41 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jens Lehmann, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

On Thu, Jul 22, 2010 at 5:57 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> The tree object of the parent points at 'commit xxxx'.  But everything
>> in git has been *specially modified* to *just ignore* that 'commit
>> xxxx'.  It would have given exactly the same functionality - and much
>> less confusingly - if .gitmodules would just include the desired
>> commitid of the child project.  You could still have the same 'git
>> submodule' command with the same syntax and semantics.  And it
>> wouldn't have bastardized the git repo format.
>
> Actually the prototype implementation by Martin Waitz worked in such way,
> i.e. with special file in top directory holding SHA-1 of submodule commits,
> what you can read on https://git.wiki.kernel.org/index.php/SubprojectSupport
> page.
>
> The low level plumbing with 'commit' entries in the 'tree' object was
> created by Linus Torvalds (CC-ed).  I don't remember discussion about why
> this solution was chosen, though.  But please read about differences between
> git-subtree and git-submodule below.

I actually think Linus's contribution - the particular change to the
repo format to have trees link to commits - was exactly right.  If we
want to talk about failings of git-subtree, they all precisely come
down to the fact that, because it has tree->tree links instead of
tree->commit links, it has to stash commitid information in the commit
message, which is gross and error prone.

git-subtree would have benefitted from tree->commit links, but because
git's implementation of them is broken, that wasn't an option.
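
To make the "stash commitid information in the commit message" point concrete: git-subtree records its bookkeeping as git-subtree-dir: and git-subtree-split: lines in the commit message. The sketch below writes such a message by hand (it does not run git-subtree, and the commit id is a placeholder) and then recovers the data the only way possible, by parsing text:

```shell
#!/bin/sh
# Sketch only: imitate a git-subtree squash commit message and parse it back.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.com

# Placeholder subproject commit id; a real one comes from the subproject.
sub=0123456789abcdef0123456789abcdef01234567

# Hand-written imitation of the message git-subtree would generate.
git commit -q --allow-empty -F - <<EOF
Squashed 'vendor/gem13/' content from commit 0123456

git-subtree-dir: vendor/gem13
git-subtree-split: $sub
EOF

# The link to the subproject lives only in free-form message text.
dir=$(git log -1 --format=%B | sed -n 's/^git-subtree-dir: //p')
split=$(git log -1 --format=%B | sed -n 's/^git-subtree-split: //p')
echo "subtree at $dir was taken from subproject commit $split"
```

Anything that rewrites or edits the message (an amend, a careless rebase) can silently drop these lines, which is the fragility being described.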

Unfortunately everything built *on top of* Linus's file format
contribution has turned out to be a disaster.  Actually making the
subprojects have their own local .git repositories was a disaster, for
exactly the same reasons that having every subdir in svn have its own
.svn directory (or in cvs, every directory has its own CVS directory)
is a disaster.  When you split things up that way, you can't easily do
global atomic operations across the entire set of content.  And you
can accidentally have a subdir pointing at a totally different place
than the parent thinks it is.  And you have CVS/.svn/.git directories
cluttering stuff up everywhere.

The tree->commit links do not preclude you doing wonderful global
atomic operations across the entire set of content.  The separate
repository garbage absolutely does.
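
For readers who have not looked at one, a tree->commit link is simply a tree entry with mode 160000 and type 'commit'. The sketch below creates one with plumbing commands only, using a placeholder commit id and a hypothetical path name (gem13); no sub-repository is involved:

```shell
#!/bin/sh
# Sketch only: create and inspect a tree->commit link with plumbing commands.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Placeholder commit id; the object does not have to exist locally.
sub=0123456789abcdef0123456789abcdef01234567

# Mode 160000 marks an index entry as a link to a commit.
git update-index --add --cacheinfo 160000 "$sub" gem13
tree=$(git write-tree)

# The tree entry has type 'commit' rather than 'tree' or 'blob'.
entry=$(git ls-tree "$tree" gem13)
echo "$entry"
```

Nothing in this object format says where the linked commit is stored; that decision comes from the tooling layered on top.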

>> To wit:
>>
>> - cloning a local supermodule on my local machine to another copy:
>> every call to 'git submodule update' re-downloads submodule repos from
>> the remote machine, because the submodule path is hardcoded to point
>> at a remote machine.
>
> Errrr... the URL to submodule repository (I guess it is what you meant here
> by "submodule path") in the config file overrides URL to submodule
> repository in '.gitmodules' for a reason.  So the plumbing support is here,
> it is only failing of an UI that we don't have '--recursive-local' or
> '--convert-submodules' (like '--convert-links' in wget) in "git clone".

Let me be more specific.

I create an app named myapp on github:

   git://github.com/apenwarr/myapp

It uses 17 different ruby gems, which I import as subprojects.  I have
two choices:

[1] .gitmodules can use absolute paths to the original gem locations:

   git://github.com/rubygems/gem[1..n]

[2] Or else I can fork them all and use relative paths in .gitmodules:

   ../gem[1..n]
   translates to --> git://github.com/apenwarr/gem[1..n]

At this phase, both options are okay (though option #2 is obviously
much more work).  My next step will be to clone myapp onto my local
machine:

   git clone --recursive git://github.com/apenwarr/myapp

And it will grab all the submodules just fine.

Now let's say I want to change gem13.  If I used option #1, I have to
now go fork gem13 on github.  Then do one of the following:

[1A] Re-point my .git/config file to point at the new submodule
location, git://github.com/apenwarr/gem13 but leave .gitmodules alone

[1B] or update both .git/config and .gitmodules

If I do #1A, then when I push my changes, the *next* guy who clones
git://github.com/apenwarr/myapp will fail; the gem13 link in myapp
points at a commit that is *only* in apenwarr/gem13, not
rubygems/gem13.

If I do #1B, then if someone else does something similar in their own
copy and pulls from me, we will have a conflict in .gitmodules.

In both cases, if two people need to patch gem13 during their changes
to myapp, merges will fail because there is no submodule-recursive
merge (and trying to write one would be incredibly hard since it would
have to communicate across sub-repositories).

So if you do #1, then I don't know of any options other than #1A and
#1B, and neither one works.
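
The difference between #1A and #1B comes down to which file holds the URL. A sketch using only 'git config' in a throwaway repository (the gem13 URLs are the ones from the example above; no submodule is actually cloned):

```shell
#!/bin/sh
# Sketch only: the same submodule URL key in two different config files.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Option [1]: the tracked .gitmodules file names the original gem location.
git config -f .gitmodules submodule.gem13.path gem13
git config -f .gitmodules submodule.gem13.url git://github.com/rubygems/gem13

# [1A]: override the URL only in this clone's untracked .git/config.
git config submodule.gem13.url git://github.com/apenwarr/gem13

tracked=$(git config -f .gitmodules submodule.gem13.url)
local_only=$(git config submodule.gem13.url)
echo "what everyone else will clone from: $tracked"
echo "what this clone fetches from:       $local_only"
```

The override never leaves the local clone, so anyone else still follows the tracked URL and cannot find the new gem13 commit; committing the change to .gitmodules instead (#1B) is what produces the merge conflicts.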

Now, if I had done #2 instead, things are a little better, because
we're using relative paths in .gitmodules so when the second guy
clones a copy of myapp, he can also clone a copy of all 17 gems, and
all the paths will still work.

When the second guy does 'git pull apenwarr myapp' it will still fail,
though; it will try to get the latest gem13 from ../gem13 -->
secondguy/gem13, when actually the required commits are in
apenwarr/gem13.

Furthermore, 'git clone --recursive myapp myapp2' will totally fail,
because it will then expect gem[1..n] to all be in separate local
directories at the same level as myapp, which they aren't.  (You might
be saying: what do you need that for?  Well, I rarely do.  But
sometimes.  And as long as I don't use git-submodule, it works fine.)
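
A toy model of how a relative .gitmodules URL ends up pointing somewhere different for each clone. The resolve_relative helper is hypothetical and handles only a single leading "../"; git's real resolution covers more cases. The local path in the third call is made up:

```shell
#!/bin/sh
# Sketch only: simplified resolution of a relative submodule URL.
set -eu

resolve_relative() {
    base=$1   # where the superproject was cloned from
    rel=$2    # relative submodule URL as written in .gitmodules
    # Drop the last path component of base, then append rel minus "../".
    printf '%s/%s\n' "${base%/*}" "${rel#../}"
}

mine=$(resolve_relative git://github.com/apenwarr/myapp ../gem13)
theirs=$(resolve_relative git://github.com/secondguy/myapp ../gem13)
local_copy=$(resolve_relative /home/me/src/myapp ../gem13)

echo "$mine"        # git://github.com/apenwarr/gem13
echo "$theirs"      # git://github.com/secondguy/gem13
echo "$local_copy"  # /home/me/src/gem13
```

The same tracked string yields three different locations, and only the first is guaranteed to contain the commits the superproject refers to.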

You can fix warts all day long.  You can't make it work, because it's
not just warts; the insides are rotten.

>> - branching a local supermodule on my local machine: fails to branch
>> the submodule automatically and makes it super easy to lose patches
>> altogether (since by default, they're committed to a detached HEAD).
>
> That's a UI problem, too.  Though I guess that using a detached HEAD was
> chosen because it is the simplest solution.

I've seen the discussion about submodule branch names go by on the git
list a few times, and I participated once or twice.  The current
option was certainly chosen because it's the simplest; unfortunately,
it's also non-functional, and all the other options are also awful.

Here it is in a nutshell: if I'm branching myapp, I already have a
branch that I want to store all my changes under; it's the branch I'm
working on in myapp.  That's not to say I want that same myapp branch
name *in my gem13 repository*; my branchname is probably something
like add-feature-to-myapp, which has nothing to do with gem13.  The
changes required to gem13 to implement add-feature-to-myapp are
probably just a tiny bugfix or config option.  gem13 doesn't know
anything about myapp.  The upstream gem13 maintainers certainly don't
care about myapp.  As a guy who *just wants to get work done on myapp
right now*, thinking about what to name my trivial one-patch temporary
gem13 branch is a *waste of time*.

I don't *want* my gem13 changes to have a branchname.

So the disconnected HEAD is the right answer then, right?  No!  The
default disconnected HEAD makes it *far* too easy to lose my changes.
I don't want to name my branch, but I *have* to, because I *have* to
push it somewhere separately, because if I don't, then my changes to
myapp will be useless to everyone who tries to pull from me.

The question of what to name the submodule branch is unanswerable
because it's the wrong question.

> Otherwise you would have either
> put submodule branch name in '.gitmodules' (but that's contrary to git
> philosophy that branches are ephemeral and branch names are local matter),

Surely including *repository URLs* inside the *repository content* is
at least as bad as including branch names.  If we're going to do one,
we might as well do the other.  But it won't help, because the stored
branch name will probably be 'master', and my personal hacked-up copy
of gem13 shouldn't be on a branch named master anyway.

>> - pulling/merging: always causes a conflict if local and remote have
>> modified the same submodule.
>>
>> - rebasing: always causes a conflict if local and remote have modified
>> the same submodule.  Also requires you to rebase submodules separately
>> from the supermodule.  (Yes, this happens often in real life.)
>
> That's a matter of UI, and lack of merge strategy that can merge
> submodules... although if I remember correctly there was some preliminary or
> proof of concept work on submodule-aware merge strategy.
>
> "git merge" and "git rebase" would have to acquire '--recursive' option.
> Currently you probably need to use 'git submodule foreach ...', I guess.

Merge and rebase are actually very different here.  Merging is
something I might expect to work across submodules eventually;
rebasing is much less obvious, because successive versions of myapp
might actually be jumping back and forth between versions of gem13.
Then what does it mean to auto-rebase gem13 when you're rebasing
myapp?

You should check out git-subtree --squash here; it's quite interesting
and makes rebasing easy, even if the subtree version is alternating
back and forth.  I'm not sure how you'd map it onto git-submodule,
though, even if git-submodule weren't broken.

>> - submitting upstream: requires me to have a separate repo that's a
>> copy of the upstream repo, and to manage at least one subrepo branch
>> for every superproject branch, just to track my submissions.  With
>> git-subtree, no extra repos are necessary.
>
> NOTE that it is important design decision to have by default separate object
> storage for submodules.

I certainly won't deny that :)  This discussion is about whether it
was the right decision.

> First, this allows you to skip cloning a submodule and not download its objects.
> This is *impossible* with git-subtree (when using the 'subtree' merge strategy).
> I'm not sure how commonly this feature is used in real life, but somebody
> here in this thread gave the example of a submodule with artwork, which is large
> because it contains large / many binary files, while being required by
> only some users.
>
> Second, from what I remember this was implemented also for performance
> reasons... though I don't remember the reasoning used.

I think this ended up being a terrible mistake.  The problems you
identify come down to this:

1) Sometimes I want to clone only some subdirs of a project
2) Sometimes I don't want the entire history because it's too big.
3) Super huge git repositories start to degrade in performance.

(Actually #3 isn't really a problem as far as I've ever seen, and bup
stores hundreds of gigs, including trees that reference millions of
blobs, in a single git repo without dying.  But okay, maybe this is a
problem sometimes for some types of operations.)

These problems come up regardless of whether you're using submodules.
The hard truth of the matter is that people are using submodules to
try to solve these problems, but they were never caused by the lack of
submodules in the first place.

When I clone the Linux kernel, sometimes I just don't want the entire
history.  That's why people invented shallow clones (although last
time I checked, they were still a little half-assed).

When I clone KDE, sometimes I don't want all the subprograms;
sometimes I do. That's why people invented sparse checkouts, and why
(I think) it would be nice to have sparse clones as well (where you
don't even download the objects for subtrees you don't care about).

There is simply not a clear path from "my repo is too big" to "all my
problems will be solved if git-submodule is implemented correctly."

The truth is, problems 1-3 are easily solvable by improving the git
implementation, without any change in architecture and without
requiring people to lay out their projects differently.
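
Both existing mechanisms mentioned above can be tried on a single repository. This sketch builds a small stand-in project (the directory names kmail and konqueror are only illustrative), takes a shallow clone of it, and then restricts the working tree with the core.sparseCheckout mechanism:

```shell
#!/bin/sh
# Sketch only: shallow clone plus sparse checkout, no submodules involved.
set -eu

work=$(mktemp -d)
cd "$work"

# A stand-in "big" project: three commits and two subprograms.
git init -q big
cd big
git config user.name demo
git config user.email demo@example.com
mkdir kmail konqueror
for n in 1 2 3; do
    echo "rev $n" > kmail/README
    echo "rev $n" > konqueror/README
    git add .
    git commit -q -m "revision $n"
done
cd ..

# Problem 2 (history too big): fetch only the most recent commit.
# A file:// URL is used because --depth is ignored for plain local paths.
git clone -q --depth 1 "file://$work/big" shallow
depth=$(git -C shallow rev-list --count HEAD)

# Problem 1 (only some subdirs wanted): keep just kmail/ in the work tree.
cd shallow
git config core.sparseCheckout true
mkdir -p .git/info
echo "kmail/" > .git/info/sparse-checkout
git read-tree -mu HEAD
cd ..

echo "commits in shallow clone: $depth"
ls shallow
```

Note that the sparse checkout still downloads the objects for konqueror/; avoiding that download is the "sparse clone" wished for above.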

The *real* need for submodules - the need you can't fix without
submodules - has nothing to do with these requirements.  It's about
each submodule wanting to have its own lifecycle, owner, changelog,
and release process, and - perhaps this is actually the killer
requirement - each supermodule wanting to be able to cleanly rewind a
submodule if they don't like the new version.

>> It's very clear that git-submodule's current behaviour totally
>> mismatches the entire git philosophy.  That's why it's so impossible
>> to make the git-submodule command usable.
>
> That's very strong accusation.

Agreed... but that doesn't make it wrong :)

> Using git-subtree has its warts too: I don't think for example that there is
> a way to get a log _automatically excluding_ the history of subtree-merged
> subprojects.  Or is it there?

There's git-subtree merge --squash.  It's pretty cool.  Also insane
and not as good as real tree->commit links.  I will gladly admit to
git-subtree's warts.

>
> Submodule                          | Subtree
> -----------------------------------+----------------------------------
> must clone recursively submodules; | automatically gets all subtrees

Yup.

> can not clone some submodules      | cannot leave out some subtree, but
>                                   | nowadays can not checkout it

I don't understand what you mean on the right-hand side here.  FWIW,
subtree forces you to always checkout the entire thing (unless you use
git sparse checkouts, I guess; maybe that's what you mean).

> rebase and merge needs separate    | rebase and merge works normally
> work in submodule currently        |

True.

> easy to send updates upstream      | need not to worry about submodule
> to submodule repo                  | repository

It's actually easy to send subtree updates upstream with the new 'git
subtree push' command, which was contributed recently.  Or you can
send them via format-patch if you use 'git subtree split'.  It's one
line more of typing than doing it on a submodule repo, and that one
line is greatly offset by the hugely reduced typing by not using
submodules.
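
A sketch of the 'git subtree split' step in a throwaway repository. It assumes git-subtree is installed and skips the split otherwise; the branch name gem13-export and the remote URL in the comments are hypothetical, and the push itself is shown only as a comment because it needs a real remote:

```shell
#!/bin/sh
# Sketch only: extract a subdirectory's history so it can be sent upstream.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.com

# A superproject whose gem13/ directory holds the subproject's files.
mkdir gem13
echo "v1" > gem13/lib.rb
git add gem13
git commit -q -m "import gem13"
echo "app" > app.rb
git add app.rb
git commit -q -m "work on myapp"
echo "v2" > gem13/lib.rb
git commit -q -am "fix bug in gem13"

# Put only the gem13/ history on its own branch.
if split=$(git subtree split --prefix=gem13 -b gem13-export 2>/dev/null); then
    git log --format=%s gem13-export
    # From here the change could be mailed out, for example:
    #   git format-patch -1 gem13-export
    # or pushed in one step (placeholder URL):
    #   git subtree push --prefix=gem13 git://example.com/gem13 master
else
    split=
    echo "git-subtree not available (or split failed); nothing was split"
fi
```

The split branch should contain only the commits that touched gem13/, with the files moved to the top level, so upstream sees ordinary gem13 history.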

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:41               ` Avery Pennarun
@ 2010-07-22 19:56                 ` Jonathan Nieder
  2010-07-22 20:06                   ` Avery Pennarun
                                     ` (2 more replies)
  2010-07-23  8:31                 ` Chris Webb
                                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 58+ messages in thread
From: Jonathan Nieder @ 2010-07-22 19:56 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

Avery Pennarun wrote:

> Unfortunately everything built *on top of* Linus's file format
> contribution has turned out to be a disaster.

Aside: this kind of statement might make it unlikely for exactly
those who would benefit most from your opinions to read them.

Well, that is my guess, anyway.  I know that I have not found the time
to read your email (though I would like to) because I suspect based on
such sweeping statements that it would take a while to separate the
useful part from the rest.

Of course I am glad to see people thinking about these issues.
My comment is only about how the results get presented.

Jonathan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:56                 ` Jonathan Nieder
@ 2010-07-22 20:06                   ` Avery Pennarun
  2010-07-22 20:17                   ` Ævar Arnfjörð Bjarmason
  2010-07-22 20:43                   ` Elijah Newren
  2 siblings, 0 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-22 20:06 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jakub Narebski, Jens Lehmann, Ævar Arnfjörð,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

On Thu, Jul 22, 2010 at 3:56 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Avery Pennarun wrote:
>> Unfortunately everything built *on top of* Linus's file format
>> contribution has turned out to be a disaster.
>
> Aside: this kind of statement might make it unlikely for exactly
> those who would benefit most from your opinions to read them.
>
> Well, that is my guess, anyway.  I know that I have not found the time
> to read your email (though I would like to) because I suspect based on
> such sweeping statements that it would take a while to separate the
> useful part from the rest.

Unfortunately you will find that the rest of my email more or less
just expands in detail on those sweeping statements.

Sorry.

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:56                 ` Jonathan Nieder
  2010-07-22 20:06                   ` Avery Pennarun
@ 2010-07-22 20:17                   ` Ævar Arnfjörð Bjarmason
  2010-07-22 21:33                     ` Avery Pennarun
  2010-07-22 20:43                   ` Elijah Newren
  2 siblings, 1 reply; 58+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-22 20:17 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Avery Pennarun, Jakub Narebski, Jens Lehmann, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Thu, Jul 22, 2010 at 19:56, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Avery Pennarun wrote:
>
>> Unfortunately everything built *on top of* Linus's file format
>> contribution has turned out to be a disaster.
>
> Aside: this kind of statement might make it unlikely for exactly
> those who would benefit most from your opinions to read them.
>
> Well, that is my guess, anyway.  I know that I have not found the time
> to read your email (though I would like to) because I suspect based on
> such sweeping statements that it would take a while to separate the
> useful part from the rest.
>
> Of course I am glad to see people thinking about these issues.
> My comment is only about how the results get presented.

Well, it's not like Linus is the image of calmness when attacking
something he perceives as crap design either >:)

Anyway, to answer Bryan's question. My comments in previous messages
shouldn't be interpreted as opposition to git-subtree being merged at
all. It's clearly very useful, especially for cases where
git-submodule is wanting. I'd be happy to review a patch that
integrated it into the Git tree.

But it's also clear that we have a lot of tribal knowledge about the
lackings of git submodule / git subtree. It would be *really* useful
if people like Avery and Jens who have obviously thought hard about
the submodule/subtree issues would draft up some (calmly written) docs
about how the two differ (with comparison tables etc.).

That'd be a very helpful resource for Git users in deciding which one
to use.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:56                 ` Jonathan Nieder
  2010-07-22 20:06                   ` Avery Pennarun
  2010-07-22 20:17                   ` Ævar Arnfjörð Bjarmason
@ 2010-07-22 20:43                   ` Elijah Newren
  2010-07-22 21:32                     ` Avery Pennarun
  2 siblings, 1 reply; 58+ messages in thread
From: Elijah Newren @ 2010-07-22 20:43 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Avery Pennarun, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

Hi,

On Thu, Jul 22, 2010 at 1:56 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Avery Pennarun wrote:
>
>> Unfortunately everything built *on top of* Linus's file format
>> contribution has turned out to be a disaster.
>
> Aside: this kind of statement might make it unlikely for exactly
> those who would benefit most from your opinions to read them.
>
> Well, that is my guess, anyway.  I know that I have not found the time
> to read your email (though I would like to) because I suspect based on
> such sweeping statements that it would take a while to separate the
> useful part from the rest.

I'd usually agree with such a sentiment, but I don't think it's
accurate in this case.  Having read Avery's emails in this thread, I
think he does a really good job explaining why submodules don't (and
won't) work for a lot of people.  I think he provided a better
explanation than I could have for why I've never had much luck with
submodules (and further convinced me that not only do they not work
for me now, but they aren't ever going to fulfill the usecases I had).

I can't really add much other than that we've been relatively happy
with git-subtree and would like to see it or something like it merged.
 Our problems with it so far have turned out to be issues in other
areas of git (e.g. the known issue about --prefix being ignored with
the code being merged under a different directory due to
rename-detection, and the bugs in merge-recursive's handling of D/F
changes).


Elijah

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 20:43                   ` Elijah Newren
@ 2010-07-22 21:32                     ` Avery Pennarun
  0 siblings, 0 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-22 21:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jonathan Nieder, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

On Thu, Jul 22, 2010 at 4:43 PM, Elijah Newren <newren@gmail.com> wrote:
> (e.g. the known issue about --prefix being ignored with
> the code being merged under a different directory due to
> rename-detection, [...])

Aside: if this is the bug I think it is, then it's fixed by the git
merge -Xsubtree feature, which has since been merged into git.  (I
think Elijah knew that, I just wanted to make sure it's clear to
anyone else reading.)

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 20:17                   ` Ævar Arnfjörð Bjarmason
@ 2010-07-22 21:33                     ` Avery Pennarun
  2010-07-23 15:10                       ` Jens Lehmann
  2010-07-26 17:34                       ` Eugene Sajine
  0 siblings, 2 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-22 21:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Jakub Narebski, Jens Lehmann, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Thu, Jul 22, 2010 at 4:17 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> But it's also clear that we have a lot of tribal knowledge about the
> lackings of git submodule / git subtree. It would be *really* useful
> if people like Avery and Jens who have obviously thought hard about
> the submodule/subtree issues would draft up some (calmly written) docs
> about how the two differ (with comparison tables etc.).
>
> That'd be a very helpful resource for Git users in deciding which one
> to use.

I think I'm too biased to write that, but if someone else wants to
take the lead, I could certainly contribute.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:41               ` Avery Pennarun
  2010-07-22 19:56                 ` Jonathan Nieder
@ 2010-07-23  8:31                 ` Chris Webb
  2010-07-23  8:40                   ` Avery Pennarun
  2010-07-23 15:10                 ` Jens Lehmann
  2010-07-23 15:19                 ` Marc Branchaud
  3 siblings, 1 reply; 58+ messages in thread
From: Chris Webb @ 2010-07-23  8:31 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Jens Lehmann, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

Avery Pennarun <apenwarr@gmail.com> writes:

> I actually think Linus's contribution - the particular change to the
> repo format to have trees link to commits - was exactly right.  If we
> want to talk about failings of git-subtree, they all precisely come
> down to the fact that, because it has tree->tree links instead of
> tree->commit links, it has to stash commitid information in the commit
> message, which is gross and error prone.
> 
> git-subtree would have benefitted from tree->commit links, but because
> git's implementation of them is broken, that wasn't an option.

I considered using submodules for one of my projects, and decided against
for some of the usability reasons with multiple repositories which you
highlight. (I didn't know about subtree.)

You've surely considered this already, but reading your description in this
thread, my first thought is that commits within trees could mean different
things depending on whether they're at paths listed in .gitmodules or not.
If the path is listed, the commit is in an external repository. If it isn't,
it's a reference to a local commit, allowing submodules to live in the same
repo as their parent and share some of the advantages you describe for
sub-tree.

Over time, git could then become smarter about recursing through commits in
trees, although I can see a potential problem with needing to know about a
.gitmodules blob in the top-level tree when we're examining a deeper level
tree.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23  8:31                 ` Chris Webb
@ 2010-07-23  8:40                   ` Avery Pennarun
  2010-07-23 15:11                     ` Jens Lehmann
  2010-07-23 15:13                     ` Jens Lehmann
  0 siblings, 2 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-23  8:40 UTC (permalink / raw)
  To: Chris Webb
  Cc: Jakub Narebski, Jens Lehmann, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 4:31 AM, Chris Webb <chris@arachsys.com> wrote:
> You've surely considered this already, but reading your description in this
> thread, my first thought is that commits within trees could mean different
> things depending on whether they're at paths listed in .gitmodules or not.
> If the path is listed, the commit is in an external repository. If it isn't,
> it's a reference to a local commit, allowing submodules to live in the same
> repo as their parent and share some of the advantages you describe for
> sub-tree.

I think it would be better if we could abandon .gitmodules entirely;
it's really only useful for listing repository URLs, and listing
repository URLs is a major part of the problem.

Something that would be neat, and at least vaguely backward-compatible
would be to simply *try* fetching the linked commit objects from a
remote repo, and checking them out from the local repo.  If the
objects exist, fetch/checkout of them will just work; if they don't,
then it can (for backwards compatibility) revert to the current
behaviour.  Push would, if the objects exist, send them to the remote
repo.

Then there could be a .gitconfig option that flips this new behaviour
on and off, ie. auto-checkouts subprojects that *can* be checked out
without any extra knowledge, or not.  If not, then you have to use the
old-style git submodule stuff.
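
A minimal sketch of the fallback described above, written in Python purely
for illustration. The function name `plan_checkout`, the `auto_checkout`
flag, and the in-memory set standing in for the object database are all
hypothetical; nothing here is existing git behaviour or API.

```python
from typing import Dict, Set


def plan_checkout(gitlinks: Dict[str, str],
                  local_objects: Set[str],
                  auto_checkout: bool = True) -> Dict[str, str]:
    """Decide, per linked path, how a subproject would be checked out.

    gitlinks      -- maps a tree path to the commit id linked at that path
    local_objects -- commit ids already present in the superproject's repo
    auto_checkout -- stands in for the proposed on/off config option

    Returns a mapping of path -> "local" (the commit object is available in
    the same repo, so it can simply be checked out) or "submodule" (fall
    back to the current, separately cloned behaviour).
    """
    plan = {}
    for path, commit_id in gitlinks.items():
        if auto_checkout and commit_id in local_objects:
            plan[path] = "local"
        else:
            plan[path] = "submodule"
    return plan


if __name__ == "__main__":
    links = {"lib/a": "c1", "lib/b": "c2"}
    print(plan_checkout(links, local_objects={"c1"}))
```

The point of the sketch is only that the decision can be made per linked
commit, without consulting any list of repository URLs.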

(This proposal is not as easy as it sounds; to do it *right* would
involve not having a separate .git repo for each subproject.  That
means changes to the index file format and a bunch of related stuff.
Though I guess you could keep the sub-repo stuff and it would still be
better than what we have now.)

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:41               ` Avery Pennarun
  2010-07-22 19:56                 ` Jonathan Nieder
  2010-07-23  8:31                 ` Chris Webb
@ 2010-07-23 15:10                 ` Jens Lehmann
  2010-07-23 16:05                   ` Bryan Larsen
  2010-07-23 22:32                   ` Avery Pennarun
  2010-07-23 15:19                 ` Marc Branchaud
  3 siblings, 2 replies; 58+ messages in thread
From: Jens Lehmann @ 2010-07-23 15:10 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds, Heiko Voigt

Am 22.07.2010 21:41, schrieb Avery Pennarun:
> I create an app named myapp on github:
> 
>    git://github.com/apenwarr/myapp
> 
> It uses 17 different ruby gems, which I import as subprojects.  I have
> two choices:
> 
> [1] .gitmodules can use absolute paths to the original gem locations:
> 
>    git://github.com/rubygems/gem[1..n]
> 
> [2] Or else I can fork them all and use relative paths in .gitmodules:
> 
>    ../gem[1..n]
>    translates to --> git://github.com/apenwarr/gem[1..n]

You forgot what we do as best practice at work:

[3] Fork the gem repos on github (or another server reachable by your
    co-workers) and use those, so you don't have to change the URL
    later:

    git://github.com/apenwarrrubygems/gem[1..n]

Your problems go away, setup has to be done only once on project
start and not for every developer, you can use your own branchnames
and you have a staging repo from where you can push patches upstream
if necessary.
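
For readers unfamiliar with choice [2] quoted above, here is a rough Python
sketch of how a relative URL in .gitmodules can be resolved against the
superproject's URL. It is a simplification for illustration, not git's
actual resolution code; `resolve_submodule_url` and the `gem1` name are made
up for the example.

```python
def resolve_submodule_url(super_url: str, sub_url: str) -> str:
    """Resolve a .gitmodules URL relative to the superproject's URL.

    URLs that do not start with "./" or "../" are treated as absolute and
    returned unchanged. Each leading "../" strips one trailing path
    component from the superproject URL.
    """
    if not sub_url.startswith(("./", "../")):
        return sub_url

    base = super_url.rstrip("/")
    rest = sub_url
    while True:
        if rest.startswith("../"):
            base = base.rsplit("/", 1)[0]  # drop the last path component
            rest = rest[3:]
        elif rest.startswith("./"):
            rest = rest[2:]
        else:
            break
    return base + "/" + rest


if __name__ == "__main__":
    print(resolve_submodule_url("git://github.com/apenwarr/myapp", "../gem1"))
```

With relative URLs the submodule location follows wherever the superproject
was cloned from, which is why choice [2] requires forking every gem
alongside the application.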


> Surely including *repository URLs* inside the *repository content* is
> at least as bad as including branch names.  If we're going to do one,
> we might as well do the other.  But it won't help, because the stored
> branch name will probably be 'master', and my personal hacked-up copy
> of gem13 shouldn't be on a branch named master anyway.

You sure are aware that having a branch name associated with a
submodule checkout is a request repeatedly made?


> The *real* need for submodules - the need you can't fix without
> submodules - has nothing to do with these requirements.  It's about
> each submodule wanting to have its own lifecycle, owner, changelog,
> and release process, and - perhaps this is actually the killer
> requirement - each supermodule wanting to be able to cleanly rewind a
> submodule if they don't like the new version.

That is just one example. Another one is code shared between
different repos (think: libraries) where you want to make sure that
a bugfix in the library made in project A will make it to the shared
code repo and thus doesn't have to be fixed again by projects B to X.
This was one of the reasons we preferred submodules over subtrees
in our evaluation, because there is no incentive to push fixes inside
the subtree back to its own repo like there is when using submodules.


>>> It's very clear that git-submodule's current behaviour totally
>>> mismatches the entire git philosophy.  That's why it's so impossible
>>> to make the git-submodule command usable.
>>
>> That's very strong accusation.
> 
> Agreed... but that doesn't make it wrong :)

But calling a feature "impossible to make ... usable" is an
interesting thing to say about a feature lots of people are
using productively in their daily work, no? ;-)


>> rebase and merge needs separate    | rebase and merge works normally
>> work in submodule currently        |
> 
> True.

Nope, there is a patch in pu doing
that when it is a simple fast forward
and giving you advice when both sides
are already merged inside the submodule
(CCed Heiko, because he is the author
of that feature)

It is the /commits/ that have to be
done twice, once in the submodule and
then in the superproject. (But that is
not necessarily bad, imagine having git
gui as a submodule: you would be
automagically reminded that stuff for
git gui should be sent somewhere else
than to Junio).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 21:33                     ` Avery Pennarun
@ 2010-07-23 15:10                       ` Jens Lehmann
  2010-07-26 17:34                       ` Eugene Sajine
  1 sibling, 0 replies; 58+ messages in thread
From: Jens Lehmann @ 2010-07-23 15:10 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jakub Narebski, Bryan Larsen, git, Junio C Hamano,
	Linus Torvalds

Am 22.07.2010 23:33, schrieb Avery Pennarun:
> On Thu, Jul 22, 2010 at 4:17 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> But it's also clear that we have a lot of tribal knowledge about the
>> lackings of git submodule / git subtree. It would be *really* useful
>> if people like Avery and Jens which have obviously thought hard about
>> the submodule/subtree issues would draft up some (calmly written) docs
>> about how the two differ (with comparison tables etc.).
>>
>> That'd be a very helpful resource for Git users in deciding which one
>> to use.
> 
> I think I'm too biased to write that, but if someone else wants to
> take the lead, I could certainly contribute.

While I don't consider myself biased, I just don't know enough about
the details of the subtree approach to write that.

But I would certainly contribute to the submodule side of such a
document too.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23  8:40                   ` Avery Pennarun
@ 2010-07-23 15:11                     ` Jens Lehmann
  2010-07-23 22:33                       ` Avery Pennarun
  2010-07-23 15:13                     ` Jens Lehmann
  1 sibling, 1 reply; 58+ messages in thread
From: Jens Lehmann @ 2010-07-23 15:11 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Chris Webb, Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

Am 23.07.2010 10:40, schrieb Avery Pennarun:
> I think it would be better if we could abandon .gitmodules entirely;
> it's really only useful for listing repository URLs, and listing
> repository URLs is a major part of the problem.

Then where do you get the URL to clone the submodule from on "git
clone --recursive"?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23  8:40                   ` Avery Pennarun
  2010-07-23 15:11                     ` Jens Lehmann
@ 2010-07-23 15:13                     ` Jens Lehmann
  1 sibling, 0 replies; 58+ messages in thread
From: Jens Lehmann @ 2010-07-23 15:13 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Chris Webb, Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

Am 23.07.2010 10:40, schrieb Avery Pennarun:
> I think it would be better if we could abandon .gitmodules entirely;
> it's really only useful for listing repository URLs, and listing
> repository URLs is a major part of the problem.

Then where do you get the URL to clone the submodule from on "git
clone --recursive"?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-22 19:41               ` Avery Pennarun
                                   ` (2 preceding siblings ...)
  2010-07-23 15:10                 ` Jens Lehmann
@ 2010-07-23 15:19                 ` Marc Branchaud
  2010-07-23 22:50                   ` Avery Pennarun
  3 siblings, 1 reply; 58+ messages in thread
From: Marc Branchaud @ 2010-07-23 15:19 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On 10-07-22 03:41 PM, Avery Pennarun wrote:
> 
> 1) Sometimes I want to clone only some subdirs of a project
> 2) Sometimes I don't want the entire history because it's too big.
> 3) Super huge git repositories start to degrade in performance.

The reason we turned to submodules is precisely to deal with repository size.
 Our code base encompasses the entire FreeBSD tree plus different versions of
the Linux kernel, along with various third-party libraries & apps.  You don't
need everything to build a given product (a FreeBSD product doesn't use any
Linux kernels, for example) but because all the products share common code we
need to be able to branch and tag the common code along with the uncommon code.

So a straight "git clone" that would need to fetch all of FreeBSD plus 4
different Linux kernels and check all that out is a major problem, especially
for our automated build system (which could definitely be implemented better,
but still).  In truth it's the checkout that takes the most time by far,
though commands like git-status also take inconveniently long.

We chose git-submodule over git-subtree mainly because git-submodule lets us
selectively checkout different parts of our code.  (AFAIK sparse checkouts
aren't yet an option.)  We didn't really consider git-subtree because it's
not an official part of git, and we didn't want to have to teach (and nag)
all our developers to install and maintain it in addition to keeping up with
git itself.  Besides, git-submodule's collection-of-independent-repos model
works fairly well in our situation, though the implementation could
definitely be improved (and Jens's list is a really good start).

Neither submodule nor subtree really solves our situation, but right now
git-submodule is the only thing "official" git offers to manage
loosely-coupled code.  It would be nice to see git-subtree added to the
toolkit, but it would be even nicer if git had better ways to deal with
"vast" repositories.

Another tool folks should keep in mind in this discussion is 'repo' which
Google built for the Android project.  Android's code base is also too vast
to work well in a single git repository, and I don't think subtrees or
submodules would be a good match for them either.

		M.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 15:10                 ` Jens Lehmann
@ 2010-07-23 16:05                   ` Bryan Larsen
  2010-07-23 17:11                     ` Jens Lehmann
  2010-07-23 22:32                   ` Avery Pennarun
  1 sibling, 1 reply; 58+ messages in thread
From: Bryan Larsen @ 2010-07-23 16:05 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Avery Pennarun, Jakub Narebski,
	Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Linus Torvalds, Heiko Voigt

On 10-07-23 11:10 AM, Jens Lehmann wrote:
> Am 22.07.2010 21:41, schrieb Avery Pennarun:
>> I create an app named myapp on github:
>>
>>     git://github.com/apenwarr/myapp
>>
>> It uses 17 different ruby gems, which I import as subprojects.  I have
>> two choices:
>>
>> [1] .gitmodules can use absolute paths to the original gem locations:
>>
>>     git://github.com/rubygems/gem[1..n]
>>
>> [2] Or else I can fork them all and use relative paths in .gitmodules:
>>
>>     ../gem[1..n]
>>     translates to -->  git://github.com/apenwarr/gem[1..n]
>
> You forgot what we do as best practice at work:
>
> [3] Fork the gem repos on github (or another server reachable by your
>      co-workers) and use those, so you don't have to change the URL
>      later:
>
>      git://github.com/apenwarrrubygems/gem[1..n]
>
> Your problems go away, setup has to be done only once on project
> start and not for every developer, you can use your own branchnames
> and you have a staging repo from where you can push patches upstream
> if necessary.

What's best practice for open source projects?   I do this, but nobody 
except my coworkers can push to my forks, so it's a huge rigamarole just 
to get a fix into a submodule.

>
> That is just one example. Another one is code shared between
> different repos (think: libraries) where you want to make sure that
> a bugfix in the library made in project A will make it to the shared
> code repo and thus doesn't have to be fixed again by projects B to X.
> This was one of the reasons we preferred submodules over subtrees
> in our evaluation, because there is no incentive to push fixes inside
> the subtree back to its own repo like there is when using submodules.

But you stated above that each project has its own fork of the library. 
   So there's no special incentive to push changes from the fork back to 
its master repo.

>
>
>>>> It's very clear that git-submodule's current behaviour totally
>>>> mismatches the entire git philosophy.  That's why it's so impossible
>>>> to make the git-submodule command usable.
>>>
>>> That's very strong accusation.
>>
>> Agreed... but that doesn't make it wrong :)
>
> But calling a feature "impossible to make ... usable" is an
> interesting thing to say about a feature lots of people are
> using productively in their daily work, no? ;-)

In my experience, it's possible to make it usable if and only if:

1.  you have a small team
2.  all of whom are very comfortable with git
3.  changes inside submodules are either infrequent or only happen in a 
single direction
4.  the project is not public/open source

I think #4 is the killer reason why submodules don't work.  It works 
fine if the submodule is fairly independent, but if you have a patch to 
the submodule that was created for and in the context of the 
superproject, things get really annoying really quickly.

Bryan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 16:05                   ` Bryan Larsen
@ 2010-07-23 17:11                     ` Jens Lehmann
  2010-07-23 19:01                       ` Bryan Larsen
  0 siblings, 1 reply; 58+ messages in thread
From: Jens Lehmann @ 2010-07-23 17:11 UTC (permalink / raw)
  To: Bryan Larsen
  Cc: Avery Pennarun, Jakub Narebski,
	Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Linus Torvalds, Heiko Voigt

Am 23.07.2010 18:05, schrieb Bryan Larsen:
> On 10-07-23 11:10 AM, Jens Lehmann wrote:
>> That is just one example. Another one is code shared between
>> different repos (think: libraries) where you want to make sure that
>> a bugfix in the library made in project A will make it to the shared
>> code repo and thus doesn't have to be fixed again by projects B to X.
>> This was one of the reasons we preferred submodules over subtrees
>> in our evaluation, because there is no incentive to push fixes inside
>> the subtree back to its own repo like there is when using submodules.
> 
> But you stated above that each project has its own fork of the library.   So there's no special incentive to push changes from the fork back to its master repo.

When you are not working on your own, it is preferable to be able to
get changes upstream into a submodule's repo to share them.
So if you can do that (either via push or patches sent by email or
whatever), then use its URL directly (and then you have the incentive
that fixes get pushed, which is nice).
Or you can't, then use a fork reachable by the people you work with
(then you still can see all fixes made by your group in the forked
repo and can decide to push them upstream). Then pushing fixes back
to the original repo is a matter of courtesy, as it is with every
other work flow I know.
And I think that is just the same thing we all do with plain git
repos when working with others: If you can push, you use it directly
to clone from, if you can't, you fork it.


> In my experience, it's possible to make it usable if and only if:
> 
> 1.  you have a small team
> 2.  all of whom are very comfortable with git
> 3.  changes inside submodules are either infrequent or only happen in a single direction
> 4.  the project is not public/open source
>
> I think #4 is the killer reason why submodules don't work.  It works fine if the submodule is fairly independent, but if you have a patch to the submodule that was created for and in the context of the superproject, things get really annoying really quickly.

What is the problem with the "forked repo" solution for #4?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 17:11                     ` Jens Lehmann
@ 2010-07-23 19:01                       ` Bryan Larsen
  0 siblings, 0 replies; 58+ messages in thread
From: Bryan Larsen @ 2010-07-23 19:01 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Avery Pennarun, Jakub Narebski,
	Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Linus Torvalds, Heiko Voigt

On 10-07-23 01:11 PM, Jens Lehmann wrote:
> Am 23.07.2010 18:05, schrieb Bryan Larsen:
>> On 10-07-23 11:10 AM, Jens Lehmann wrote:
>>> That is just one example. Another one is code shared between
>>> different repos (think: libraries) where you want to make sure that
>>> a bugfix in the library made in project A will make it to the shared
>>> code repo and thus doesn't have to be fixed again by projects B to X.
>>> This was one of the reasons we preferred submodules over subtrees
>>> in our evaluation, because there is no incentive to push fixes inside
>>> the subtree back to its own repo like there is when using submodules.
>>
>> But you stated above that each project has its own fork of the library.   So there's no special incentive to push changes from the fork back to its master repo.
>
> When you are not working on your own, it is preferable to be able to
> get changes upstream into a submodule's repo to share them.
> So if you can do that (either via push or patches sent by email or
> whatever), then use its URL directly (and then you have the incentive
> that fixes get pushed, which is nice).
> Or you can't, then use a fork reachable by the people you work with
> (then you still can see all fixes made by your group in the forked
> repo and can decide to push them upstream). Then pushing fixes back
> to the original repo is a matter of courtesy, as it is with every
> other work flow I know.
> And I think that is just the same thing we all do with plain git
> repos when working with others: If you can push, you use it directly
> to clone from, if you can't, you fork it.

So basically you're saying: sometimes you can use a non-forked 
repository, which has a whole bunch of disadvantages, but has the minor 
advantage that you're "forced" to push your changes upstream.

Which I see as a disadvantage because that means you're pushing untested 
changes.

Or else you use a forked repo, which is basically the same as using 
git-subtree, except for a lot of additional admin hassle.

>
>
>> In my experience, it's possible to make it usable if and only if:
>>
>> 1.  you have a small team
>> 2.  all of whom are very comfortable with git
>> 3.  changes inside submodules are either infrequent or only happen in a single direction
>> 4.  the project is not public/open source
>>
>> I think #4 is the killer reason why submodules don't work.  It works fine if the submodule is fairly independent, but if you have a patch to the submodule that was created for and in the context of the superproject, things get really annoying really quickly.
>
> What is the problem with the "forked repo" solution for #4?
>

Please tell me how I can set up a public project on github where project 
A contains module X, so that Joe Average User can clone A, make a change 
in the module X and send a simple pull request to get that change into 
A.   The change is one that's inappropriate to push upstream to X 
without additional work, but is appropriate for A at this point in time. 
  Joe's a beginning git user.

That's actually a simple use case compared to others I've run into.

Bryan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 15:10                 ` Jens Lehmann
  2010-07-23 16:05                   ` Bryan Larsen
@ 2010-07-23 22:32                   ` Avery Pennarun
  2010-07-25 19:57                     ` Jens Lehmann
  1 sibling, 1 reply; 58+ messages in thread
From: Avery Pennarun @ 2010-07-23 22:32 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds, Heiko Voigt

On Fri, Jul 23, 2010 at 11:10 AM, Jens Lehmann <Jens.Lehmann@web.de> wrote:
> You forgot what we do as best practice at work:
>
> [3] Fork the gem repos on github (or another server reachable by your
>    co-workers) and use those, so you don't have to change the URL
>    later:
>
>    git://github.com/apenwarrrubygems/gem[1..n]
>
> Your problems go away, setup has to be done only once on project
> start and not for every developer, you can use your own branchnames
> and you have a staging repo from where you can push patches upstream
> if necessary.

Now all your fellow developers have to push their submodule code to a
single upstream repo?  That's rather centralized and un-git-like.

For the rest, Bryan Larsen answered this one well, and I agree with him.

>> Surely including *repository URLs* inside the *repository content* is
>> at least as bad as including branch names.  If we're going to do one,
>> we might as well do the other.  But it won't help, because the stored
>> branch name will probably be 'master', and my personal hacked-up copy
>> of gem13 shouldn't be on a branch named master anyway.
>
> You sure are aware that having a branch name associated with a
> submodule checkout is a request repeatedly made?

Of course it is; I requested it myself.  Then, two years later after
thinking about the problem a lot and writing git-subtree out of
frustration, I realized that even if this feature existed, it wouldn't
help at all.

If you use git-submodule, you must push your submodule commits
separately or the supermodule is broken for everybody but you.  To
push a submodule, you need a) an upstream to push to and b) a branch
name.  It's easy to forget to create a branch name, so of course
people request that feature.

However, the real problem is "you must push your submodule commits
separately."  Fix that, and I can guarantee that the request for
submodule branch naming will disappear.

> That is just one example. Another one is code shared between
> different repos (think: libraries) where you want to make sure that
> a bugfix in the library made in project A will make it to the shared
> code repo and thus doesn't have to be fixed again by projects B to X.
> This was one of the reasons we preferred submodules over subtrees
> in our evaluation, because there is no incentive to push fixes inside
> the subtree back to its own repo like there is when using submodules.

I think you'd like svn; it's pretty cool.  All changes made to a
project need to get pushed to a central upstream repo so you never
forget to share them.

>>> rebase and merge needs separate    | rebase and merge works normally
>>> work in submodule currently        |
>>
>> True.
>
> Nope, there is a patch in pu doing
> that when it is a simple fast forward
> and giving you advice when both sides
> are already merged inside the submodule
> (CCed Heiko, because he is the author
> of that feature)

Fast forwards are not merges, and pu is not now.

> It is the /commits/ that have to be
> done twice, once in the submodule and
> then in the superproject. (But that is
> not necessarily bad, imagine having git
> gui as a submodule: you would be
> automagically reminded that stuff for
> git gui should be sent somewhere else
> than to Junio).

Yup, I agree that requiring a separate commit to the submodule repo is
not a bad idea.  I always do this anyway even when using git-subtree,
because I'm thinking ahead to the day when I'll push my submodule
changes upstream and I want my commit message to make sense.  But
that's because I think ahead like that.  Having the tool force me to
do it would be harmless and help people avoid mistakes.

The syntax for it ought to be nice though.  I should be able to do:

    git commit -- path/to/submodule

And have it commit everything in the submodule tree as a new commit in
the submodule.  I don't want to have to think about cd'ing to
path/to/submodule just so I can commit the files I changed in there.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 15:11                     ` Jens Lehmann
@ 2010-07-23 22:33                       ` Avery Pennarun
  0 siblings, 0 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-23 22:33 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Chris Webb, Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 11:11 AM, Jens Lehmann <Jens.Lehmann@web.de> wrote:
> Am 23.07.2010 10:40, schrieb Avery Pennarun:
>> I think it would be better if we could abandon .gitmodules entirely;
>> it's really only useful for listing repository URLs, and listing
>> repository URLs is a major part of the problem.
>
> Then where do you get the URL to clone the submodule from on "git
> clone --recursive"?

If you're asking that question, you're missing my point entirely.  In
my proposed model, the submodule objects are all in the same repo as
the superproject, so there *is* no separate URL.  And thus there is no
more need for .gitmodules.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Avery Pennarun's git-subtree?
  2010-07-23 15:19                 ` Marc Branchaud
@ 2010-07-23 22:50                   ` Avery Pennarun
  2010-07-24  0:58                     ` skillzero
                                       ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-23 22:50 UTC (permalink / raw)
  To: Marc Branchaud
  Cc: Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 11:19 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
> On 10-07-22 03:41 PM, Avery Pennarun wrote:
>> 1) Sometimes I want to clone only some subdirs of a project
>> 2) Sometimes I don't want the entire history because it's too big.
>> 3) Super huge git repositories start to degrade in performance.
>
> The reason we turned to submodules is precisely to deal with repository size.

I believe that's very common.

However, I wonder whether that's actually a good reason for git to
develop better submodules, or actually just a good reason for git to
get better support for handling huge repositories.

My bup project (http://github.com/apenwarr/bup) is all about huge
repositories.  It handles repositories with hundreds of gigabytes, and
trees containing millions of files (entire filesystems), quite nicely.
 Of course, it's not a version control system, so it won't solve your
problems.  It's just evidence that large repositories are actually
quite manageable without changing the fundamentals of git.

>  Our code base encompasses the entire FreeBSD tree plus different versions of
> the Linux kernel, along with various third-party libraries & apps.  You don't
> need everything to build a given product (a FreeBSD product doesn't use any
> Linux kernels, for example) but because all the products share common code we
> need to be able to branch and tag the common code along with the uncommon code.

Honest question: do you care about the wasted disk space and download
time for these extra files?  Or just the fact that git gets slow when
you have them?

How people answer that question very much affects the way git should
be designed.

> So a straight "git clone" that would need to fetch all of FreeBSD plus 4
> different Linux kernels and check all that out is a major problem, especially
> for our automated build system (which could definitely be implemented better,
> but still).

To be absolutely pedantic, the four linux kernels likely share most of
their objects and so you're only paying the cost (at least during
fetch) of including it once :)

(If you're actually using git-submodule and each copy of the kernel is
its own module, then it might be cloning the kernel four times
separately, in which case the objects *don't* get shared, so this ends
up being much more expensive than it should be.  That could be fixed
by slightly improving git-submodule to share some objects rather than
rearchitecting it though.)

> In truth it's the checkout that takes the most time by far,
> though commands like git-status also take inconveniently long.

Yeah, git could stand to be optimized a bit here.  And since Windows
stats files about 10x slower than Linux, this problem occurs about 10x
sooner on Windows, which makes using git on Windows (which sadly I
have to do sometimes) extremely painful compared to Linux.

IMHO, the correct answer here is to have an inotify-based daemon prod
at the .git/index automatically when files get updated, so that git
itself doesn't have to stat/readdir through the entire tree in order
to do any of its operations.  (Windows also has something like inotify
that would work.)  If you had this, then git
status/diff/checkout/commit would be just as fast with zillions of
files as with 10 files.  Sooner or later, if nobody implements this, I
promise I'll get around to it since inotify is actually easy to code
for :)
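
To make the idea concrete, here is a small Python sketch of the bookkeeping
such a daemon would do. It does not use inotify at all: the `on_event`
method stands in for whatever the kernel notification delivers, and the
class name `ChangeTracker` and the `is_modified` callback are hypothetical.
The only thing it tries to show is that the status cost depends on the
number of touched paths rather than the size of the tree.

```python
from typing import Callable, List, Set


class ChangeTracker:
    """Remember which paths were touched since the last status check."""

    def __init__(self) -> None:
        self._dirty: Set[str] = set()

    def on_event(self, path: str) -> None:
        # Called once per filesystem notification (simulated here).
        self._dirty.add(path)

    def status(self, is_modified: Callable[[str], bool]) -> List[str]:
        # Examine only the touched paths instead of walking the whole tree,
        # then forget them so the next call starts from a clean slate.
        changed = sorted(p for p in self._dirty if is_modified(p))
        self._dirty.clear()
        return changed


if __name__ == "__main__":
    checked = []

    def fake_is_modified(path: str) -> bool:
        checked.append(path)          # record how many paths we examined
        return path.endswith(".c")    # pretend only .c files really changed

    tracker = ChangeTracker()
    tracker.on_event("kernel/sched.c")
    tracker.on_event("README")
    print(tracker.status(fake_is_modified), len(checked))
```

A real implementation would also have to cope with missed or overflowed
notifications, which this sketch ignores.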

Also note that the only reason submodules are faster here is that
they're ignoring possibly important changes.  Notably, when you do
'git status' from the top level, it won't warn you if you have any
not-yet-committed files in any of your submodules.  Personally, I
consider that to be really important information, but to obtain it
would make 'git status' take just as long as without submodules, so
you wouldn't get any benefit.  (I think nowadays there's a way to get
this recursive status information if you want it, but it'll be slow of
course.)

> We chose git-submodule over git-subtree mainly because git-submodule lets us
> selectively checkout different parts of our code.  (AFAIK sparse checkouts
> aren't yet an option.)

Fair enough.  If you could confirm or deny my theory that this is
*entirely* a performance related concern (as opposed to disk space /
download time), that would be helpful.

> We didn't really consider git-subtree because it's
> not an official part of git, and we didn't want to have to teach (and nag)
> all our developers to install and maintain it in addition to keeping up with
> git itself.

Arguably, this is a vote for including git-subtree into the core
(which was Bryan's point when he started this thread); it obviously is
being rejected sometimes by git users simply because it's not in the
core, even though it could help them.

Have fun,

Avery


* Re: Avery Pennarun's git-subtree?
  2010-07-23 22:50                   ` Avery Pennarun
@ 2010-07-24  0:58                     ` skillzero
  2010-07-24  1:20                       ` Avery Pennarun
  2010-07-26  8:56                       ` Jakub Narebski
  2010-07-24 20:07                     ` Sverre Rabbelier
                                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 58+ messages in thread
From: skillzero @ 2010-07-24  0:58 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Marc Branchaud, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:

> Honest question: do you care about the wasted disk space and download
> time for these extra files?  Or just the fact that git gets slow when
> you have them?

I have a similar situation to the original poster (huge trees) and
for me it's all three: disk space, download time, and performance. My
tree has a few relatively small (< 20 MB) shared directories of common
code, a few large (2-6 GB) directories of code for OS's, and then
several medium size (< 500 MB) directories for application code. The
application developers only care about the app+shared directories (and
are very annoyed by the massive space and performance impact of the OS
directories). The firmware-only developers only care about OS+shared
and are mildly annoyed by the medium space and performance impact of
the app directories. I work on all of the pieces, but even I would
prefer to have things separated so when I work on the apps, git
status/etc doesn't take a big hit for close to a million files in the
OS directories (particularly when doing git status on Windows). Even
when using the -uno option to git status, it's still pretty slow (over
a minute).

git-submodule might be technically possible in this situation, but
having to commit and push each submodule and then commit and push the
super module makes it slightly worse than just dealing with the
space/download/performance issues of one huge repository.

git-subtree could also possibly help, but there's still extra work to
split and merge each repository. And I'm not sure how it handles
commit IDs across the repositories because I want to be able to say "I
fixed that bug in shared/code.c in commit abc123" and have both the
OS+shared and the apps+shared people be able to git log abc123 and see
the same change (and merge/cherry-pick/etc.).

I think what I want is a way to do a sparse checkout where some sort
of module is maintained in the git repository (probably just an
INI-style file with paths) so I can clone directly from the server and
it figures out the objects I need for the full history of only
apps+shared (or firmware+shared, etc.) on the server side and only
sends those objects. I still want to be able to branch, tag, and refer
to commit IDs. So I only take the space/download/performance hit of
directories included in the module, but I don't have to manually
maintain that view of the repository (as I do with git-submodule and
git-subtree).

The closest thing to that so far for me has been the sparse checkout
support added in git 1.7 combined with a convenience script I wrote.
Everyone still has a huge download and .git directory, but at least
the working copy is limited to the paths specified in the module so
git status isn't super slow (although just having all those objects in
the .git directory still slows it down quite a bit).
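
As a rough illustration of such a convenience script (not the actual
one mentioned above, which is not shown in this thread), the sketch
below turns a made-up INI-style module file into sparse-checkout
patterns.  The section and key names ("module", "paths") are
assumptions.  The resulting lines would go into
.git/info/sparse-checkout with core.sparseCheckout enabled:

```python
import configparser

# Hypothetical module definitions; the layout of this file is an assumption.
EXAMPLE = """
[module "apps"]
paths =
    shared/
    apps/

[module "firmware"]
paths =
    shared/
    os/
"""


def sparse_patterns(ini_text, module):
    """Return the sparse-checkout patterns for one named module."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get('module "%s"' % module, "paths")
    # One pattern per non-empty line, anchored at the top of the work tree.
    return ["/" + line.strip().lstrip("/")
            for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    print("\n".join(sparse_patterns(EXAMPLE, "apps")))
```

This only limits the working copy; every object is still fetched and
stored in .git, which is the limitation described above.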


* Re: Avery Pennarun's git-subtree?
  2010-07-24  0:58                     ` skillzero
@ 2010-07-24  1:20                       ` Avery Pennarun
  2010-07-24 19:40                         ` skillzero
  2010-07-26 16:37                         ` Marc Branchaud
  2010-07-26  8:56                       ` Jakub Narebski
  1 sibling, 2 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-24  1:20 UTC (permalink / raw)
  To: skillzero
  Cc: Marc Branchaud, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 8:58 PM,  <skillzero@gmail.com> wrote:
> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> Honest question: do you care about the wasted disk space and download
>> time for these extra files?  Or just the fact that git gets slow when
>> you have them?
>
> I have a similar situation to the original poster (huge trees) and
> for me it's all three: disk space, download time, and performance. My
> tree has a few relatively small (< 20 MB) shared directories of common
> code, a few large (2-6 GB) directories of code for OS's, and then
> several medium size (< 500 MB) directories for application code. The
> application developers only care about the app+shared directories (and
> are very annoyed by the massive space and performance impact of the OS
> directories).

Given how cheap disk space is nowadays, I'm curious about this.  Are
they really just annoyed by the performance problem, and they complain
about the extra size because they blame the performance on the extra
files?  Or are they honestly short of disk space?

Similarly, are all your developers located at the same office?  If so,
then bandwidth ought not be an issue.

I'm pushing extra hard on this because I believe there are lots of
opportunities to just improve git performance on huge repositories.
And if the only *real* reason people need to split repositories is
that performance goes down, then that's fixable, and you may need
neither git-submodule nor git-subtree.

> I work on all of the pieces, but even I would
> prefer to have things separated so when I work on the apps, git
> status/etc doesn't take a big hit for close to a million files in the
> OS directories (particularly when doing git status on Windows). Even
> when using the -uno option to git status, it's still pretty slow (over
> a minute).

This is indeed a problem with large repositories.  Of course,
splitting them with git-submodule is kind of cheating, because it just
makes git-status *not look* to see if those files are dirty or not.
If they are dirty and you forget to commit them, you'll never know
until someone tells you later.  It would be functionally equivalent to
just have git-status not look inside certain subdirs of a single
repository.

In any case, this is a pretty clear optimization target (especially
since Windows is so amazingly slow at statting files): just have a
daemon running inotify (or the Windows equivalent) that tracks whether
files are up-to-date or not.  Then git would never need to recurse
through the entire tree, and operations like status, diff, checkout,
and commit could be fast even with a million-file repository.

> git-subtree could also possibly help, but there's still extra work to
> split and merge each repository. And I'm not sure how it handles
> commit IDs across the repositories because I want to be able to say "I
> fixed that bug in shared/code.c in commit abc123" and have both the
> OS+shared and the apps+shared people be able to git log abc123 and see
> the same change (and merge/cherry-pick/etc.).

git-subtree (if you don't use --squash) keeps all the commit IDs.  It
is extra work to split and merge between repositories, though.  It
doesn't solve your repository-is-too-large problem.

> I think what I want is a way to do a sparse checkout where some sort
> of module is maintained in the git repository (probably just an
> INI-style file with paths) so I can clone directly from the server and
> it figures out the objects I need for the full history of only
> apps+shared (or firmware+shared, etc.) on the server side and only
> sends those objects. I still want to be able to branch, tag, and refer
> to commit IDs. So I only take the space/download/performance hit of
> directories included in the module, but I don't have to manually
> maintain that view of the repository (as I do with git-submodule and
> git-subtree).

Yes, better sparse checkout and sparse fetch would be very valuable
here and would eliminate a lot of the reasons people have for misusing
submodules.

> (although just having all those objects in
> the .git directory still slows it down quite a bit).

You're the second person who has mentioned this today (the first one
was to me in a private email).  I'd like to understand this better.

In my bup project (http://github.com/apenwarr/bup) we regularly create
git repositories with hundreds of gigabytes of packs, comprising tens
or hundreds of millions of objects, and the repository doesn't get
slow.  (Obviously this is a separate issue from having a huge work
tree with a million files in it.)  In repositories this thoroughly
huge, we did find a way to improve memory usage versus git's pack .idx
files (bup has '.midx' files that combine multiple indexes into one,
thus reducing the binary search steps).  But this only matters when
you get well over 10 gigabytes of stuff and you're wading through it
using crappy python code (as bup does) and frequently inserting a
million objects at a time (as bup does).  The git usage pattern is
much simpler and therefore faster.
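
To make the "binary search steps" point concrete, here is a small
sketch.  It does not model bup's real .midx format or git's .idx
format; plain sorted lists of integers stand in for sorted object IDs,
and the sizes are made up:

```python
import bisect
import math


def lookup(sorted_ids, key):
    """Binary search one sorted index; True if key is present."""
    i = bisect.bisect_left(sorted_ids, key)
    return i < len(sorted_ids) and sorted_ids[i] == key


def worst_case_steps(sizes):
    """Worst-case binary search steps when every index must be probed."""
    return sum(math.ceil(math.log2(n + 1)) for n in sizes)


# 16 separate indexes of 1000 IDs each (illustrative numbers only).
indexes = [list(range(i, 16000, 16)) for i in range(16)]
# One combined index holding the same IDs.
merged = sorted(x for idx in indexes for x in idx)

separate = worst_case_steps(len(idx) for idx in indexes)  # probe all 16
combined = worst_case_steps([len(merged)])                # probe just one
```

With these numbers a miss costs 16 searches of about 10 steps each
against the separate indexes, versus a single search of about 14 steps
against the combined one.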

How big is your .git directory and what performance problems do you
see?  I assume you've done 'git gc' to clean up all the loose objects,
right?

Have fun,

Avery


* Re: Avery Pennarun's git-subtree?
  2010-07-24  1:20                       ` Avery Pennarun
@ 2010-07-24 19:40                         ` skillzero
  2010-07-25  1:47                           ` Nguyen Thai Ngoc Duy
  2010-07-26 13:13                           ` Jakub Narebski
  2010-07-26 16:37                         ` Marc Branchaud
  1 sibling, 2 replies; 58+ messages in thread
From: skillzero @ 2010-07-24 19:40 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Marc Branchaud, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Fri, Jul 23, 2010 at 6:20 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Fri, Jul 23, 2010 at 8:58 PM,  <skillzero@gmail.com> wrote:
>> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>>> Honest question: do you care about the wasted disk space and download
>>> time for these extra files?  Or just the fact that git gets slow when
>>> you have them?
>>
>> I have a similar situation to the original poster (huge trees) and
>> for me it's all three: disk space, download time, and performance. My
>> tree has a few relatively small (< 20 MB) shared directories of common
>> code, a few large (2-6 GB) directories of code for OS's, and then
>> several medium size (< 500 MB) directories for application code. The
>> application developers only care about the app+shared directories (and
>> are very annoyed by the massive space and performance impact of the OS
>> directories).
>
> Given how cheap disk space is nowadays, I'm curious about this.  Are
> they really just annoyed by the performance problem, and they complain
> about the extra size because they blame the performance on the extra
> files?  Or are they honestly short of disk space?

I think it's both space and performance. When you're using SSD drives,
storage is still pretty expensive. A 128 GB or smaller SSD is pretty common
in a laptop, so you can run out pretty quickly, especially when you're
working on a few different branches at the same time.
It's useful to keep multiple working copies (e.g. git-new-workdir)
because rebuild time can be significant when switching branches.

> Similarly, are all your developers located at the same office?  If so,
> then bandwidth ought not be an issue.

Bandwidth isn't a big problem because you don't need to re-download
the repo very often. However, people work at home a lot where
bandwidth is more limited. The biggest complaint I hear about
bandwidth is that people tend to re-download when something goes wrong
(i.e. inexperience with git resulting in a repository they can't
recover due to git resets, etc).

> I'm pushing extra hard on this because I believe there are lots of
> opportunities to just improve git performance on huge repositories.
> And if the only *real* reason people need to split repositories is
> that performance goes down, then that's fixable, and you may need
> neither git-submodule nor git-subtree.

Performance degradation is my biggest complaint with large
repositories. Your inotify/FSEvents/etc daemon idea sounds interesting
to deal with the stat issue.

> This is indeed a problem with large repositories.  Of course,
> splitting them with git-submodule is kind of cheating, because it just
> makes git-status *not look* to see if those files are dirty or not.
> If they are dirty and you forget to commit them, you'll never know
> until someone tells you later.  It would be functionally equivalent to
> just have git-status not look inside certain subdirs of a single
> repository.

I think it's only cheating if you're using all of the submodules. The
main purpose of submodules for me (although I don't currently use
submodules) would be so I don't need to keep modules on disk that I
don't care about. If a developer is working on an app, they don't need
the OS directories/modules so they get much faster git status/etc and
there wouldn't be other directories to have dirty files in. That said,
if I was using git submodule, I'd want git status to show me all the
submodules that were checked out.

>> (although just having all those objects in
>> the .git directory still slows it down quite a bit).
>
> You're the second person who has mentioned this today (the first one
> was to me in a private email).  I'd like to understand this better.

What I'm basing this on is that even when I'm using a sparse checkout
such that I have only a small subset of the files in my working
directory, git status seems significantly slower for me than an
equivalent git repository that only has that subset of files. That's
not very scientific, but that's what made me think just having a large
.git directory with lots of objects/history slows down git status even
if the working copy doesn't have a lot of files.

I will try to experiment and see if I can narrow it down with some real numbers.

BTW...what's the policy on CC'ing people on git mailing list replies?
Should it be trimmed or not? I've received complaints in the past, but
I was never really clear what the recommended policy is.


* Re: Avery Pennarun's git-subtree?
  2010-07-23 22:50                   ` Avery Pennarun
  2010-07-24  0:58                     ` skillzero
@ 2010-07-24 20:07                     ` Sverre Rabbelier
  2010-07-26  8:51                     ` Jakub Narebski
  2010-07-26 15:15                     ` Marc Branchaud
  3 siblings, 0 replies; 58+ messages in thread
From: Sverre Rabbelier @ 2010-07-24 20:07 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Marc Branchaud, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

Heya,

On Fri, Jul 23, 2010 at 17:50, Avery Pennarun <apenwarr@gmail.com> wrote:
> IMHO, the correct answer here is to have an inotify-based daemon prod
> at the .git/index automatically when files get updated, so that git
> itself doesn't have to stat/readdir through the entire tree in order
> to do any of its operations.  (Windows also has something like inotify
> that would work.)  If you had this, then git
> status/diff/checkout/commit would be just as fast with zillions of
> files as with 10 files.  Sooner or later, if nobody implements this, I
> promise I'll get around to it since inotify is actually easy to code
> for :)

From what I've heard both SVN and Mercurial have something like that
and it's incredibly unstable and icky and nasty and bad and will eat
your babies. Then again, I don't have any experience with inotify, so
if you say that it's all good and awesome, who am I to doubt that :).

-- 
Cheers,

Sverre Rabbelier


* Re: Avery Pennarun's git-subtree?
  2010-07-22 18:23               ` Bryan Larsen
@ 2010-07-24 22:36                 ` Jakub Narebski
  0 siblings, 0 replies; 58+ messages in thread
From: Jakub Narebski @ 2010-07-24 22:36 UTC (permalink / raw)
  To: Bryan Larsen
  Cc: Avery Pennarun, Jens Lehmann, git, Junio C Hamano,
	Linus Torvalds, Ævar Arnfjörð Bjarmason

On Thursday, 22 July 2010 20:23, Bryan Larsen wrote:
> >
> > Using git-subtree has its warts too: I don't think for example that there is
> > a way to get a log _automatically excluding_ history subtree-merged
> > subprojects.  Or is it there?
> >
> 
> It works exactly right for me when I used git-subtree in "squashed" 
> mode.  Changes which were done in tree show up separately in the log, 
> changes which were pulled in via git-subtree pull show up as a single 
> summary entry in the log.
> 
> This discussion has been about how to improve git submodules, which is
> sorely needed.  However, it's quite clear that git submodules will
> never work as well as git subtrees in certain quite common situations.
> If fixed, git submodules will be more appropriate in other situations.
> However, I'm not asking to remove git submodules or prevent anybody
> from fixing them, I'm just asking that git subtree be merged.
> 
> Does anybody actually oppose the merger of git-subtree, which has (at 
> least) hundreds of users despite its out-of-tree status?

I am very much *for* merging git-subtree into git core.  It is not that
much different from e.g. "git submodule" or "git remote" porcelain
commands.

-- 
Jakub Narebski
Poland


* Re: Avery Pennarun's git-subtree?
  2010-07-24 19:40                         ` skillzero
@ 2010-07-25  1:47                           ` Nguyen Thai Ngoc Duy
  2010-07-28 22:27                             ` Jakub Narebski
  2010-07-26 13:13                           ` Jakub Narebski
  1 sibling, 1 reply; 58+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-25  1:47 UTC (permalink / raw)
  To: skillzero
  Cc: Avery Pennarun, Marc Branchaud, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Sun, Jul 25, 2010 at 5:40 AM,  <skillzero@gmail.com> wrote:
>>> (although just having all those objects in
>>> the .git directory still slows it down quite a bit).
>>
>> You're the second person who has mentioned this today (the first one
>> was to me in a private email).  I'd like to understand this better.
>
> What I'm basing this on is that even when I'm using a sparse checkout
> such that I have only a small subset of the files in my working
> directory, git status seems significantly slower for me than an
> equivalent git repository that only has that subset of files. That's
> not very scientific, but that's what made me think just having a large
> .git directory with lots of objects/history slows down git status even
> if the working copy doesn't have a lot of files.

Hmm... I recall I experienced some slower operations on webkit with
sparse checkout too.

>
> I will try to experiment and see if I can narrow it down with some real numbers.

Yes, I'd appreciate that.

By the way, how hard is it to use git-replace to implement narrow clone?
-- 
Duy


* Re: Avery Pennarun's git-subtree?
  2010-07-23 22:32                   ` Avery Pennarun
@ 2010-07-25 19:57                     ` Jens Lehmann
  2010-07-27 18:40                       ` Avery Pennarun
  0 siblings, 1 reply; 58+ messages in thread
From: Jens Lehmann @ 2010-07-25 19:57 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds, Heiko Voigt

On 24.07.2010 00:32, Avery Pennarun wrote:
> On Fri, Jul 23, 2010 at 11:10 AM, Jens Lehmann <Jens.Lehmann@web.de> wrote:
>> You forgot what we do as best practice at work:
>>
>> [3] Fork the gem repos on github (or another server reachable by your
>>    co-workers) and use those, so you don't have to change the URL
>>    later:
>>
>>    git://github.com/apenwarrrubygems/gem[1..n]
>>
>> Your problems go away, setup has to be done only once on project
>> start and not for every developer, you can use your own branchnames
>> and you have a staging repo from where you can push patches upstream
>> if necessary.
> 
> Now all your fellow developers have to push their submodule code to a
> single upstream repo?  That's rather centralized and un-git-like.

But isn't that exactly the same thing you would have to do for your
superproject too to be able to push your changes for your fellows?


>> It is the /commits/ that have to be
>> done twice, once in the submodule and
>> then in the superproject. (But that is
>> not necessarily bad, imagine having git
>> gui as a submodule: you would be
>> automagically reminded that stuff for
>> git gui should be sent somewhere else
>> than to Junio).
> 
> Yup, I agree that requiring a separate commit to the submodule repo is
> not a bad idea.  I always do this anyway even when using git-subtree,
> because I'm thinking ahead to the day when I'll push my submodule
> changes upstream and I want my commit message to make sense.  But
> that's because I think ahead like that.  Having the tool force me to
> do it would be harmless and help people avoid mistakes.

And submodules force you to do that.


> The syntax for it ought to be nice though.  I should be able to do:
> 
>     git commit -- path/to/submodule
> 
> And have it commit everything in the submodule tree as a new commit in
> the submodule.  I don't want to have to think about cd'ing to
> path/to/submodule just so I can commit the files I changed in there.

Yes, that would be a nice feature (assuming you have a branch in the
submodule to commit these changes to ;-).


* Re: Avery Pennarun's git-subtree?
  2010-07-23 22:50                   ` Avery Pennarun
  2010-07-24  0:58                     ` skillzero
  2010-07-24 20:07                     ` Sverre Rabbelier
@ 2010-07-26  8:51                     ` Jakub Narebski
  2010-07-27 19:15                       ` Avery Pennarun
  2010-07-26 15:15                     ` Marc Branchaud
  3 siblings, 1 reply; 58+ messages in thread
From: Jakub Narebski @ 2010-07-26  8:51 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Sat, 24 Jul 2010 00:50, Avery Pennarun wrote:
> On Fri, Jul 23, 2010 at 11:19 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
>> On 10-07-22 03:41 PM, Avery Pennarun wrote:
>>> 1) Sometimes I want to clone only some subdirs of a project
>>> 2) Sometimes I don't want the entire history because it's too big.
>>> 3) Super huge git repositories start to degrade in performance.
>>
>> The reason we turned to submodules is precisely to deal with repository size.
> 
> I believe that's very common.
> 
> However, I wonder whether that's actually a good reason for git to
> develop better submodules, or actually just a good reason for git to
> get better support for handling huge repositories.
> 
> My bup project (http://github.com/apenwarr/bup) is all about huge
> repositories.  It handles repositories with hundreds of gigabytes, and
> trees containing millions of files (entire filesystems), quite nicely.
>  Of course, it's not a version control system, so it won't solve your
> problems.  It's just evidence that large repositories are actually
> quite manageable without changing the fundamentals of git.

There is also git-bigfiles project, although it is more about large
[binary] files than large repositories per se (many files, long history).

Note that with 'bup' you might not see problems with large repositories
because it does not examine code paths that are slow in large repositories
(gc, log, path-delimited log).

>>  Our code base encompasses the entire FreeBSD tree plus different versions of
>> the Linux kernel, along with various third-party libraries & apps.  You don't
>> need everything to build a given product (a FreeBSD product doesn't use any
>> Linux kernels, for example) but because all the products share common code we
>> need to be able to branch and tag the common code along with the uncommon code.

Sidenote: I have noticed here a very important ability of submodules, which
git-subtree lacks, or at least doesn't have directly, namely the ability
to tag a submodule separately from tagging the superproject as a whole (so e.g.
superproject v1.6.2 includes subproject 'foo' v0.99, which is the foo/v0.99
tag in the superproject).
 
>> So a straight "git clone" that would need to fetch all of FreeBSD plus 4
>> different Linux kernels and check all that out is a major problem, especially
>> for our automated build system (which could definitely be implemented better,
>> but still).
> 
> To be absolutely pedantic, the four linux kernels likely share most of
> their objects and so you're only paying the cost (at least during
> fetch) of including it once :)
> 
> (If you're actually using git-submodule and each copy of the kernel is
> its own module, then it might be cloning the kernel four times
> separately, in which case the objects *don't* get shared, so this ends
> up being much more expensive than it should be.  That could be fixed
> by slightly improving git-submodule to share some objects rather than
> rearchitecting it though.)

This issue is orthogonal to the fact of using submodules, it is a matter
of setting up alternates to share object storage.
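
For reference, an alternate is just a line in the repository's
objects/info/alternates file naming another object directory to borrow
from.  The sketch below sets one up by hand; the function name and the
example paths are hypothetical, and "git clone --reference" is the
usual way to get the same effect at clone time:

```python
import os


def add_alternate(repo_git_dir, shared_objects_dir):
    """Append shared_objects_dir to the repo's objects/info/alternates."""
    info_dir = os.path.join(repo_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates = os.path.join(info_dir, "alternates")

    existing = []
    if os.path.exists(alternates):
        with open(alternates) as f:
            existing = [line.rstrip("\n") for line in f]

    # Absolute paths avoid surprises about what a relative path is
    # resolved against.
    target = os.path.abspath(shared_objects_dir)
    if target not in existing:
        existing.append(target)

    with open(alternates, "w") as f:
        f.write("".join(line + "\n" for line in existing))
    return alternates


if __name__ == "__main__":
    # Hypothetical layout: four kernel checkouts sharing one object store.
    for name in ("linux-a", "linux-b", "linux-c", "linux-d"):
        add_alternate(os.path.join(name, ".git"), "shared-kernel.git/objects")
```

A repository that borrows objects this way depends on the shared store
staying intact, so pruning objects there needs care.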
 
>> In truth it's the checkout that takes the most time by far,
>> though commands like git-status also take inconveniently long.
> 
> Yeah, git could stand to be optimized a bit here.  And since Windows
> stats files about 10x slower than Linux, this problem occurs about 10x
> sooner on Windows, which makes using git on Windows (which sadly I
> have to do sometimes) extremely painful compared to Linux.
> 
> IMHO, the correct answer here is to have an inotify-based daemon prod
> at the .git/index automatically when files get updated, so that git
> itself doesn't have to stat/readdir through the entire tree in order
> to do any of its operations.  (Windows also has something like inotify
> that would work.)  If you had this, then git
> status/diff/checkout/commit would be just as fast with zillions of
> files as with 10 files.  Sooner or later, if nobody implements this, I
> promise I'll get around to it since inotify is actually easy to code
> for :)

IIUC the problem is that inotify is not automatically recursive, so the
daemon would have to take care of adding an inotify watch to each newly
created subdirectory.
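
A sketch of the bookkeeping this implies, with a plain set standing in
for the real inotify instance (WatchRegistry and the function names are
invented for illustration, and no actual inotify calls are made):

```python
import os


class WatchRegistry:
    """Stand-in for an inotify instance: just records watched directories."""

    def __init__(self):
        self.watched = set()

    def add_watch(self, directory):
        self.watched.add(os.path.abspath(directory))


def watch_tree(registry, root):
    """Register root and every directory below it, one watch per directory."""
    for dirpath, dirnames, _filenames in os.walk(root):
        # Do not descend into git's own metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        registry.add_watch(dirpath)


def on_directory_created(registry, new_dir):
    """Handle a 'directory created' event by watching the new subtree.

    The new directory may already contain subdirectories by the time the
    event is handled (e.g. after 'mkdir -p a/b/c'), so walk it as well.
    """
    watch_tree(registry, new_dir)
```

The window between a directory appearing and its watch being added is
where such a daemon can miss events, so a real implementation would
probably need to rescan the new subtree after adding the watch.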

> Also note that the only reason submodules are faster here is that
> they're ignoring possibly important changes.  Notably, when you do
> 'git status' from the top level, it won't warn you if you have any
> not-yet-committed files in any of your submodules.  Personally, I
> consider that to be really important information, but to obtain it
> would make 'git status' take just as long as without submodules, so
> you wouldn't get any benefit.  (I think nowadays there's a way to get
> this recursive status information if you want it, but it'll be slow of
> course.)

Errr... didn't that get improved in recent git?  I think git-status now
includes information about submodules if configured so / unless configured
otherwise.  Doesn't it?

>> We chose git-submodule over git-subtree mainly because git-submodule lets us
>> selectively checkout different parts of our code.  (AFAIK sparse checkouts
>> aren't yet an option.)

Sparse checkouts are here, IIRC, but they do not solve the problem of disk
space (the objects are still in the repository, even if not checked out), or
of speed (they still need to be fetched, even if not checked out).

> Fair enough.  If you could confirm or deny my theory that this is
> *entirely* a performance related concern (as opposed to disk space /
> download time), that would be helpful.
> 
>> We didn't really consider git-subtree because it's
>> not an official part of git, and we didn't want to have to teach (and nag)
>> all our developers to install and maintain it in addition to keeping up with
>> git itself.
> 
> Arguably, this is a vote for including git-subtree into the core
> (which was Bryan's point when he started this thread); it obviously is
> being rejected sometimes by git users simply because it's not in the
> core, even though it could help them.

Well, patch management interfaces such as StGIT, Guilt and TopGit are
also outside git core (and should be), same with GUI tools such as qgit.
That shouldn't prevent people from using them ;-)

But I am all for having git-subtree in core: we have git-remote, haven't
we?  Besides git-subtree fits some workflows better than git-submodule
(and vice versa).

-- 
Jakub Narebski
Poland


* Re: Avery Pennarun's git-subtree?
  2010-07-24  0:58                     ` skillzero
  2010-07-24  1:20                       ` Avery Pennarun
@ 2010-07-26  8:56                       ` Jakub Narebski
  2010-07-27 18:36                         ` Avery Pennarun
  1 sibling, 1 reply; 58+ messages in thread
From: Jakub Narebski @ 2010-07-26  8:56 UTC (permalink / raw)
  To: skillzero
  Cc: Avery Pennarun, Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Sat, Jul 24, 2010, skillzero@gmail.com wrote:
> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> 
> > Honest question: do you care about the wasted disk space and download
> > time for these extra files?  Or just the fact that git gets slow when
> > you have them?
> 
> I have a similar situation to the original poster (huge trees) and
> for me it's all three: disk space, download time, and performance. My
> tree has a few relatively small (< 20 MB) shared directories of common
> code, a few large (2-6 GB) directories of code for OS's, and then
> several medium size (< 500 MB) directories for application code. The
> application developers only care about the app+shared directories (and
> are very annoyed by the massive space and performance impact of the OS
> directories). The firmware-only developers only care about OS+shared
> and are mildly annoyed by the medium space and performance impact of
> the app directories. I work on all of the pieces, but even I would
> prefer to have things separated so when I work on the apps, git
> status/etc doesn't take a big hit for close to a million files in the
> OS directories (particularly when doing git status on Windows). Even
> when using the -uno option to git status, it's still pretty slow (over
> a minute).
> 
> git-submodule might be technically possible in this situation, but
> having to commit and push each submodule and then commit and push the
> super module makes it slightly worse than just dealing with the
> space/download/performance issues of one huge repository.

But this is just a matter of improving the UI for dealing with submodules,
isn't it?  For example, having "git commit --recursive" would help
with 'having to commit each submodule', though it is not clear how you would
write commit messages then: perhaps the supermodule commit message could by
default be composed out of the submodule commits (if any).  "git push
--recursive" (or some support for push in "git remote") would help with
'having to push each submodule'.

Isn't it?
-- 
Jakub Narebski
Poland


* Re: Avery Pennarun's git-subtree?
  2010-07-24 19:40                         ` skillzero
  2010-07-25  1:47                           ` Nguyen Thai Ngoc Duy
@ 2010-07-26 13:13                           ` Jakub Narebski
  1 sibling, 0 replies; 58+ messages in thread
From: Jakub Narebski @ 2010-07-26 13:13 UTC (permalink / raw)
  To: skillzero
  Cc: Avery Pennarun, Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Sat, Jul 24, 2010, skillzero@gmail.com wrote:
> On Fri, Jul 23, 2010 at 6:20 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Fri, Jul 23, 2010 at 8:58 PM,  <skillzero@gmail.com> wrote:
>>> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@gmail.com> wrote:

>> This is indeed a problem with large repositories.  Of course,
>> splitting them with git-submodule is kind of cheating, because it just
>> makes git-status *not look* to see if those files are dirty or not.
>> If they are dirty and you forget to commit them, you'll never know
>> until someone tells you later.  It would be functionally equivalent to
>> just have git-status not look inside certain subdirs of a single
>> repository.
> 
> I think it's only cheating if you're using all of the submodules. The
> main purpose of submodules for me (although I don't currently use
> submodules) would be so I don't need to keep modules on disk that I
> don't care about. If a developer is working on an app, they don't need
> the OS directories/modules so they get much faster git status/etc and
> there wouldn't be other directories to have dirty files in. [...]

Two considerations decide whether submodules or git-subtree is the better
solution.  If you work with subprojects via their upstream repositories,
and you don't always need or want all subprojects, git-submodule is
better.  If you always have all subprojects checked out, and you edit
them in the superproject, git-subtree is better.
 

-- 
Jakub Narebski
Poland

* Re: Avery Pennarun's git-subtree?
  2010-07-23 22:50                   ` Avery Pennarun
                                       ` (2 preceding siblings ...)
  2010-07-26  8:51                     ` Jakub Narebski
@ 2010-07-26 15:15                     ` Marc Branchaud
  3 siblings, 0 replies; 58+ messages in thread
From: Marc Branchaud @ 2010-07-26 15:15 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On 10-07-23 06:50 PM, Avery Pennarun wrote:
> On Fri, Jul 23, 2010 at 11:19 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
>> On 10-07-22 03:41 PM, Avery Pennarun wrote:
>>> 1) Sometimes I want to clone only some subdirs of a project
>>> 2) Sometimes I don't want the entire history because it's too big.
>>> 3) Super huge git repositories start to degrade in performance.
>>
>> The reason we turned to submodules is precisely to deal with repository size.
> 
> I believe that's very common.
> 
> However, I wonder whether that's actually a good reason for git to
> develop better submodules, or actually just a good reason for git to
> get better support for handling huge repositories.

I think that's a fundamental question, but part of the problem in coming up
with an answer is that there's no agreed-upon definition of how to handle
huge repos.  People have provided tools that answer the question in ways they
like, but I think the fact that these issues keep coming up is proof that git
isn't there yet.

>>  Our code base encompasses the entire FreeBSD tree plus different versions of
>> the Linux kernel, along with various third-party libraries & apps.  You don't
>> need everything to build a given product (a FreeBSD product doesn't use any
>> Linux kernels, for example) but because all the products share common code we
>> need to be able to branch and tag the common code along with the uncommon code.
> 
> Honest question: do you care about the wasted disk space and download
> time for these extra files?  Or just the fact that git gets slow when
> you have them?

It's not the disk space or the extra download time.  It's how long it takes
to check out all those files, and how long "git status" takes in a unified
repo.

>> So a straight "git clone" that would need to fetch all of FreeBSD plus 4
>> different Linux kernels and check all that out is a major problem, especially
>> for our automated build system (which could definitely be implemented better,
>> but still).
> 
> To be absolutely pedantic, the four linux kernels likely share most of
> their objects and so you're only paying the cost (at least during
> fetch) of including it once :)

That is true, but like I said the problem is the checkout.  Our different
products use different kernels (or FreeBSD):

	Product 1 -- Linux vX
	Product 2 -- Linux vY
	Product 3 -- FreeBSD

(Luckily we're currently only using one version of FreeBSD...)

All the products use common code.  When we release, we need to tag the common
code and the particular Linux kernel (or FreeBSD) we built the product with.
We can't stuff all the Linux kernels into a single submodule, because then
the repo will be "dirty" if we check out a different Linux kernel to build a
different product.  Even in a unified repo we'd need the kernels to live in
their own trees.

So we've ended up with individual submodules for each Linux kernel, and we've
taught our automated build to only clone/checkout the kernel it needs to
build the target product.  Otherwise the checkout I/O overshadows the actual
build time, especially when we try to run several builds in parallel on one
slave machine.
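
A minimal sketch of the selection logic described above, assuming a build
script that maps each product to the one kernel submodule it needs.  The
submodule paths (common, kernels/linux-vX, ...) and the function name are
hypothetical; only the product-to-kernel pairing comes from the list above.

```python
# Hypothetical submodule paths; the pairing follows the product list above.
KERNEL_FOR_PRODUCT = {
    "Product 1": "kernels/linux-vX",
    "Product 2": "kernels/linux-vY",
    "Product 3": "kernels/freebsd",
}

# Submodules every product needs (assumed to be a single shared directory).
COMMON_SUBMODULES = ["common"]


def submodules_to_checkout(product):
    """Return the submodule paths a build of `product` has to clone."""
    if product not in KERNEL_FOR_PRODUCT:
        raise KeyError("unknown product: " + product)
    return COMMON_SUBMODULES + [KERNEL_FOR_PRODUCT[product]]
```

A build for one product then never touches the other kernels' working
trees, which is where the checkout I/O would otherwise go.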

> (If you're actually using git-submodule and each copy of the kernel is
> its own module, then it might be cloning the kernel four times
> separately, in which case the objects *don't* get shared, so this ends
> up being much more expensive than it should be.  That could be fixed
> by slightly improving git-submodule to share some objects rather than
> rearchitecting it though.)

Even with the --reference parameter, it's still a problem.
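
For readers wondering why several kernel versions in one repository cost
little extra at fetch time: git names each file's content by a hash of that
content, so identical files are stored once.  The sketch below is a toy
content-addressed store, not git's implementation; the file names and
contents are made up, and only the blob-id formula (SHA-1 over a "blob"
header plus the content) matches what git does.

```python
import hashlib


def blob_id(content):
    """Compute the id git assigns to a blob with this content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def store_tree(store, files):
    """Add every file to `store` keyed by blob id; return path -> id.

    Identical content maps to the same id, so it occupies one slot in
    the store no matter how many trees refer to it.
    """
    tree = {}
    for path, content in files.items():
        oid = blob_id(content)
        store[oid] = content
        tree[path] = oid
    return tree


# Two invented "kernel versions" that differ in a single file.
store = {}
kernel_x = store_tree(store, {
    "Makefile": b"all:\n",
    "init/main.c": b"int start;\n",
    "kernel/sched.c": b"/* scheduler, version X */\n",
})
kernel_y = store_tree(store, {
    "Makefile": b"all:\n",
    "init/main.c": b"int start;\n",
    "kernel/sched.c": b"/* scheduler, version Y */\n",
})
```

Six files were added but the store holds four objects.  This sharing only
helps the object database; each checked-out working tree still needs its
own copy of every file, which is the cost discussed here.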

>>  In truth it's the checkout that takes the most time by far,
>> though commands like git-status also take inconveniently long.
> 
> Yeah, git could stand to be optimized a bit here.  And since Windows
> stats files about 10x slower than Linux, this problem occurs about 10x
> sooner on Windows, which makes using git on Windows (which sadly I
> have to do sometimes) extremely painful compared to Linux.
> 
> IMHO, the correct answer here is to have an inotify-based daemon prod
> at the .git/index automatically when files get updated, so that git
> itself doesn't have to stat/readdir through the entire tree in order
> to do any of its operations.  (Windows also has something like inotify
> that would work.)  If you had this, then git
> status/diff/checkout/commit would be just as fast with zillions of
> files as with 10 files.  Sooner or later, if nobody implements this, I
> promise I'll get around to it since inotify is actually easy to code
> for :)
> 
> Also note that the only reason submodules are faster here is that
> they're ignoring possibly important changes.  Notably, when you do
> 'git status' from the top level, it won't warn you if you have any
> not-yet-committed files in any of your submodules.  Personally, I
> consider that to be really important information, but to obtain it
> would make 'git status' take just as long as without submodules, so
> you wouldn't get any benefit.  (I think nowadays there's a way to get
> this recursive status information if you want it, but it'll be slow of
> course.)

I'm happy with a "git status" that can ignore uninitialized submodules and
still probe into initialized/cloned ones.  I agree that it's important for
"git status" to be correct.

>> We chose git-submodule over git-subtree mainly because git-submodule lets us
>> selectively checkout different parts of our code.  (AFAIK sparse checkouts
>> aren't yet an option.)
> 
> Fair enough.  If you could confirm or deny my theory that this is
> *entirely* a performance related concern (as opposed to disk space /
> download time), that would be helpful.

Consider it confirmed.  Honestly, disk space is a complete non-issue.  It's
always nice to have faster download times, but it hasn't been an issue for us
and there are already several ways to work around it anyway.

>>  We didn't really consider git-subtree because it's
>> not an official part of git, and we didn't want to have to teach (and nag)
>> all our developers to install and maintain it in addition to keeping up with
>> git itself.
> 
> Arguably, this is a vote for including git-subtree into the core
> (which was Bryan's point when he started this thread); it obviously is
> being rejected sometimes by git users simply because it's not in the
> core, even though it could help them.

Yes, I have no objection to seeing git-subtree become an official part of
git.  My only complaint would be that it doesn't really help git deal with
huge repos.

		M.

* Re: Avery Pennarun's git-subtree?
  2010-07-24  1:20                       ` Avery Pennarun
  2010-07-24 19:40                         ` skillzero
@ 2010-07-26 16:37                         ` Marc Branchaud
  2010-07-26 16:41                           ` Linus Torvalds
  1 sibling, 1 reply; 58+ messages in thread
From: Marc Branchaud @ 2010-07-26 16:37 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: skillzero, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On 10-07-23 09:20 PM, Avery Pennarun wrote:
> 
> I'm pushing extra hard on this because I believe there are lots of
> opportunities to just improve git performance on huge repositories.
> And if the only *real* reason people need to split repositories is
> that performance goes down, then that's fixable, and you may need
> neither git-submodule nor git-subtree.

I think I should mention one aspect of what we're doing, which is that a lot
of our submodules are based on external code, and that we occasionally need
to modify or customize some of that code.  So it's quite nice for us to
maintain private git mirrors of the external repos, with our own private
branches that contain our modifications.  Although we want to get much of our
changes incorporated into the upstream code bases, upstream release cycles
are rarely in sync with ours.

So it's very convenient for us to have our external-code modifications
contained in private branches in our private mirrors, and to rebase those
branches to keep up with upstream releases.  We also often use these private
branches to maintain the code that integrates the external code bases into
our overall build system.

I mention this purely because this pattern is so convenient that I don't want
to see it get lost in whatever may arise from this discussion.

		M.

* Re: Avery Pennarun's git-subtree?
  2010-07-26 16:37                         ` Marc Branchaud
@ 2010-07-26 16:41                           ` Linus Torvalds
  2010-07-26 17:36                             ` Bryan Larsen
  2010-07-27 18:28                             ` Avery Pennarun
  0 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2010-07-26 16:41 UTC (permalink / raw)
  To: Marc Branchaud
  Cc: Avery Pennarun, skillzero, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano

On Mon, Jul 26, 2010 at 9:37 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
>
> I think I should mention one aspect of what we're doing, which is that a lot
> of our submodules are based on external code, and that we occasionally need
> to modify or customize some of that code.  So it's quite nice for us to
> maintain private git mirrors of the external repos, with our own private
> branches that contain our modifications.  Although we want to get much of our
> changes incorporated into the upstream code bases, upstream release cycles
> are rarely in sync with ours.

THIS.

This is why I always thought that submodules absolutely have to be
commits, not trees. It's why the git submodule data structures are
done the way they are. Anything that makes the submodule just a tree
is fundamentally broken, I think.

That said, I'm not competent to comment on the actual user interface
issues. I can well believe that git-subtree has a nicer interface.

             Linus

* Re: Avery Pennarun's git-subtree?
  2010-07-22 21:33                     ` Avery Pennarun
  2010-07-23 15:10                       ` Jens Lehmann
@ 2010-07-26 17:34                       ` Eugene Sajine
  1 sibling, 0 replies; 58+ messages in thread
From: Eugene Sajine @ 2010-07-26 17:34 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Jakub Narebski, Jens Lehmann, Bryan Larsen, git, Junio C Hamano,
	Linus Torvalds

On Thu, Jul 22, 2010 at 5:33 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Thu, Jul 22, 2010 at 4:17 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> But it's also clear that we have a lot of tribal knowledge about the
>> shortcomings of git submodule / git subtree. It would be *really* useful
>> if people like Avery and Jens, who have obviously thought hard about
>> the submodule/subtree issues, would draft up some (calmly written) docs
>> about how the two differ (with comparison tables etc.).
>>
>> That'd be a very helpful resource for Git users in deciding which one
>> to use.
>
> I think I'm too biased to write that, but if someone else wants to
> take the lead, I could certainly contribute.
>
> Have fun,
>
> Avery


I personally tried to understand submodules, but my attempts to find an
easy way to use them have failed miserably ;)  I would probably have to
spend even more time to understand whether I can benefit from them or
not.  So I think this kind of comparison would be very beneficial for
"mere mortals".

I would like to share an idea of how it could be organized:

We could create a file in the doc section of git.git or in Avery's repo
named git_submodule_vs_git_subtree, or just use a separate topic on the
list.

The file would look like this:

git-submodule |           feature                  | git-subtree
______________________________________________________________________
     +        | ability to tag submodule without   |     -
   (comments) | tagging the whole tree             |  (comments)
______________________________________________________________________


Avery and Jens could add features they think are beneficial for one
project or the other and respond to each other this way. They could mark
just the presence or absence of a feature by +/- like above, or describe
the key approach to doing a given task with each tool.
For example, how to configure a new submodule (the main sequence of commands
to create and add it), and how to do the same with git-subtree...

I think this simple feature matrix will answer a lot of questions.
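
A minimal sketch of how such a matrix could be kept as data and rendered in
the plain-text layout shown above.  The function name is hypothetical, and
the only row used is the example row from this message; real entries would
have to come from people who know both tools.

```python
def render_matrix(rows):
    """Render (submodule_mark, feature, subtree_mark) rows as a text table."""
    # Size the middle column to the longest feature description.
    width = max([len("feature")] + [len(feature) for _, feature, _ in rows])
    header = f"{'git-submodule':^13} | {'feature':<{width}} | {'git-subtree':^11}"
    rule = "_" * len(header)
    lines = [header, rule]
    for submodule_mark, feature, subtree_mark in rows:
        lines.append(
            f"{submodule_mark:^13} | {feature:<{width}} | {subtree_mark:^11}"
        )
        lines.append(rule)
    return "\n".join(lines)


# The single example row proposed above.
EXAMPLE_ROWS = [
    ("+", "ability to tag submodule without tagging the whole tree", "-"),
]
```

Keeping the rows as data means each side can add or dispute an entry with
a one-line change, and the table is regenerated rather than re-aligned by
hand.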

just my 2 cents...

Thanks,
Eugene

* Re: Avery Pennarun's git-subtree?
  2010-07-26 16:41                           ` Linus Torvalds
@ 2010-07-26 17:36                             ` Bryan Larsen
  2010-07-26 17:48                               ` Linus Torvalds
  2010-07-27 18:28                             ` Avery Pennarun
  1 sibling, 1 reply; 58+ messages in thread
From: Bryan Larsen @ 2010-07-26 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Branchaud, Avery Pennarun, skillzero, Jakub Narebski,
	Jens Lehmann, Ævar Arnfjörð Bjarmason, git,
	Junio C Hamano

On 10-07-26 12:41 PM, Linus Torvalds wrote:
> On Mon, Jul 26, 2010 at 9:37 AM, Marc Branchaud<marcnarc@xiplink.com>  wrote:
>>
>> I think I should mention one aspect of what we're doing, which is that a lot
>> of our submodules are based on external code, and that we occasionally need
>> to modify or customize some of that code.  So it's quite nice for us to
>> maintain private git mirrors of the external repos, with our own private
>> branches that contain our modifications.  Although we want to get much of our
>> changes incorporated into the upstream code bases, upstream release cycles
>> are rarely in sync with ours.
>
> THIS.
>
> This is why I always thought that submodules absolutely have to be
> commits, not trees. It's why the git submodule data structures are
> done the way they are. Anything that makes the submodule just a tree
> is fundamentally broken, I think.
>
> That said, I'm not competent to comment on the actual user interface
> issues. I can well believe that git-subtree has a nicer interface.
>
>               Linus
>

To me, that's what git-subtree is: an internal private mirror of an
external repo.   Using git submodule moves that into a separately
managed repo, which is just unnecessary hassle.  Why maintain a repo
called "clone of library X for project A" when you can just stick it
inside of project A without any downsides?

For us, changes are made in the superproject and tested in the 
superproject.  Once they're tested, a git subtree push or a git subtree 
split pushes the patches to the subproject.   Once the subproject has 
accepted the patches, a git subtree pull merges them.   Same workflow as 
the "private git mirror of external repo" listed above, just without the 
hassle of having another repo to manage.

Bryan

* Re: Avery Pennarun's git-subtree?
  2010-07-26 17:36                             ` Bryan Larsen
@ 2010-07-26 17:48                               ` Linus Torvalds
  0 siblings, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2010-07-26 17:48 UTC (permalink / raw)
  To: Bryan Larsen
  Cc: Marc Branchaud, Avery Pennarun, skillzero, Jakub Narebski,
	Jens Lehmann, Ævar Arnfjörð Bjarmason, git,
	Junio C Hamano

On Mon, Jul 26, 2010 at 10:36 AM, Bryan Larsen <bryan.larsen@gmail.com> wrote:
>
> To me, that's what git-subtree is: an internal private mirror of an external
> repo.   Using git submodule moves that into a separately managed repo, which
> is just unnecessary hassle.  Why maintain a repo called "clone of library X
> for project A" when you can just stick it inside of project A without any
> downsides?

Without any downsides?

What about merging? What about complex history? IOW, what about
_anything_ but a few extra one-liner patches?

Background: the only time I ever used CVS modules, we had submodules
for things like gcc, binutils, etc. And maintained them separately
from upstream for _years_. Not with some simple one-liner fixes, but
with big fundamental changes that couldn't be sent upstream (and
wouldn't have been accepted anyway) etc.

THAT is the problem space. Not "just a mirror of another project".

                   Linus

* Re: Avery Pennarun's git-subtree?
  2010-07-26 16:41                           ` Linus Torvalds
  2010-07-26 17:36                             ` Bryan Larsen
@ 2010-07-27 18:28                             ` Avery Pennarun
  2010-07-27 20:25                               ` Junio C Hamano
  1 sibling, 1 reply; 58+ messages in thread
From: Avery Pennarun @ 2010-07-27 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Branchaud, skillzero, Jakub Narebski, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano

On Mon, Jul 26, 2010 at 09:41:42AM -0700, Linus Torvalds wrote:

> On Mon, Jul 26, 2010 at 9:37 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
> >
> > I think I should mention one aspect of what we're doing, which is that a lot
> > of our submodules are based on external code, and that we occasionally need
> > to modify or customize some of that code.  So it's quite nice for us to
> > maintain private git mirrors of the external repos, with our own private
> > branches that contain our modifications.  Although we want to get much of our
> > changes incorporated into the upstream code bases, upstream release cycles
> > are rarely in sync with ours.
> 
> THIS.
> 
> This is why I always thought that submodules absolutely have to be
> commits, not trees. It's why the git submodule data structures are
> done the way they are. Anything that makes the submodule just a tree
> is fundamentally broken, I think.

I agree completely.  The major failing of git-subtree is that it uses
tree->tree links instead of tree->commit links.

This was necessary only because git fundamentally *mistreats* tree->commit
links: it refuses to push or fetch through them automatically.  That is,
when I fetch a superproject that has a tree->commit link in it, git won't
fetch the subproject's history starting at the targeted commit, even if the
remote repo *has* that history.  And if I make a patch to the subproject,
pushing the superproject won't push that patch.
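
A toy model of the behaviour described above: when working out which
objects a fetch or push must transfer, the walk follows commit->tree,
tree->tree and tree->blob edges but stops at a tree->commit link (a
"gitlink"), so the subproject's objects are left behind.  The object names
and the dictionary layout are invented for illustration and say nothing
about git's real data structures; commit parents are omitted for brevity.

```python
# object id -> (type, [(edge kind, target id), ...])
OBJECTS = {
    "super-commit": ("commit", [("tree", "super-tree")]),
    "super-tree": ("tree", [
        ("blob", "README"),
        ("tree", "src-tree"),
        ("gitlink", "sub-commit"),  # the tree->commit link to the subproject
    ]),
    "src-tree": ("tree", [("blob", "main.c")]),
    "README": ("blob", []),
    "main.c": ("blob", []),
    "sub-commit": ("commit", [("tree", "sub-tree")]),
    "sub-tree": ("tree", [("blob", "lib.c")]),
    "lib.c": ("blob", []),
}


def reachable(start, follow_gitlinks=False):
    """Return the ids of all objects reachable from `start`.

    With follow_gitlinks=False the walk skips tree->commit links,
    mirroring the behaviour complained about above.
    """
    seen = set()
    stack = [start]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        _kind, edges = OBJECTS[oid]
        for edge_kind, target in edges:
            if edge_kind == "gitlink" and not follow_gitlinks:
                continue
            stack.append(target)
    return seen
```

With the default, "sub-commit", "sub-tree" and "lib.c" are never visited;
passing follow_gitlinks=True models the automatic behaviour asked for here.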

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-26  8:56                       ` Jakub Narebski
@ 2010-07-27 18:36                         ` Avery Pennarun
  2010-07-28 13:36                           ` Marc Branchaud
  2010-07-28 18:32                           ` Jakub Narebski
  0 siblings, 2 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-27 18:36 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: skillzero, Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Mon, Jul 26, 2010 at 10:56:58AM +0200, Jakub Narebski wrote:
> On Sat, Jul 24, 2010, skillzero@gmail.com wrote:
> > git-submodule might be technically possible in this situation, but
> > having to commit and push each submodule and then commit and push the
> > super module makes it slightly worse than just dealing with the
> > space/download/performance issues of one huge repository.
> 
> But this is just a matter of improving the UI for dealing with submodules,
> isn't it?  For example, having "git commit --recursive" would help
> with 'having to commit each submodule', though it is unclear how you would
> write commit messages then: perhaps the supermodule commit message could by
> default be composed out of the submodules' commit messages (if any).
> "git push --recursive" (or some support for push in "git remote") would
> help with 'having to push each submodule'.

For "recursive" commit, for my own workflow, I would rather have it work
like this: from the toplevel, I can 'git commit' any set of files, as long
as they all fall inside a particular submodule.  That is, if I do

	git commit mod1/*.c mod2/*.c
	
it should reject it (with a helpful message), because the commit would cross
submodule boundaries.  But if I do

	git commit mod1/*.c
	
I think it should create a new commit in mod1, leave my superproject
pointing at that new commit, and stop (ie. without the superproject having
committed the new commit pointer).
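
The rule proposed above (accept a commit only when every path falls inside
one module, reject it when the paths cross submodule boundaries) can be
modelled in a few lines.  This is a sketch of the proposed behaviour, not
something git implements; the function names are hypothetical and the
module paths are supplied by the caller.

```python
from pathlib import PurePosixPath

SUPERPROJECT = "<superproject>"


def owning_module(path, submodules):
    """Return the innermost submodule containing `path`.

    Paths outside every submodule belong to the superproject.  The longest
    matching prefix wins, so nested submodules are handled.
    """
    parts = PurePosixPath(path).parts
    best, best_len = SUPERPROJECT, 0
    for module in submodules:
        module_parts = PurePosixPath(module).parts
        if parts[:len(module_parts)] == module_parts and len(module_parts) > best_len:
            best, best_len = module, len(module_parts)
    return best


def check_commit_paths(paths, submodules):
    """Return the one module all `paths` live in, or raise ValueError."""
    if not paths:
        raise ValueError("no paths given")
    owners = {owning_module(path, submodules) for path in paths}
    if len(owners) != 1:
        raise ValueError(
            "commit would cross submodule boundaries: " + ", ".join(sorted(owners))
        )
    return owners.pop()
```

For example, with submodules mod1 and mod2, the paths mod1/a.c and mod1/b.c
are accepted as a commit in mod1, while mod1/a.c together with mod2/b.c is
rejected.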

Why?  Because my normal workflow is:

  - make a bunch of superproject/submodule changes until they work.
  - commit the submodule changes with a submodule-relevant message
  - commit the superproject change with a supermodule-relevant message
  
I wouldn't want to share commit messages between the two, so actually having
a single commit process be "recursive" would not do me any good.

However, pushing is a separate issue entirely.  Having push be recursive
would be easy, but it doesn't solve the *real* problem with pushing: git
doesn't know what branch to push to in the submodule, and the submodule most
likely isn't pointing at a pushable repo at all, even if the supermodule is. 
This is why I keep coming back to the idea that I really want to push all
the submodule objects into the superproject's repo.

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-25 19:57                     ` Jens Lehmann
@ 2010-07-27 18:40                       ` Avery Pennarun
  2010-07-27 21:14                         ` Jens Lehmann
  0 siblings, 1 reply; 58+ messages in thread
From: Avery Pennarun @ 2010-07-27 18:40 UTC (permalink / raw)
  To: Jens Lehmann
  Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds, Heiko Voigt

On Sun, Jul 25, 2010 at 09:57:55PM +0200, Jens Lehmann wrote:

> On 24.07.2010 00:32, Avery Pennarun wrote:
> > On Fri, Jul 23, 2010 at 11:10 AM, Jens Lehmann <Jens.Lehmann@web.de> wrote:
> >> You forgot what we do as best practice at work:
> >>
> >> [3] Fork the gem repos on github (or another server reachable by your
> >>    co-workers) and use those, so you don't have to change the URL
> >>    later:
> >>
> >>    git://github.com/apenwarrrubygems/gem[1..n]
> >>
> >> Your problems go away, setup has to be done only once on project
> >> start and not for every developer, you can use your own branchnames
> >> and you have a staging repo from where you can push patches upstream
> >> if necessary.
> > 
> > Now all your fellow developers have to push their submodule code to a
> > single upstream repo?  That's rather centralized and un-git-like.
> 
> But isn't that exactly the same thing you would have to do for your
> superproject too to be able to push your changes for your fellows?

No.  On github, only I can push to my superproject's history, and yet
everyone can still pull from me.

With what you're proposing, for all my submodules, we can't each have our
own project; we all have to push to the shared one.

(Just to be clear: I don't want to fork *every submodule by hand every
time*.  I just want *my* stuff to be in *my* repo.  The easiest way to do
this would be to have all my changes in a single repo, ie. my fork of the
superproject.)

> >> It is the /commits/ that have to be done twice, once in the submodule
> >> and then in the superproject. (But that is not necessarily bad, imagine
> >> having git gui as a submodule: you would be automagically reminded that
> >> stuff for git gui should be sent somewhere else than to Junio).
> > 
> > Yup, I agree that requiring a separate commit to the submodule repo is
> > not a bad idea.  I always do this anyway even when using git-subtree,
> > because I'm thinking ahead to the day when I'll push my submodule
> > changes upstream and I want my commit message to make sense.  But
> > that's because I think ahead like that.  Having the tool force me to
> > do it would be harmless and help people avoid mistakes.
> 
> And submodules force you to do that.

Yes.  This is a limitation of submodules, but not one that bothers me.  And
it encourages good behaviour.

> > The syntax for it ought to be nice though.  I should be able to do:
> > 
> >     git commit -- path/to/submodule
> > 
> > And have it commit everything in the submodule tree as a new commit in
> > the submodule.  I don't want to have to think about cd'ing to
> > path/to/submodule just so I can commit the files I changed in there.
> 
> Yes, that would be a nice feature (assuming you have a branch in the
> submodule to commit these changes to ;-).

No, I explicitly *don't* want to have to have a branch in the submodule;
that's too much extra thinking at that stage.

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-26  8:51                     ` Jakub Narebski
@ 2010-07-27 19:15                       ` Avery Pennarun
  0 siblings, 0 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-27 19:15 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Mon, Jul 26, 2010 at 4:51 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> On Sat, 24 Jul 2010 00:50, Avery Pennarun wrote:
>> My bup project (http://github.com/apenwarr/bup) is all about huge
>> repositories.  It handles repositories with hundreds of gigabytes, and
>> trees containing millions of files (entire filesystems), quite nicely.
>>  Of course, it's not a version control system, so it won't solve your
>> problems.  It's just evidence that large repositories are actually
>> quite manageable without changing the fundamentals of git.
>
> There is also git-bigfiles project, although it is more about large
> [binary] files than large repositories per se (many files, long history).

Right.  git-bigfiles is valuable, but it's valuable with or without
submodules.  (If you have large blobs, submodules won't save you.)

bup happens to have its own way of dealing with large files too, but
it may not be applicable to git.  It does result in lots and lots of
smaller objects, though, which is why I know git repositories are
fundamentally capable of handling lots and lots of smaller objects :)

> Note that with 'bup' you might not see problems with large repositories
> because it does not examine code paths that are slow in large repositories
> (gc, log, path-delimited log).

gc is a huge problem.  bup avoids it entirely (it foregoes delta
compression); git gc fails completely on such large repositories (100+
GB).  There's no reason this has to be true forever, but yes, to
support really big repos, git gc would need to be improved somewhat.
For most reasonably sane repos (a few GB) you can get reasonable
performance by just making your biggest packfiles .keep so they don't
keep getting repacked all the time.
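
A minimal sketch of the .keep trick mentioned above, assuming the usual
layout where packs live in .git/objects/pack and a pack is left alone by
repacking when a file with the same name but a .keep extension sits next to
it.  The function name and the size threshold are hypothetical.

```python
import os


def keep_large_packs(pack_dir, min_bytes):
    """Create an empty .keep marker beside every pack of at least min_bytes.

    Returns the names of the .keep files that now exist for large packs.
    """
    kept = []
    for name in sorted(os.listdir(pack_dir)):
        if not (name.startswith("pack-") and name.endswith(".pack")):
            continue
        pack_path = os.path.join(pack_dir, name)
        if os.path.getsize(pack_path) < min_bytes:
            continue
        keep_path = pack_path[:-len(".pack")] + ".keep"
        if not os.path.exists(keep_path):
            # An empty file is enough; only its presence matters.
            with open(keep_path, "w"):
                pass
        kept.append(os.path.basename(keep_path))
    return kept
```

Called as keep_large_packs(".git/objects/pack", 1 << 30), this would mark
every pack of 1 GiB or more; removing a .keep file later makes that pack
eligible for repacking again.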

Compared to that, log feels like not a problem at all :)  At least
performance-wise.  The thing that sucks about log using git-subtree,
of course, is that you get all these log messages from multiple
projects jammed together into a single repo, which is rarely what you
want, even if it's fast.  I think the "best" solution is a single repo
with all your objects, but still keeping the histories of each
submodule separate.

>> IMHO, the correct answer here is to have an inotify-based daemon prod
>> at the .git/index automatically when files get updated, so that git
>> itself doesn't have to stat/readdir through the entire tree in order
>> to do any of its operations.  (Windows also has something like inotify
>> that would work.)  If you had this, then git
>> status/diff/checkout/commit would be just as fast with zillions of
>> files as with 10 files.  Sooner or later, if nobody implements this, I
>> promise I'll get around to it since inotify is actually easy to code
>> for :)
>
> IIUC the problem is that inotify is not automatically recursive, so
> daemon would have to take care of adding inotify trigger to each newly
> created subdirectory.

Yeah, the inotify API is kind of gross that way.  But it can be done,
and people do.  (eg. the beagle project)
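
The bookkeeping this implies can be sketched without touching inotify
itself: after a directory-creation event, the daemon has to find every
directory that does not yet have a watch and add one.  The code below only
computes that set by walking the tree; the function names are hypothetical,
skipping .git is an assumption, and a real daemon would call the inotify
API for each returned path (and would still have to cope with files created
before the new watch is in place).

```python
import os


def directories_to_watch(root):
    """Return every directory under `root` (inclusive), skipping .git."""
    found = set()
    for dirpath, dirnames, _filenames in os.walk(root):
        # Prune in place so os.walk does not descend into .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        found.add(os.path.abspath(dirpath))
    return found


def missing_watches(root, watched):
    """Return directories under `root` that are not in `watched` yet."""
    return directories_to_watch(root) - set(watched)
```

A daemon would call directories_to_watch() once at startup and
missing_watches() whenever a new directory shows up, since one "mkdir -p"
can create several levels at once.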

>> Also note that the only reason submodules are faster here is that
>> they're ignoring possibly important changes.  Notably, when you do
>> 'git status' from the top level, it won't warn you if you have any
>> not-yet-committed files in any of your submodules.  Personally, I
>> consider that to be really important information, but to obtain it
>> would make 'git status' take just as long as without submodules, so
>> you wouldn't get any benefit.  (I think nowadays there's a way to get
>> this recursive status information if you want it, but it'll be slow of
>> course.)
>
> Errr... didn't it get improved in recent git?  I think git-status now
> includes information about submodules if configured so / unless configured
> otherwise.  Doesn't it?

Yes, but you're still left with the choice between slow (checks all
files in all submodules) and not slow (might miss stuff).  This isn't
a submodule question, really, it's an overall performance question
with huge checkouts with or without submodules.

>>> We chose git-submodule over git-subtree mainly because git-submodule lets us
>>> selectively checkout different parts of our code.  (AFAIK sparse checkouts
>>> aren't yet an option.)
>
> Sparse checkouts are here, IIRC, but they do not solve the problems of
> disk space (the files are still in the repository, even if not checked
> out) and speed (they still need to be fetched, even if not checked out).

Hmm, don't mix bandwidth usage (and thus the slowness of fetch) with
slowness during everyday usage.  I don't mind a slow fetch now and
then, but 'git status' should be fast. AFAIK, sparse checkouts
*should* make git status faster.  If they don't, it's probably just a
bug.

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-27 18:28                             ` Avery Pennarun
@ 2010-07-27 20:25                               ` Junio C Hamano
  2010-07-27 20:57                                 ` Avery Pennarun
  0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2010-07-27 20:25 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Linus Torvalds, Marc Branchaud, skillzero, Jakub Narebski,
	Jens Lehmann, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git

Avery Pennarun <apenwarr@gmail.com> writes:

> On Mon, Jul 26, 2010 at 09:41:42AM -0700, Linus Torvalds wrote:
>
>> On Mon, Jul 26, 2010 at 9:37 AM, Marc Branchaud <marcnarc@xiplink.com> wrote:
>> >
>> > I think I should mention one aspect of what we're doing, which is that a lot
>> > of our submodules are based on external code, and that we occasionally need
>> > to modify or customize some of that code.  So it's quite nice for us to
>> > maintain private git mirrors of the external repos, with our own private
>> > branches that contain our modifications.  Although we want to get much of our
>> > changes incorporated into the upstream code bases, upstream release cycles
>> > are rarely in sync with ours.
>> 
>> THIS.
>> 
>> This is why I always thought that submodules absolutely have to be
>> commits, not trees. It's why the git submodule data structures are
>> done the way they are. Anything that makes the submodule just a tree
>> is fundamentally broken, I think.
>
> I agree completely.  The major failing of git-subtree is that it uses
> tree->tree links instead of tree->commit links.
>
> This was necessary only because git fundamentally *mistreats* tree->commit
> links: it refuses to push or fetch through them automatically.

I do not think that is so "fundamental" as you seem to think.

Isn't it just a matter of how the default UI of object transfer commands
(like push and fetch) is set up?

Admittedly, the way the default UI is set up is to strongly favor the
early design decision we made back when Linus did his initial "gitlink"
implementation, which is "separate project lives in a separate repository,
and not having to check out any subproject should be the norm for using a
superproject".  

Some "recursive" operations have been added to commands for which it makes
sense (e.g. "clone --recursive") by people who cared enough.  Even though
there are a few other commands that shouldn't ever learn the recursive
mode (e.g. "commit --recursive -m $msg" would not make sense), there still
are some commands where a similar "--recursive" option would make sense
but haven't learned it (e.g. "push --recursive").

I also consider it merely a lack of UI enhancement that you have to clone
the submodule again (or cannot switch to a clean slate very easily) when
switching between revisions of superproject before and after you add a
submodule, and nothing fundamental.  

When switching back in history to lose a recent submodule, the user
experience should be like switching to a revision that didn't have a
directory.  You shouldn't be able to lose your change in that directory,
but if the directory is clean, you should be able to lose it.  And when
you switch to a more recent revision that has the submodule, you should be
able to get it back (again, if you have a precious file there, the
checkout should barf).

We have added support for having "gitdir: $dir" in a regular file .git
exactly because we wanted to be able to stash away the submodule's .git
directory somewhere inside .git (e.g. .git/modules/<submodulename>) in the
superproject when we do that kind of branch switching, so that we can get
it back when switching back to a revision with the submodule without
having to re-clone (also this presumably would help when you move the
submodule in the superproject tree), but there hasn't been further work
to make use of this in "git submodule update" (it probably needs to start
by teaching "git clone" how to make use of "gitdir: $dir", if anybody is
interested).
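
For readers who have not seen the "gitdir: $dir" file, the lookup can be
sketched in a few lines of Python. This is an illustration only, not git's
implementation; resolve_gitfile is a made-up name, and the sketch assumes the
one-line "gitdir: <path>" format with a relative path being taken relative to
the directory that holds the .git file.

```python
import os


def resolve_gitfile(worktree):
    """Return the repository directory for a work tree.

    If <worktree>/.git is a directory it is returned as-is; if it is a
    regular file, its "gitdir: <path>" line is followed instead.
    """
    dot_git = os.path.join(worktree, ".git")
    if os.path.isdir(dot_git):
        return dot_git
    with open(dot_git, encoding="utf-8") as fh:
        line = fh.readline().rstrip("\n")
    prefix = "gitdir: "
    if not line.startswith(prefix):
        raise ValueError("not a gitfile: %r" % dot_git)
    target = line[len(prefix):]
    # os.path.join keeps an absolute target unchanged and anchors a
    # relative one at the work tree that contains the .git file.
    return os.path.normpath(os.path.join(worktree, target))
```

With the submodule's real repository stashed under the superproject's .git,
removing and re-creating the submodule's work tree only has to rewrite this
one small file.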

By the way, I also do not think it is such a bad thing that git-subtree
does not bind commits into its superproject tree while it is working
"natively" (in a "git-subtree" workflow), but allows users to easily split
the history into a shape exportable to the upstreams of its submodules when
such an operation is needed.  If you rarely push back to upstreams but
constantly consume their changes, that sounds like a reasonable way to go.

* Re: Avery Pennarun's git-subtree?
  2010-07-27 20:25                               ` Junio C Hamano
@ 2010-07-27 20:57                                 ` Avery Pennarun
  2010-07-27 21:14                                   ` Junio C Hamano
  2010-07-27 21:32                                   ` Jens Lehmann
  0 siblings, 2 replies; 58+ messages in thread
From: Avery Pennarun @ 2010-07-27 20:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Marc Branchaud, skillzero, Jakub Narebski,
	Jens Lehmann, Ævar Arnfjörð,
	Bryan Larsen, git

On Tue, Jul 27, 2010 at 4:25 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> I agree completely.  The major failing of git-subtree is that it uses
>> tree->tree links instead of tree->commit links.
>>
>> This was necessary only because git fundamentally *mistreats* tree->commit
>> links: it refuses to push or fetch through them automatically.
>
> I do not think that is so "fundamental" as you seem to think.
>
> Isn't it just a matter of how the default UI of object transfer commands
> (like push and fetch) is set up?

Well, I call it fundamental because there's currently no way to get
the git UI to do otherwise.  It's not really just a "default."  To
depend on this changing would have prevented me from writing
git-subtree, which is why I didn't depend on it.  However, I agree
that it's fixable.

Note that the way git treats a checked-out submodule (as you describe
below) is also very fundamental to how this works.  git-subtree
wouldn't have the usability that it does if 'git checkout branchname'
didn't work perfectly with all the subtrees, which it currently does,
but which it wouldn't if I had relied on tree->commit links.

> Some "recursive" operations have been added to commands for which it makes
> sense (e.g. "clone --recursive") by people who cared enough.  Even though
> there are a few other commands that shouldn't ever learn the recursive
> mode (e.g. "commit --recursive -m $msg" would not make sense), there still
> are some commands where a similar "--recursive" option would make sense
> but haven't learned it (e.g. "push --recursive").

One problem with this line of reasoning is that "--recursive" is
always an option.  But if submodules are ever to be easy to use, I
think it should be the default (or settable as a default using git
config).  This would take us a *long* way towards usability (of
course, in addition to adding the missing features, as you mention).

Also, I haven't tried it, but I think 'git gc' will prune away objects
if the only reference to them is a 'commit' link from a tree.  This
would be undesirable too.

> I also consider it merely a lack of UI enhancement that you have to clone
> the submodule again (or cannot switch to a clean slate very easily) when
> switching between revisions of superproject before and after you add a
> submodule, and nothing fundamental.

I mostly agree with this.  There is one problem I don't know how to
solve with this idea, though: what happens when commit A adds a
submodule in modules/mod1, commit B removes it, and then commit C
re-adds the same submodule in modules/mod1-again?  Will it reuse the
same submodule .git directory or a new one?  Share objects or not?
Share branch names or not?  Share .git/config or not?

Unless you have some kind of "unique id" scheme for submodules, this
gets impossible to handle correctly.  And the git objects themselves
(trees that link to commits) have nowhere to put such things.

By comparison, simply putting all the stuff related to all the
submodules into the supermodule's repo creates none of these confusing
problems.  You could even still choose not to checkout individual
submodules' trees if you wanted.

> When switching back in history to lose a recent submodule, the user
> experience should be like switching to a revision that didn't have a
> directory.  You shouldn't be able to lose your change in that directory,
> but if the directory is clean, you should be able to lose it.  And when
> you switch to a more recent revision that has the submodule, you should be
> able to get it back (again, if you have a precious file there, the
> checkout should barf).

It sounds like you're proposing that we delete the entire submodule's
directory hierarchy when the submodule commit link goes away.  Note
that this isn't what happens in the non-submodule case: all the *.o
files, for example, in a deleted subdirectory are not automatically
deleted by git.  And I think this is the behaviour we should expect.

With that in mind, the situations where checkout barfs because of a
"precious" file should be the same as they are in normal git: it
should only be a problem if the files in question differ between the
originally-checked-out tree and the newly-checked-out tree.

Apologies if that's what you meant in the first place.

> We have added support for having "gitdir: $dir" in a regular file .git
> exactly because we wanted to be able to stash away the submodule's .git
> directory somewhere inside .git (e.g. .git/modules/<submodulename>) in the
> superproject when we do that kind of branch switching, so that we can get
> it back when switching back to a revision with the submodule without
> having to re-clone (also this presumably would help when you move the
> submodule in the superproject tree), but there hasn't been further work
> to make use of this in "git submodule update" (it probably needs to start
> by teaching "git clone" how to make use of "gitdir: $dir", if anybody is
> interested).

I guess the real question is: just how much of a "real" repository do
we want a submodule to act like?

Thoughts:

- object store: I think this should just always be shared with the
superproject.  There's no reason to separate them that I can see.

- branches: should be a way to simply not worry about branches and
just use what's in the superproject.  Other people seem to want to be
able to have a set of branches/tags for their submodule.

- .git/config: entirely shared?  entirely separate?

- remotes: I would want my submodules to never do their own
pushing/pulling, and leave that to the supermodule; other people seem
to disagree.

For the particular model I'm proposing, I'm just not sure that *any*
of the features of a separate repo are warranted... and having them
adds a lot of complication.  (At the most basic level, you suddenly
need to track .git directories as submodules are added/deleted/moved
around when you checkout different revisions of the superproject, and
there seems to be no way to do that elegantly.)

Have fun,

Avery

* Re: Avery Pennarun's git-subtree?
  2010-07-27 20:57                                 ` Avery Pennarun
@ 2010-07-27 21:14                                   ` Junio C Hamano
  2010-07-27 21:32                                   ` Jens Lehmann
  1 sibling, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2010-07-27 21:14 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Junio C Hamano, Linus Torvalds, Marc Branchaud, skillzero,
	Jakub Narebski, Jens Lehmann, Ævar Arnfjörð,
	Bryan Larsen, git

Avery Pennarun <apenwarr@gmail.com> writes:

> ...  There is one problem I don't know how to
> solve with this idea, though: what happens when commit A adds a
> submodule in modules/mod1, commit B removes it, and then commit C
> re-adds the same submodule in modules/mod1-again?  Will it reuse the
> same submodule .git directory or a new one?  Share objects or not?
> Share branch names or not?  Share .git/config or not?
>
> Unless you have some kind of "unique id" scheme for submodules, this
> gets impossible to handle correctly.  And the git objects themselves
> (trees that link to commits) have nowhere to put such things.

I vaguely recall that we already had discussed and more or less resolved
it at the design level at some point.  Looking for "three-level thing" in
the gmane archive might be beneficial, although all I recall are these three
words as search keywords; I do not have a detailed recollection of the actual
discussion ;-)

* Re: Avery Pennarun's git-subtree?
  2010-07-27 18:40                       ` Avery Pennarun
@ 2010-07-27 21:14                         ` Jens Lehmann
  0 siblings, 0 replies; 58+ messages in thread
From: Jens Lehmann @ 2010-07-27 21:14 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason,
	Bryan Larsen, git, Junio C Hamano, Linus Torvalds, Heiko Voigt

On 27.07.2010 20:40, Avery Pennarun wrote:
> With what you're proposing, for all my submodules, we can't each have our
> own project; we all have to push to the shared one.
> 
> (Just to be clear: I don't want to fork *every submodule by hand every
> time*.  I just want *my* stuff to be in *my* repo.  The easiest way to do
> this would be to have all my changes in a single repo, ie. my fork of the
> superproject.)

Fair enough, but that would not be the Right Thing for my use cases.
(E.g. I am using submodules to have a single upstream repo for a library
which I use in almost all my projects. Fixes to that library that I make in
one of these projects should be fetchable in all other projects right
after I push them to the submodule's repo, without having to push them
out of the superproject's repo into the shared one /again/. The situation
at my day job is the same, and I assume a lot of people are using submodules
this way.)

So I would vote for not breaking the *feature* submodules currently have:
to use a different repo than that used for the superproject. Because that
enables you to have shared content. I am not against having the /choice/
to have the submodule's objects in the same repo as the superproject, but
that should be an option and not mandatory.

* Re: Avery Pennarun's git-subtree?
  2010-07-27 20:57                                 ` Avery Pennarun
  2010-07-27 21:14                                   ` Junio C Hamano
@ 2010-07-27 21:32                                   ` Jens Lehmann
  1 sibling, 0 replies; 58+ messages in thread
From: Jens Lehmann @ 2010-07-27 21:32 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Junio C Hamano, Linus Torvalds, Marc Branchaud, skillzero,
	Jakub Narebski, Ævar Arnfjörð,
	Bryan Larsen, git

On 27.07.2010 22:57, Avery Pennarun wrote:
> One problem with this line of reasoning is that "--recursive" is
> always an option.  But if submodules are ever to be easy to use, I
> think it should be the default (or settable as a default using git
> config).  This would take us a *long* way towards usability (of
> course, in addition to adding the missing features, as you mention).

And that is exactly what I am currently doing:

- I already taught diff and status to always recurse (and just
  sent a patch to add a config option for that behavior, as some
  users either can't pay the performance costs or don't want to
  see submodules show up as modified just because they contain
  untracked files).

- I posted a WIP patch doing recursive checkouts (that is basically
  working but I still have to put in the safety checks so that no
  modifications to submodules are accidentally discarded unless -f
  is used).

- I am working on a recursive fetch too.

And then there is other stuff on my list to be tackled; I try to
fix these issues so that the most annoying problems get solved
first.

Unfortunately that does not proceed as fast as I wished, but
hopefully I can show some progress in the near future. Of course
any help would greatly be appreciated ;-)

* Re: Avery Pennarun's git-subtree?
  2010-07-27 18:36                         ` Avery Pennarun
@ 2010-07-28 13:36                           ` Marc Branchaud
  2010-07-28 18:32                           ` Jakub Narebski
  1 sibling, 0 replies; 58+ messages in thread
From: Marc Branchaud @ 2010-07-28 13:36 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Jakub Narebski, skillzero, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On 10-07-27 02:36 PM, Avery Pennarun wrote:
> 
> For "recursive" commit, for my own workflow, I would rather have it work
> like this: from the toplevel, I can 'git commit' any set of files, as long
> as they all fall inside a particular submodule.  That is, if I do
> 
> 	git commit mod1/*.c mod2/*.c
> 	
> it should reject it (with a helpful message), because the commit would cross
> submodule boundaries.  But if I do
> 
> 	git commit mod1/*.c
> 	
> I think it should create a new commit in mod1, leave my superproject
> pointing at that new commit, and stop (ie. without the superproject having
> committed the new commit pointer).

I think that makes perfect sense.  I'd also want the updated pointer to be
unstaged.
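
The rule quoted above can be written down as a small check. What follows is
a Python sketch of the proposal, not of anything git does today; the
function names are invented, submodule paths are assumed to be given
relative to the superproject root with "/" separators, and a path that names
the submodule directory itself is treated as belonging to the superproject
(it is the commit pointer, not a file inside the submodule).

```python
def owning_module(path, submodules):
    """Return the innermost submodule containing path, or None.

    None means the path belongs to the superproject itself.
    """
    best = None
    for sub in submodules:
        prefix = sub.rstrip("/") + "/"
        # Only paths strictly inside the submodule count; the submodule
        # directory entry itself is owned by the superproject.
        if path.startswith(prefix):
            if best is None or len(sub) > len(best):
                best = sub
    return best


def check_commit_paths(paths, submodules):
    """Return the single module all paths fall in, or raise ValueError."""
    owners = {owning_module(p, submodules) for p in paths}
    if len(owners) > 1:
        names = sorted(o or "<superproject>" for o in owners)
        raise ValueError(
            "commit would cross submodule boundaries: " + ", ".join(names))
    return owners.pop() if owners else None
```

Under these assumptions, the paths mod1/a.c and mod1/b.c are accepted and
attributed to mod1, while mod1/a.c together with mod2/b.c is rejected.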

> Why?  Because my normal workflow is:
> 
>   - make a bunch of superproject/submodule changes until they work.
>   - commit the submodule changes with a submodule-relevant message
>   - commit the superproject change with a supermodule-relevant message
>   
> I wouldn't want to share commit messages between the two, so actually having
> a single commit process be "recursive" would not do me any good.

That's the workflow I'd like to follow as well.

In terms of achieving this workflow with submodules and branching, what's
required is that branching in the superproject takes the submodules off of
the detached HEAD and onto something that won't get automatically
garbage-collected in a few weeks.

That could be done simply by applying the superproject's branch to all the
submodules.  A command like

	superproject/$ git branch foo origin/master

would create the submodule branches on the commits identified for the
submodules in the superproject's origin/master commit.  Making that work
smoothly, I think, requires all the submodules' .git directories to be
available, so the branch name can be recorded in all of them.
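
As a rough illustration of that idea, the Python sketch below only builds
the per-submodule commands; it does not run them. It is hypothetical
throughout: submodule_branch_commands is a made-up helper, and the mapping
from submodule path to recorded commit ID is assumed to have been read from
the superproject's start point by some other means.

```python
def submodule_branch_commands(branch, recorded_commits):
    """Build one 'git branch' command per submodule.

    recorded_commits maps a submodule path to the commit ID that the
    superproject's start point records for that submodule.
    """
    commands = []
    for path in sorted(recorded_commits):
        commit = recorded_commits[path]
        # "git -C <path>" runs the command as if started inside <path>.
        commands.append(["git", "-C", path, "branch", branch, commit])
    return commands
```

Each resulting list could be handed to subprocess.run once every submodule's
repository is actually present, which is the requirement noted above.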

And so I think that either "git fetch" has to recursively obtain (and update)
all submodule repos, or there needs to be some kind of on-demand retrieval
mechanism.  Other ideas for grand-unified object stores (which I haven't been
following too closely) could work as well.

So with unified branching and available .git directories, I think a recursive
checkout is doable and makes sense.  I'd still like to control which
submodules a checkout might recurse through, but I think the sparse-checkout
system is the way to handle that.

I also suspect that non-fast-forward submodule merges could be workable,
where regular merges are performed individually in the submodules before
merging in the superproject.

One final, somewhat orthogonal thought:  I think that "git commit
submodule-dir" should require -f if the remote associated with the submodule
doesn't have the commit ID you're trying to commit.

> However, pushing is a separate issue entirely.  Having push be recursive
> would be easy, but it doesn't solve the *real* problem with pushing: git
> doesn't know what branch to push to in the submodule, and the submodule most
> likely isn't pointing at a pushable repo at all, even if the supermodule is. 
> This is why I keep coming back to the idea that I really want to push all
> the submodule objects into the superproject's repo.

I agree that recursive pushing doesn't make much sense, so there's no need to
try to implement it.  I think having "git commit" reject unpushed submodule
updates in the superproject goes a long way to alleviating misordered pushing.

		M.

* Re: Avery Pennarun's git-subtree?
  2010-07-27 18:36                         ` Avery Pennarun
  2010-07-28 13:36                           ` Marc Branchaud
@ 2010-07-28 18:32                           ` Jakub Narebski
  1 sibling, 0 replies; 58+ messages in thread
From: Jakub Narebski @ 2010-07-28 18:32 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: skillzero, Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Tue, Jul 27, 2010, Avery Pennarun wrote:
> On Mon, Jul 26, 2010 at 10:56:58AM +0200, Jakub Narebski wrote:
> > On Sat, Jul 24, 2010, skillzero@gmail.com wrote:
> > >
> > > git-submodule might be technically possible in this situation, but
> > > having to commit and push each submodule and then commit and push the
> > > super module makes it slightly worse than just dealing with the
> > > space/download/performance issues of one huge repository.
> > 
> > But this is just a matter of improving the UI for dealing with submodules,
> > isn't it?  For example, having "git commit --recursive" would help
> > with 'having to commit each submodule', though how would you write commit
> > messages then?  Perhaps the supermodule commit message could by default be
> > composed out of the submodules' commits (if any).  "git push --recursive"
> > (or some support for push in "git remote") would help with 'having to
> > push each submodule'.
> 
> For "recursive" commit, for my own workflow, I would rather have it work
> like this: from the toplevel, I can 'git commit' any set of files, as long
> as they all fall inside a particular submodule.  That is, if I do
> 
> 	git commit mod1/*.c mod2/*.c
> 	
> it should reject it (with a helpful message), because the commit would cross
> submodule boundaries.  But if I do
> 
> 	git commit mod1/*.c
> 	
> I think it should create a new commit in mod1, leave my superproject
> pointing at that new commit, and stop (ie. without the superproject having
> committed the new commit pointer).
> 
> Why?  Because my normal workflow is:
> 
>   - make a bunch of superproject/submodule changes until they work.
>   - commit the submodule changes with a submodule-relevant message
>   - commit the superproject change with a supermodule-relevant message
>   
> I wouldn't want to share commit messages between the two, so actually having
> a single commit process be "recursive" would not do me any good.

I think it is quite a good idea, but it covers only one of the three most
commonly used (I think) forms of git-commit:
 * git commit <files>        # your proposal covers this
 * git commit -a             # but I think either this
 * git commit                # or this is actually more common

Also "git commit ." in a submodule cannot be done in this proposal,
because it is indistinguishable from "git commit <submodule>" committing
the state of the submodule in the supermodule.

Perhaps it would be a matter of porting "--relative=<path>" or adding a
"--submodule=<name>" option to git-commit?

> However, pushing is a separate issue entirely.  Having push be recursive
> would be easy, but it doesn't solve the *real* problem with pushing: git
> doesn't know what branch to push to in the submodule, and the submodule most
> likely isn't pointing at a pushable repo at all, even if the supermodule is. 
> This is why I keep coming back to the idea that I really want to push all
> the submodule objects into the superproject's repo.

I think there should be two easy-to-obtain variants of recursive clone:

1. The current one, where each submodule gets its own repository in the
   place it is checked out in the working area (worktree) of the supermodule.

2. A new one, where submodule repositories are in .git/submodules/<name>
   in the supermodule's GIT_DIR, and submodules use gitfiles (probably with
   some notation that the path is relative to the supermodule, e.g. //<path>
   or .../<path>).

I'm not sure though how it would translate into pushing...

-- 
Jakub Narebski
Poland

* Re: Avery Pennarun's git-subtree?
  2010-07-25  1:47                           ` Nguyen Thai Ngoc Duy
@ 2010-07-28 22:27                             ` Jakub Narebski
  0 siblings, 0 replies; 58+ messages in thread
From: Jakub Narebski @ 2010-07-28 22:27 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: skillzero, Avery Pennarun, Marc Branchaud, Jens Lehmann,
	Ævar Arnfjörð Bjarmason, Bryan Larsen, git,
	Junio C Hamano, Linus Torvalds

On Sunday, 25 July 2010 03:47, Nguyen Thai Ngoc Duy wrote:
> 
> By the way, how hard is it to use git-replace to implement narrow clone?

I don't think that git-replace should be used to implement narrow clone,
although it could probably be abused to do so.  The refs/replace
mechanism is about static replacements...

-- 
Jakub Narebski
Poland
