git.vger.kernel.org archive mirror
* Configuring git to forget removed files
@ 2010-02-20 10:37 Andrew Benton
  2010-02-20 15:41 ` Tim Visher
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Andrew Benton @ 2010-02-20 10:37 UTC (permalink / raw)
  To: git

Hello world
I have a project that I store in a git repository. It's a bunch of source tarballs and
some bash scripts to compile it all. Git makes it easy to distribute any changes I make
across the computers I run. The problem I have is that over time the repository gets ever
larger. When I update to a newer version of something I git rm the old tarball but git
still keeps a copy and the folder grows ever larger. At the moment the only solution I
have is to periodically rm -rf .git and start again. This works but is less than ideal
because I lose all the history for my build scripts.
What I would like is to be able to tell git to not keep a copy of anything that has been
git rm. The build scripts never get removed, only altered so their history would be
preserved. Is it possible to make git delete its backup copies of removed files?

Andy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Configuring git to forget removed files
  2010-02-20 10:37 Configuring git to forget removed files Andrew Benton
@ 2010-02-20 15:41 ` Tim Visher
  2010-02-20 18:50 ` Avery Pennarun
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Tim Visher @ 2010-02-20 15:41 UTC (permalink / raw)
  To: Andrew Benton; +Cc: git

Hi Andy,

On Sat, Feb 20, 2010 at 5:37 AM, Andrew Benton <b3nton@gmail.com> wrote:
> I have a project that I store in a git repository. It's a bunch of source
> tarballs and some bash scripts to compile it all. Git makes it easy to
> distribute any changes I make across the computers I run. The problem I have
> is that over time the repository gets ever larger. When I update to a newer
> version of something I git rm the old tarball but git still keeps a copy and
> the folder grows ever larger. At the moment the only solution I have is to
> periodically rm -rf .git and start again. This works but is less than ideal
> because I lose all the history for my build scripts.
>
> What I would like is to be able to tell git to not keep a copy of anything
> that has been git rm. The build scripts never get removed, only altered so
> their history would be preserved. Is it possible to make git delete its backup
> copies of removed files?

I don't know if I can really speak to your hoped-for conclusion,
although `git filter-branch` is where you want to look for rewriting
history.  However, that's also an entirely impractical solution if
your repo is at all public, because it would completely break sharing.
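
As a rough sketch of what that rewrite would look like (a throwaway
repo with made-up tarball names, so the commands are safe to try; note
that the backup refs and reflog have to be expired before the space
actually comes back):

```shell
# Sketch only: rewrite history to drop an old tarball, then expire the
# backups git keeps so the space is actually reclaimed.  Everything
# happens in a throwaway repo; the file names are made up.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name you

echo 'old release' > frotz-1.42.tar.gz
git add . && git commit -qm 'add frotz-1.42'
git rm -q frotz-1.42.tar.gz
echo 'new release' > frotz-1.43.tar.gz
git add . && git commit -qm 'replace with frotz-1.43'

# Remove frotz-1.42.tar.gz from every commit's tree
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --index-filter \
    'git rm --cached --ignore-unmatch frotz-1.42.tar.gz' -- --all

# Drop the backup refs and prune the now-unreferenced objects
rm -rf .git/refs/original
git reflog expire --expire=now --all
git gc --prune=now
```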

That being said, have you thought of changing your repo strategy?
IMHO, storing binary blobs that change at all regularly in _any_ SCMS
is a problem waiting to happen.  It's different if you have assets
that are fairly stable like images for a system's UI or dependencies
that have been stabilized, but that doesn't sound like your situation.

As a thought, why not try something along the lines of maintaining a
symlink such as 'foolib-latest' that points to whatever tarball your
project currently depends on, and keeping the tarballs themselves in a
separate directory outside the repository?  You could maintain backups
of that directory using a tool like rsync (since you obviously aren't
concerned with maintaining history there) rather than git.  Then you
could decide arbitrarily how many backups you want to keep, and track
which version of the file went with which commit in your repo.  The
main problem I see with that is that you lose a lot of the advantages
of having an SCMS, because you can't reliably check out a previous
commit and build it; at least not without some very serious effort.

Another possible solution if you maintain the sources that are
generating the tarballs is to treat the tarballs as artifacts of the
build rather than as assets that should be managed by the SCMS.  In
that way, you might spend more time during each build but your repo
would be much cleaner and would have the added advantage of being able
to completely build itself at every commit point.

Anyway, just food for thought.


-- 

In Christ,

Timmy V.

http://burningones.com/
http://five.sentenc.es/ - Spend less time on e-mail


* Re: Configuring git to forget removed files
  2010-02-20 10:37 Configuring git to forget removed files Andrew Benton
  2010-02-20 15:41 ` Tim Visher
@ 2010-02-20 18:50 ` Avery Pennarun
  2010-02-20 19:16 ` Junio C Hamano
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Avery Pennarun @ 2010-02-20 18:50 UTC (permalink / raw)
  To: Andrew Benton; +Cc: git

On Sat, Feb 20, 2010 at 5:37 AM, Andrew Benton <b3nton@gmail.com> wrote:
> I have a project that I store in a git repository. It's a bunch of source
> tarballs and some bash scripts to compile it all. Git makes it easy to
> distribute any changes I make across the computers I run. The problem I
> have is that over time the repository gets ever larger. When I update to
> a newer version of something I git rm the old tarball but git still
> keeps a copy and the folder grows ever larger.

You can use 'git filter-branch', as Tim already mentioned, or use a
git 'shallow clone' to only get the most recent versions of things.
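
For instance (a local throwaway demonstration; with a real repository
you would just pass its URL):

```shell
# Sketch: a depth-1 clone transfers only the newest commit, so tarballs
# that were deleted in older history never reach the new machine.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q origin && cd origin
git config user.email you@example.com && git config user.name you
echo 'old release' > frotz-1.42.tar.gz
git add . && git commit -qm 'add frotz-1.42'
git rm -q frotz-1.42.tar.gz && echo 'new release' > frotz-1.43.tar.gz
git add . && git commit -qm 'replace with frotz-1.43'
cd ..

# file:// (rather than a plain path) is needed for --depth to apply
git clone -q --depth 1 "file://$tmp/origin" shallow
cd shallow
git rev-list --count HEAD    # the clone carries a single commit
```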

Alternatively, have you thought about storing *uncompressed* tarballs
in git instead of compressed ones?  Then when you update to a newer
version, git can compute an xdelta from one to the other and store
only the changes.  That means you can have full history *and* not
waste too much disk space.  Git compresses the objects anyway when it
stores them in the repository.

Have fun,

Avery


* Re: Configuring git to forget removed files
  2010-02-20 10:37 Configuring git to forget removed files Andrew Benton
  2010-02-20 15:41 ` Tim Visher
  2010-02-20 18:50 ` Avery Pennarun
@ 2010-02-20 19:16 ` Junio C Hamano
  2010-02-21  2:47 ` Jonathan Nieder
  2010-02-21 20:32 ` Larry D'Anna
  4 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2010-02-20 19:16 UTC (permalink / raw)
  To: Andrew Benton; +Cc: git

Andrew Benton <b3nton@gmail.com> writes:

> I have a project that I store in a git repository. It's a bunch of source tarballs and
> some bash scripts to compile it all. Git makes it easy to distribute any changes I make
> across the computers I run. The problem I have is that over time the repository gets ever
> larger. When I update to a newer version of something I git rm the old tarball but git
> still keeps a copy and the folder grows ever larger. At the moment the only solution I
> have is to periodically rm -rf .git and start again. This works but is less than ideal
> because I lose all the history for my build scripts.
> What I would like is to be able to tell git to not keep a copy of anything that has been
> git rm. The build scripts never get removed, only altered so their history would be
> preserved. Is it possible to make git delete its backup copies of removed files?

You are either being unreasonable, or haven't thought things through.

Let's say you have your build script with a tarball of frotz-1.42.tar.gz
in the initial revision.  The script extracts from the tarball and builds.

Now you update your build script once, and make a commit.

Then you add frotz-1.43.tar.gz and remove frotz-1.42.tar.gz.  You may
adjust the build script to extract frotz-1.43 instead of frotz-1.42 in the
same commit, or your script may be written loosely and extract any tarball
that matches the frotz-*.tar.gz wildcard, in which case the build script
may not change.

You now have three commits:

 - initial one: ships frotz-1.42 and builds it;
 - second one: ships frotz-1.42 and builds it better;
 - third one: ships frotz-1.43 and builds it in some way.

You clone it to some other machine and build the tip; everything goes well
and you are happy.  What should happen if you do:

 $ git checkout HEAD^
 $ make

Should it build frotz-1.43, or should it fail?

If you somehow obliterate frotz-1.42.tar.gz out of the history with some
magic you described, there should not be any frotz-1.42.tar.gz in the
history, so there is no way you can build frotz-1.42 out of this checkout.
Your "second" tree can only have one or two shapes:

 - It can record only build script and nothing else, in which case the
   above "make" will have to fail.

 - With some magic you described, it records your build script and
   frotz-1.43.tar.gz, and frotz-1.43 is built.

You need to realize that the magic has to adjust your build script so
that it does not require the exact version of frotz-1.42.  Namely, the
build script you wrote not only knew that the next version of the tarball
would match the frotz-*.tar.gz wildcard (and that is why you can extract
the contents from it), but also somehow anticipated the build
infrastructure change the upstream would make when they updated from 1.42
to 1.43, and was magically capable of building either version.  And you
did that back when you didn't have the source to frotz-1.43 or know how
it would look.

You also need to realize that nowhere in your set-up, up to the point you
made three commits, did you ever tell anybody that frotz-1.43.tar.gz
replaces frotz-1.42.tar.gz.  The only thing you said was to remove
frotz-1.42.tar.gz.

If you make the checkout of the second one fail to build, because your
"obliterate" does not include any tarball in the second version, then you
are being unreasonable.

If you are asking for the magic to include frotz-1.43 instead of
frotz-1.42, and further adjust your old build script to anticipate
the change between 1.42 and 1.43, you haven't described how that magic
should happen, so you haven't thought things through.

One way out would be to do it like this instead:

 - initial one: your build script, and frotz-1.42 extracted in frotz/
   directory already.  Do not ship a tarball.

 - second one: your improved build script, and the same frotz/ directory
   without any change.

 - third one: your build script, either improved or the same from the
   previous one, and frotz-1.43 extracted in frotz/ directory.

This way, the checkout from the second one will build frotz-1.42.  Also
you could see if your build scripts from the second version would build
frotz-1.43 as well by doing something like:

    $ git checkout HEAD^
    $ git checkout master -- frotz/
    $ make

You will ship both versions of frotz, but between 1.42 and 1.43 there
will be a lot of similarities, so the packed result will be far smaller
than storing two compressed tarballs.  In fact, I wouldn't be surprised
if it were smaller than storing one compressed tarball.
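
That layout can be sketched end to end with tiny synthetic "releases"
standing in for the real tarballs:

```shell
# Sketch of the layout above: track the *extracted* tree in frotz/,
# never the tarball itself.  The releases here are tiny stand-ins.
set -e
tmp=$(mktemp -d) && cd "$tmp"
mkdir frotz-1.42 && echo 'version 1.42' > frotz-1.42/main.c
tar -czf frotz-1.42.tar.gz frotz-1.42
mkdir frotz-1.43 && echo 'version 1.43' > frotz-1.43/main.c
tar -czf frotz-1.43.tar.gz frotz-1.43
rm -rf frotz-1.42 frotz-1.43

git init -q repo && cd repo
git config user.email you@example.com && git config user.name you
# initial one: frotz-1.42 extracted in frotz/, no tarball shipped
tar -xzf ../frotz-1.42.tar.gz && mv frotz-1.42 frotz
git add frotz && git commit -qm 'import frotz-1.42'
# third one: replace the tracked tree with frotz-1.43
rm -rf frotz
tar -xzf ../frotz-1.43.tar.gz && mv frotz-1.43 frotz
git add -A && git commit -qm 'update to frotz-1.43'
```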


* Re: Configuring git to forget removed files
  2010-02-20 10:37 Configuring git to forget removed files Andrew Benton
                   ` (2 preceding siblings ...)
  2010-02-20 19:16 ` Junio C Hamano
@ 2010-02-21  2:47 ` Jonathan Nieder
  2010-02-21 13:32   ` Andrew Benton
  2010-02-21 20:32 ` Larry D'Anna
  4 siblings, 1 reply; 8+ messages in thread
From: Jonathan Nieder @ 2010-02-21  2:47 UTC (permalink / raw)
  To: Andrew Benton; +Cc: git

Hi Andy,

Andrew Benton wrote:

> I have a project that I store in a git repository. It's a bunch of
> source tarballs and some bash scripts to compile it all. Git makes
> it easy to distribute any changes I make across the computers I run.

This is not really what git is intended to do.

 - git generally works better with files that are easy to diff; see
   the “filter” attribute in gitattributes(5) for one way this is
   sometimes achieved with meaningful binary files (e.g., the
   compressed files OpenOffice produces)

 - Though git can cope with large projects, it generally works best
   when tracking the smallest meaningful unit that can be tested alone.
   Submodules can be used to stitch such units together.

Thus if it is important to you to track the history of this project, I
would suggest giving each source tree its own repository and stitching
them together with a “supermodule” that tracks your scripts and
includes references to the appropriate versions of each source
package.

See http://who-t.blogspot.com/2009/04/big-fat-xorg-supermodule.html
for an example of this kind of thing.
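
A minimal sketch of such a supermodule, using throwaway local
repositories (recent git needs protocol.file.allow for file-based
submodules; the repo names are made up):

```shell
# Sketch: a "supermodule" pinning one source tree as a submodule,
# alongside the build scripts.  Throwaway local repos only.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q frotz && cd frotz
git config user.email you@example.com && git config user.name you
echo 'sources' > main.c && git add . && git commit -qm 'frotz sources'
cd ..

git init -q super && cd super
git config user.email you@example.com && git config user.name you
echo 'build it' > build.sh && git add . && git commit -qm 'build scripts'
# pin the frotz repo at its current commit under sources/frotz
git -c protocol.file.allow=always \
    submodule add "$tmp/frotz" sources/frotz
git commit -qm 'pin frotz at current version'
```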

On the other hand, I don’t get the impression it is so important here
to track the history from the beginning, so:

> The problem I have is that over time the repository gets ever
> larger. When I update to a newer version of something I git rm the
> old tarball but git still keeps a copy and the folder grows ever
> larger. At the moment the only solution I have is to periodically rm
> -rf .git and start again. This works but is less than ideal because
> I lose all the history for my build scripts.

Maybe you could keep the build scripts in a git repository and
synchronize the tarballs out of line with some other tool, such as
rsync or unison.

> Is it possible to make git delete its
> backup copies of removed files?

git is not intended to be a backup tool; as you’ve noticed, the older
versions gradually accumulate and it becomes apparent over time that
it would be really nice for older commits to expire.

I am guessing, but it sounds to me like what you are looking for is
something that is distributed like git but is a backup system.  Or in
other words, a way to record a few snapshots like LVM or btrfs, but
such that new snapshots can be easily transferred to another computer.
At least I would be glad to learn of such a tool. ;-)

Hope that helps,
Jonathan


* Re: Configuring git to forget removed files
  2010-02-21  2:47 ` Jonathan Nieder
@ 2010-02-21 13:32   ` Andrew Benton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Benton @ 2010-02-21 13:32 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git

On 21/02/10 02:47, Jonathan Nieder wrote:
> Maybe you could keep the build scripts in a git repository and
> synchronizing the tarballs out of line with some other tool, such as
> rsync or unison.

Thanks, that seems to do what I want. I'll use git to keep track of the
bash scripts and
rsync -r --delete --ignore-existing --progress --exclude '.git'
to synchronise the source tarballs.

Andy


* Re: Configuring git to forget removed files
  2010-02-20 10:37 Configuring git to forget removed files Andrew Benton
                   ` (3 preceding siblings ...)
  2010-02-21  2:47 ` Jonathan Nieder
@ 2010-02-21 20:32 ` Larry D'Anna
  2010-02-21 21:14   ` Jacob Helwig
  4 siblings, 1 reply; 8+ messages in thread
From: Larry D'Anna @ 2010-02-21 20:32 UTC (permalink / raw)
  To: Andrew Benton; +Cc: git

* Andrew Benton (b3nton@gmail.com) [100220 05:37]:
> Hello world
> I have a project that I store in a git repository. It's a bunch of source tarballs and
> some bash scripts to compile it all. Git makes it easy to distribute any changes I make
> across the computers I run. The problem I have is that over time the repository gets ever
> larger. When I update to a newer version of something I git rm the old tarball but git
> still keeps a copy and the folder grows ever larger. At the moment the only solution I
> have is to periodically rm -rf .git and start again. This works but is less than ideal
> because I lose all the history for my build scripts.
> What I would like is to be able to tell git to not keep a copy of anything that has been
> git rm. The build scripts never get removed, only altered so their history would be
> preserved. Is it possible to make git delete its backup copies of removed files?

This reminds me of a scenario I wish git had some way of supporting: I have a
large collection of mp3s that I have duplicated across several computers.  I
would love to be able to use git to sync changes between the copies, but there
are several problems: 

1) git is really slow when dealing with thousands of multi-megabyte blobs.

2) committing it to git is going to double the size of the directory, and I don't
really have space for that on one of the computers that the directory lives on.

3) there's no way to discard old history without breaking push and pull.

I'm not sure exactly what it would take to address 1, but 2 could be addressed
pretty easily using btrfs file clones (once btrfs is stable), and 3 could be
dealt with by improving support for shallow clones.

     --larry



* Re: Configuring git to forget removed files
  2010-02-21 20:32 ` Larry D'Anna
@ 2010-02-21 21:14   ` Jacob Helwig
  0 siblings, 0 replies; 8+ messages in thread
From: Jacob Helwig @ 2010-02-21 21:14 UTC (permalink / raw)
  To: Larry D'Anna; +Cc: Andrew Benton, git

On 15:32 Sun 21 Feb, Larry D'Anna wrote:
> * Andrew Benton (b3nton@gmail.com) [100220 05:37]:
> > Hello world
> > I have a project that I store in a git repository. It's a bunch of source tarballs and
> > some bash scripts to compile it all. Git makes it easy to distribute any changes I make
> > across the computers I run. The problem I have is that over time the repository gets ever
> > larger. When I update to a newer version of something I git rm the old tarball but git
> > still keeps a copy and the folder grows ever larger. At the moment the only solution I
> > have is to periodically rm -rf .git and start again. This works but is less than ideal
> > because I lose all the history for my build scripts.
> > What I would like is to be able to tell git to not keep a copy of anything that has been
> > git rm. The build scripts never get removed, only altered so their history would be
> > preserved. Is it possible to make git delete its backup copies of removed files?
> 
> This reminds me of a scenario I wish git had some way of supporting: I have a
> large collection of mp3s that I have duplicated across several computers.  I
> would love to be able to use git to sync changes between the copies, but there
> are several problems: 
> 
> 1) git is really slow when dealing with thousands of multi-megabyte blobs.
> 
> 2) committing it to git is going to double the size of the directory, and I don't
> really have space for that on one of the computers that the directory lives on.
> 
> 3) there's no way to discard old history without breaking push and pull.
> 
> I'm not sure exactly what it would take to address 1, but 2 could be addressed
> pretty easily using btrfs file clones (once btrfs is stable), and 3 could be
> dealt with by improving support for shallow clones.
> 
>      --larry

In all seriousness: Why not use a tool that was actually designed for
what you're trying to do? (Sync a music collection across computers.)
Something like syrep[0]?

[0] http://0pointer.de/lennart/projects/syrep

-- 
Jacob Helwig

