* [script] find largest pack objects
@ 2009-07-10  1:16 Antony Stubbs
  2009-07-10  3:34 ` Nicolas Pitre
  2009-07-10 11:43 ` Björn Steinbrink
  0 siblings, 2 replies; 4+ messages in thread
From: Antony Stubbs @ 2009-07-10  1:16 UTC
  To: git

Blog post about git pruning history and finding large objects in your  
repo: http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

This is a script I put together after migrating the Spring Modules
project from CVS using git-cvsimport (which I also had to patch to
get it to work on OS X / MacPorts). I wrote it because I wanted to get
rid of all the large jar files, documentation, etc. that had been
put into source control. However, if _large files_ have been deleted in
the latest revision, they can be hard to track down.

#!/bin/bash
#set -x

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
for y in $objects
do
	# extract the size in bytes and convert it to kB
	size=$((`echo $y | cut -f 5 -d ' '`/1024))
	# extract the compressed size in bytes and convert it to kB
	compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
	# extract the SHA
	sha=`echo $y | cut -f 1 -d ' '`
	# find the object's location in the repository tree
	other=`git rev-list --all --objects | grep $sha`
	#lineBreak=`echo -e "\n"`
	output="${output}\n${size},${compressedSize},${other}"
done

echo -e $output | column -t -s ', '
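
A possible variation on the parsing step, offered only as a sketch and not as part of the original post: awk splits on runs of whitespace, so the field positions do not depend on how the type column is padded, and summary lines are skipped by checking the type field. The function name largest_packed is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): reads
# `git verify-pack -v` output on stdin and prints the N largest objects
# as "<size kB> <packed kB> <SHA>", largest first. N defaults to 10.
largest_packed() {
	# keep only real object lines: the second field must be an object type
	awk '$2 ~ /^(commit|tree|blob|tag)$/ { print $3, $4, $1 }' |
		# sort on the size in bytes, largest first
		sort -k1,1nr |
		head -n "${1:-10}" |
		# convert both sizes from bytes to whole kB
		awk '{ print int($1 / 1024), int($2 / 1024), $3 }'
}

# Usage, inside a repository that has at least one pack:
#   git verify-pack -v .git/objects/pack/pack-*.idx | largest_packed 10
```

The SHAs it prints could then be looked up in a single saved copy of the rev-list output rather than running rev-list once per object.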

More info on the blog post: http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

Regards,
Antony Stubbs

Talk to me about Wicket, Spring, Maven consulting, small scale  
outsourcing to Australasia and India and Open Source development!

Check out the Spring Modules fork at http://wiki.github.com/astubbs/spring-modules
We've just done the first release of the project in over a year!

Website: http://sharca.com
Blog: http://stubbisms.wordpress.com
Linked In: http://www.linkedin.com/in/antonystubbs
Podcast: http://www.illegalargument.com


* Re: [script] find largest pack objects
  2009-07-10  1:16 [script] find largest pack objects Antony Stubbs
@ 2009-07-10  3:34 ` Nicolas Pitre
  2009-08-31 13:25   ` Antony Stubbs
  2009-07-10 11:43 ` Björn Steinbrink
  1 sibling, 1 reply; 4+ messages in thread
From: Nicolas Pitre @ 2009-07-10  3:34 UTC
  To: Antony Stubbs; +Cc: git

On Fri, 10 Jul 2009, Antony Stubbs wrote:

> Blog post about git pruning history and finding large objects in your repo:
> http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
> 
> This is a script I put together after migrating the Spring Modules project
> from CVS, using git-cvsimport (which I also had to patch, to get to work on OS
> X / MacPorts). I wrote it because I wanted to get rid of all the large jar
> files, and documentation etc, that had been put into source control. However,
> if _large files_ are deleted in the latest revision, then they can be hard to
> track down.
> 
> #!/bin/bash
> #set -x
> 
> # Shows you the largest objects in your repo's pack file.
> # Written for osx.
> #
> # @see
> http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
> # @author Antony Stubbs
> 
> # set the internal field spereator to line break, so that we can iterate
> easily over the verify-pack output
> IFS=$'\n';
> 
> # list all objects including their size, sort by size, take top 10
> objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain |
> sort -k3nr | head`
> 
> echo "All sizes are in kB's. The pack column is the size of the object,
> compressed, inside the pack file."
> 
> output="size,pack,SHA,location"
> for y in $objects
> do
> 	# extract the size in bytes
> 	size=$((`echo $y | cut -f 5 -d ' '`/1024))
> 	# extract the compressed size in bytes
> 	compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
> 	# extract the SHA
> 	sha=`echo $y | cut -f 1 -d ' '`
> 	# find the objects location in the repository tree
> 	other=`git rev-list --all --objects | grep $sha`
> 	#lineBreak=`echo -e "\n"`
> 	output="${output}\n${size},${compressedSize},${other}"
> done
> 
> echo -e $output | column -t -s ', '

This is certainly useful.  Mind submitting a patch adding this script to 
contrib/stats/ ?


Nicolas


* Re: [script] find largest pack objects
  2009-07-10  1:16 [script] find largest pack objects Antony Stubbs
  2009-07-10  3:34 ` Nicolas Pitre
@ 2009-07-10 11:43 ` Björn Steinbrink
  1 sibling, 0 replies; 4+ messages in thread
From: Björn Steinbrink @ 2009-07-10 11:43 UTC
  To: Antony Stubbs; +Cc: git

On 2009.07.10 13:16:50 +1200, Antony Stubbs wrote:
> Blog post about git pruning history and finding large objects in
> your repo: http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
> 
> This is a script I put together after migrating the Spring Modules
> project from CVS, using git-cvsimport (which I also had to patch, to
> get to work on OS X / MacPorts). I wrote it because I wanted to get
> rid of all the large jar files, and documentation etc, that had been
> put into source control. However, if _large files_ are deleted in
> the latest revision, then they can be hard to track down.

Here's my script, basically for the same purpose, but instead of looking
at the packfiles, it looks at the rev-list output to find those objects
that aren't prunable (ignoring the reflog). I'm also using a somewhat
ugly sed invocation so that rev-list runs only twice, regardless of the
number of objects to be shown, which greatly reduces the time required
to run the script.

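#!/bin/sh
# Print the $1 largest reachable blobs as "<size in bytes> <path>",
# smallest first.
git rev-list --all --objects |
	sed -n $(git rev-list --objects --all |
		cut -f1 -d' ' | git cat-file --batch-check | grep blob |
		sort -n -k3 | tail -n "$1" | while read hash type size;
		do
			# one sed expression per blob: swap the hash for its size
			# (printf instead of the non-portable "echo -n")
			printf '%s ' "-e s/$hash/$size/p";
		done) |
	sort -n -k1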

It takes the number of objects to be shown as an argument, so for the
top ten run as "git find-large 10" (assuming that the script is in $PATH
and called git-find-large).

It doesn't list as much information as yours does (the compressed size
is missing, for example), but it's good enough for me, and speed was far
more important. That matters especially because the "rev-list --all
--objects" trick gets you only a single filename per blob, so if there
were renames, you may need to run the script again after deleting one
version via filter-branch.

Something similar applies to deltified stuff. As verify-pack shows the
size of the delta, your script might miss some file B if it is
currently stored as a delta against some other large file A. Only after
the blob for A has been deleted will B be shown (as it is then no longer
deltified).
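
To see which objects are affected, the deltified entries can be picked out of the verify-pack output. The following is only a sketch with a hypothetical function name (list_deltified); it assumes the documented "verify-pack -v" layout, in which a deltified object's line carries two extra fields, the delta depth and the base object's SHA, for seven fields in total.

```shell
#!/bin/sh
# Hypothetical helper: reads `git verify-pack -v` output on stdin and
# prints "<SHA> <reported size> <base SHA>" for each deltified object.
# Assumes deltified lines have exactly seven whitespace-separated fields.
list_deltified() {
	awk '$2 ~ /^(commit|tree|blob|tag)$/ && NF == 7 { print $1, $3, $7 }'
}

# Usage, inside a repository that has at least one pack:
#   git verify-pack -v .git/objects/pack/pack-*.idx | list_deltified
```

An object listed here with a small reported size and a large base may still be a large file once its base is gone.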

OTOH, this means that the output of my script is likely to have the same
filename over and over again. If that gets out of hand, I usually do
something like:
git find-large 100 | cut -d' ' -f2- | sort -u

So I get just the filenames, hoping that the top 100 include all
interesting things ;-)

Maybe this helps someone to come up with a smart combination of our
scripts.
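
One possible combination, sketched under the assumption of a Git release newer than this thread in which "git cat-file --batch-check" accepts a format string with %(objectsize:disk) and %(rest): a single rev-list pass then yields the type, the full size, the on-disk size, and the path together. The helper name top_blobs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "<type> <size> <disk size> <path>" lines on
# stdin and prints the N largest blobs as "<size> <disk size> <path>",
# largest first. N defaults to 10.
top_blobs() {
	awk '$1 == "blob"' |
		sort -k2,2nr |
		head -n "${1:-10}" |
		# drop the type column; -f2- keeps paths that contain spaces
		cut -d' ' -f2-
}

# Usage, assuming a Git version that supports these format atoms:
#   git rev-list --all --objects |
#       git cat-file --batch-check='%(objecttype) %(objectsize) %(objectsize:disk) %(rest)' |
#       top_blobs 10
```

The single-filename-per-blob limitation of rev-list still applies here, and the on-disk size of a deltified blob is still the size of its delta.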

Björn


* Re: [script] find largest pack objects
  2009-07-10  3:34 ` Nicolas Pitre
@ 2009-08-31 13:25   ` Antony Stubbs
  0 siblings, 0 replies; 4+ messages in thread
From: Antony Stubbs @ 2009-08-31 13:25 UTC
  To: Nicolas Pitre; +Cc: git

Sorry Nicolas,
I completely missed your message amongst the torrent of the git mailing
list, which I'm now unsubscribed from. But sure, I'll add this to my
todo list :)

Cheers,
Antony Stubbs,

sharca.com


