git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fix git-pack-redundant crashing sometimes
@ 2005-11-15 15:49     ` Alex Riesen
  2005-11-15 16:08       ` Timo Hirvonen
                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Alex Riesen @ 2005-11-15 15:49 UTC (permalink / raw)
  To: git; +Cc: Lukas Sandström, junkio

[-- Attachment #1: Type: text/plain, Size: 278 bytes --]

llist_sorted_difference_inplace didn't handle the match in the first
sha1 correctly and the lists went wild everywhere.

---

I noticed it on a very big repository (more than 100k files), trying
to prune it.
The code could profit from double-linked lists greatly, IMHO.

[-- Attachment #2: fix-pack-redundant.patch --]
[-- Type: application/xxxxx, Size: 421 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 15:49     ` fix git-pack-redundant crashing sometimes Alex Riesen
@ 2005-11-15 16:08       ` Timo Hirvonen
  2005-11-15 16:11         ` Alex Riesen
  2005-11-15 21:38         ` Lukas Sandström
  2005-11-15 21:24       ` [PATCH] Fix llist_sorted_difference_inplace in git-pack-redundant Lukas Sandström
  2005-11-15 21:34       ` fix git-pack-redundant crashing sometimes Alex Riesen
  2 siblings, 2 replies; 25+ messages in thread
From: Timo Hirvonen @ 2005-11-15 16:08 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, lukass, junkio

On Tue, 15 Nov 2005 16:49:30 +0100
Alex Riesen <raa.lkml@gmail.com> wrote:

> llist_sorted_difference_inplace didn't handle the match in the first
> sha1 correctly and the lists went wild everywhere.
> 
> ---
> 
> I noticed it on a very big repository (more than 100k files), trying
> to prune it.
> The code could profit from double-linked lists greatly, IMHO.

I use list.h from Linux when I need double-linked lists.  It is very
easy to use, efficient and fast.

-- 
http://onion.dynserv.net/~timo/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 16:08       ` Timo Hirvonen
@ 2005-11-15 16:11         ` Alex Riesen
  2005-11-15 17:28           ` Linus Torvalds
  2005-11-15 21:38         ` Lukas Sandström
  1 sibling, 1 reply; 25+ messages in thread
From: Alex Riesen @ 2005-11-15 16:11 UTC (permalink / raw)
  To: Timo Hirvonen; +Cc: git, lukass, junkio

On 11/15/05, Timo Hirvonen <tihirvon@gmail.com> wrote:
> > The code could profit from double-linked lists greatly, IMHO.
>
> I use list.h from Linux when I need double-linked lists.  It is very
> easy to use, efficient and fast.
>

Yes, I had it in mind, too.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 16:11         ` Alex Riesen
@ 2005-11-15 17:28           ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2005-11-15 17:28 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Timo Hirvonen, git, lukass, junkio



On Tue, 15 Nov 2005, Alex Riesen wrote:

> On 11/15/05, Timo Hirvonen <tihirvon@gmail.com> wrote:
> > > The code could profit from double-linked lists greatly, IMHO.
> >
> > I use list.h from Linux when I need double-linked lists.  It is very
> > easy to use, efficient and fast.
> 
> Yes, I had it in mind, too.

There's a "ptrlist" implementation in sparse, which does a very 
space-efficient implementation of arbitrary lists of pointers. It's also 
very CPU-efficient (in fact, thanks to being very cache-friendly, more so 
than the kernel list.h implementation) for most well-behaved operations 
(ie it's not hugely friendly to removal, but it's great for "add to end" 
and "traverse the list").

The ptrlist implementation is optimized for cases where the lists are 
usually shortish, but not trivial. So they are worthwhile if you expect 
the list length to be on the order of 5-10 entries rather than just 1-2 
(if you have just a couple of entries, the kernel lists are better). But 
perhaps more importantly, they are very space-efficient when the lists 
grow bigger (ie they average to just about 5% over the memory cost of a 
single pointer per list element for long lists).

Sparse in general is under the OSL license, but I'll happily dual-license 
the ptrlist thing, it's trivial and all my code (and I already checked 
with the person who did the list sorting code that we can dual-license 
that too).

			Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] Fix llist_sorted_difference_inplace in git-pack-redundant
  2005-11-15 15:49     ` fix git-pack-redundant crashing sometimes Alex Riesen
  2005-11-15 16:08       ` Timo Hirvonen
@ 2005-11-15 21:24       ` Lukas Sandström
  2005-11-15 21:34       ` fix git-pack-redundant crashing sometimes Alex Riesen
  2 siblings, 0 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-15 21:24 UTC (permalink / raw)
  To: junkio; +Cc: Alex Riesen, git, Lukas Sandström

Simplify and actually make llist_sorted_difference_inplace work
by using llist_sorted_remove instead of duplicating parts of the
code.

Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>

---
Looking at the code; I must have been tired when I wrote llist_sorted_difference_inplace.
This should hopefully fix things up nicely. Note that I think the old version got the
case "remove last element from list" wrong too.

I think the Signed-off line is in utf8 this time. Tell me if it works better, of if I 
just should change my name ;)

 pack-redundant.c |   47 +++++++++++++++--------------------------------
 1 files changed, 15 insertions(+), 32 deletions(-)

applies-to: 4b6dbe856a3e63699b299c76f4f1fc5cb34cbe26
560aa97889ef028fed0678389d4bb05ebf3ead62
diff --git a/pack-redundant.c b/pack-redundant.c
index 28b82ee..fcb36ff 100644
--- a/pack-redundant.c
+++ b/pack-redundant.c
@@ -127,38 +127,6 @@ inline struct llist_item * llist_insert_
 	return llist_insert_back(list, sha1);
 }
 
-/* computes A\B */
-void llist_sorted_difference_inplace(struct llist *A,
-				     struct llist *B)
-{
-	struct llist_item *prev, *a, *b, *x;
-
-	prev = a = A->front;
-	b = B->front;
-
-	while (a != NULL && b != NULL) {
-		int cmp = memcmp(a->sha1, b->sha1, 20);
-		if (!cmp) {
-			x = a;
-			if (a == A->front)
-				A->front = a->next;
-			a = prev->next = a->next;
-
-			if (a == NULL) /* end of list */
-				A->back = prev;
-			A->size--;
-			free(x);
-			b = b->next;
-		} else
-			if (cmp > 0)
-				b = b->next;
-			else {
-				prev = a;
-				a = a->next;
-			}
-	}
-}
-
 /* returns a pointer to an item in front of sha1 */
 inline struct llist_item * llist_sorted_remove(struct llist *list, char *sha1,
 					       struct llist_item *hint)
@@ -194,6 +162,21 @@ redo_from_start:
 	return prev;
 }
 
+/* computes A\B */
+void llist_sorted_difference_inplace(struct llist *A,
+				     struct llist *B)
+{
+	struct llist_item *hint, *b;
+
+	hint = NULL;
+	b = B->front;
+
+	while (b) {
+		hint = llist_sorted_remove(A, b->sha1, hint);
+		b = b->next;
+	}
+}
+
 inline struct pack_list * pack_list_insert(struct pack_list **pl,
 					   struct pack_list *entry)
 {
---
0.99.9.GIT

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 15:49     ` fix git-pack-redundant crashing sometimes Alex Riesen
  2005-11-15 16:08       ` Timo Hirvonen
  2005-11-15 21:24       ` [PATCH] Fix llist_sorted_difference_inplace in git-pack-redundant Lukas Sandström
@ 2005-11-15 21:34       ` Alex Riesen
  2005-11-15 21:41         ` Lukas Sandström
  2 siblings, 1 reply; 25+ messages in thread
From: Alex Riesen @ 2005-11-15 21:34 UTC (permalink / raw)
  To: git; +Cc: Lukas Sandström, junkio

Alex Riesen, Tue, Nov 15, 2005 16:49:30 +0100:
> llist_sorted_difference_inplace didn't handle the match in the first
> sha1 correctly and the lists went wild everywhere.

This wasn't enough. It never (>5 min on 2.4GHz PIV) finishes on
my local mirror of git, which has 22 packs by now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 16:08       ` Timo Hirvonen
  2005-11-15 16:11         ` Alex Riesen
@ 2005-11-15 21:38         ` Lukas Sandström
  1 sibling, 0 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-15 21:38 UTC (permalink / raw)
  To: git; +Cc: Timo Hirvonen, junkio

Timo Hirvonen wrote:
> On Tue, 15 Nov 2005 16:49:30 +0100
> Alex Riesen <raa.lkml@gmail.com> wrote:
> 
> 
>>llist_sorted_difference_inplace didn't handle the match in the first
>>sha1 correctly and the lists went wild everywhere.
>>
>>---
>>
>>I noticed it on a very big repository (more than 100k files), trying
>>to prune it.
>>The code could profit from double-linked lists greatly, IMHO.
> 
> 
> I use list.h from Linux when I need double-linked lists.  It is very
> easy to use, efficient and fast.
> 
Well, well. Live and learn I suppose. I have never heard of it before.

How portable is it to #include <linux/list.h>? (non-rethorical)

Would there be a performance benefit from doubly-linked list or would
it just simplify parts of the code? My first implementation used d-lists,
but I dropped the ->prev when I realised it was never used.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 21:34       ` fix git-pack-redundant crashing sometimes Alex Riesen
@ 2005-11-15 21:41         ` Lukas Sandström
  2005-11-15 22:33           ` Alex Riesen
  0 siblings, 1 reply; 25+ messages in thread
From: Lukas Sandström @ 2005-11-15 21:41 UTC (permalink / raw)
  To: Alex Riesen, git; +Cc: junkio

Alex Riesen wrote:
> Alex Riesen, Tue, Nov 15, 2005 16:49:30 +0100:
> 
>>llist_sorted_difference_inplace didn't handle the match in the first
>>sha1 correctly and the lists went wild everywhere.
> 
> 
> This wasn't enough. It never (>5 min on 2.4GHz PIV) finishes on
> my local mirror of git, which has 22 packs by now.
> 

If the patch I just sent out doesn't fix the problem, and you don't
want to debug it yourself, could you please publish the .idx files
so I could have a look at them?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 21:41         ` Lukas Sandström
@ 2005-11-15 22:33           ` Alex Riesen
  2005-11-15 23:13             ` Lukas Sandström
                               ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Alex Riesen @ 2005-11-15 22:33 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, junkio

Lukas Sandström, Tue, Nov 15, 2005 22:41:34 +0100:
> >>llist_sorted_difference_inplace didn't handle the match in the first
> >>sha1 correctly and the lists went wild everywhere.
> > 
> > This wasn't enough. It never (>5 min on 2.4GHz PIV) finishes on
> > my local mirror of git, which has 22 packs by now.
> 
> If the patch I just sent out doesn't fix the problem, and you don't

Sorry, it doesn't. The code loops here:

401             /* find the permutations which contain all missing objects */
402             perm_all = perm = get_all_permutations(non_unique);
403             while (perm) {
404                     if (is_superset(perm->pl, missing)) {
405                             new_perm = xmalloc(sizeof(struct pll));
406                             new_perm->pl = perm->pl;
407                             new_perm->next = perm_ok;
408                             perm_ok = new_perm;
(gdb) 
409                     }
410                     perm = perm->next;
411             }
412
413             if (perm_ok == NULL)


> want to debug it yourself, could you please publish the .idx files
> so I could have a look at them?

Of course: http://home.arcor.de/fork0/download/idx.tar.gz

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 22:33           ` Alex Riesen
@ 2005-11-15 23:13             ` Lukas Sandström
  2005-11-16  7:01               ` Alex Riesen
  2005-11-16 21:11               ` Alex Riesen
  2005-11-15 23:58             ` Linus Torvalds
  2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
  2 siblings, 2 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-15 23:13 UTC (permalink / raw)
  To: git; +Cc: Alex Riesen, junkio

Alex Riesen wrote:
> Lukas Sandström, Tue, Nov 15, 2005 22:41:34 +0100:
> 
>>>>llist_sorted_difference_inplace didn't handle the match in the first
>>>>sha1 correctly and the lists went wild everywhere.
>>>
>>>This wasn't enough. It never (>5 min on 2.4GHz PIV) finishes on
>>>my local mirror of git, which has 22 packs by now.
>>
>>If the patch I just sent out doesn't fix the problem, and you don't
> 
> 
> Sorry, it doesn't. The code loops here:
> 
> 401             /* find the permutations which contain all missing objects */
> 402             perm_all = perm = get_all_permutations(non_unique);
> 403             while (perm) {
> 404                     if (is_superset(perm->pl, missing)) {
> 405                             new_perm = xmalloc(sizeof(struct pll));
> 406                             new_perm->pl = perm->pl;
> 407                             new_perm->next = perm_ok;
> 408                             perm_ok = new_perm;
> (gdb) 
> 409                     }
> 410                     perm = perm->next;
> 411             }
> 412
> 413             if (perm_ok == NULL)
> 
> 
> 
>>want to debug it yourself, could you please publish the .idx files
>>so I could have a look at them?
> 
> 
> Of course: http://home.arcor.de/fork0/download/idx.tar.gz
> 

After giving it a quick look, I don't think it locks up. It is just horribly
slow. get_all_permutations returns (nice ASCII-art follows)

    ___n___
    \
     \ ____1____
n! * /  k!(n-k)!
    /______
      k=1

permutations, which for your case (22 packs) adds up to 4194303.

I'll look into an optimization so we won't have to call is_superset
for every one of them.

OTOH, I might be wrong and it could very well be an infinite loop.
Mind running it over the night? I won't look further into this in
20 hours or so anyway.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 22:33           ` Alex Riesen
  2005-11-15 23:13             ` Lukas Sandström
@ 2005-11-15 23:58             ` Linus Torvalds
  2005-11-16 20:13               ` Junio C Hamano
  2005-11-16 21:37               ` Lukas Sandström
  2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
  2 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2005-11-15 23:58 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Lukas Sandström, git, junkio



On Tue, 15 Nov 2005, Alex Riesen wrote:
> 
> Sorry, it doesn't. The code loops here:
> 
> 401             /* find the permutations which contain all missing objects */
> 402             perm_all = perm = get_all_permutations(non_unique);

Looks like the whole thing is exponential.

A good way to do sane pack handling is to keep a _sorted_ list of all 
objects each pack has. At that point it becomes much easier to see which 
objects only exist in one particular pack.

The sorting itself is O(nlogn), and the "does this pack have any unique 
objects" (or "is this pack a superset of all other packs") question should 
then be O(n).

This is how the packs are generated in the first place - by sorting the 
objects and limiting the diff generation to the "neighbors" of the 
objects, you can generate pack-files in O(n logn) rather than O(n**2). 
Which is pretty important when there are currently 140,000+ objects in the 
kernel tree.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 23:13             ` Lukas Sandström
@ 2005-11-16  7:01               ` Alex Riesen
  2005-11-16 21:11               ` Alex Riesen
  1 sibling, 0 replies; 25+ messages in thread
From: Alex Riesen @ 2005-11-16  7:01 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, junkio

On 11/16/05, Lukas Sandström <lukass@etek.chalmers.se> wrote:
> OTOH, I might be wrong and it could very well be an infinite loop.
> Mind running it over the night? I won't look further into this in
> 20 hours or so anyway.

You missed by some minutes and I fell asleep :) Now, it is some 12
hours until I get to the machine with the lot of packs.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 23:58             ` Linus Torvalds
@ 2005-11-16 20:13               ` Junio C Hamano
  2005-11-16 21:37               ` Lukas Sandström
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2005-11-16 20:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alex Riesen, Lukas Sandström, git, junkio

Linus Torvalds <torvalds@osdl.org> writes:

> On Tue, 15 Nov 2005, Alex Riesen wrote:
>> 
>> Sorry, it doesn't. The code loops here:
>> 
>> 401             /* find the permutations which contain all missing objects */
>> 402             perm_all = perm = get_all_permutations(non_unique);
>
> Looks like the whole thing is exponential.

This reminds me of one thing I have been meaning to remove.

The $GIT_OBJECT_DIRECTORY/info/pack file produced by
update-server-info records not just the packs (P lines), but its
dependencies (D lines) and top objects (T lines).  If there are
more than one packs, D lines are meant to help the downloader to
figure out what other packs are needed to complete.

"meant to" is the keyword here.  The original heuristics was
very bad (still correct but way suboptimal) and it has never
been updated.  I do not use that information myself in http
fetch, and I do not think any Porcelain uses it.

Top objects are tags and commits in the pack that are not
reachable from any other object in the same pack.  This can be
useful if you obtain just a packfile -- you can run server-info
to produce info/pack file, and resurrect the branch heads and
tags by using them --- you would not know what they are called,
though.  I have been thinking it may be worthwhile to also
record bottom objects (B lines, naturally) --- tags and commits
in the pack that refer to tags or commits not present in the
same pack.  This would help commit walkers to decide which pack
is more efficient to slurp when faced with multiple choices.  If
you see a pack all whose bottom objects you have in your
repository, fetching that would complete the commit walking, so
the choice would become the matter of choosing such a pack among
multiple that has smallest number of objects you do not have
locally -- you would have the index file for each pack at that
point.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 23:13             ` Lukas Sandström
  2005-11-16  7:01               ` Alex Riesen
@ 2005-11-16 21:11               ` Alex Riesen
  1 sibling, 0 replies; 25+ messages in thread
From: Alex Riesen @ 2005-11-16 21:11 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, junkio

Lukas Sandström, Wed, Nov 16, 2005 00:13:14 +0100:
> After giving it a quick look, I don't think it locks up. It is just horribly
> slow. get_all_permutations returns (nice ASCII-art follows)
> 
>     ___n___
>     \
>      \ ____1____
> n! * /  k!(n-k)!
>     /______
>       k=1
> 
> permutations, which for your case (22 packs) adds up to 4194303.
> 
> I'll look into an optimization so we won't have to call is_superset
> for every one of them.
> 
> OTOH, I might be wrong and it could very well be an infinite loop.
> Mind running it over the night? I won't look further into this in
> 20 hours or so anyway.

It doesn't lock up:

$ time git-pack-redundant --all --alt-odb
.git/objects/pack/pack-3c5133604508466855453f3e609428f4bbba9131.idx
.git/objects/pack/pack-3c5133604508466855453f3e609428f4bbba9131.pack
...

real    24m55.277s
user    24m50.778s
sys     0m0.582s

Now I shall be called "too impatient", perhaps?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-15 23:58             ` Linus Torvalds
  2005-11-16 20:13               ` Junio C Hamano
@ 2005-11-16 21:37               ` Lukas Sandström
  2005-11-16 23:59                 ` Lukas Sandström
  2005-11-17  7:08                 ` Fredrik Kuivinen
  1 sibling, 2 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-16 21:37 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Alex Riesen, junkio, Lukas Sandström

Linus Torvalds wrote:
> 
> On Tue, 15 Nov 2005, Alex Riesen wrote:
> 
>>Sorry, it doesn't. The code loops here:
>>
>>401             /* find the permutations which contain all missing objects */
>>402             perm_all = perm = get_all_permutations(non_unique);
> 
> 
> Looks like the whole thing is exponential.
> 
Slightly less, but not far from.
> A good way to do sane pack handling is to keep a _sorted_ list of all 
> objects each pack has. At that point it becomes much easier to see which 
> objects only exist in one particular pack.
> 
git-pack-redundant already does this.
> The sorting itself is O(nlogn), and the "does this pack have any unique 
> objects" (or "is this pack a superset of all other packs") question should 
> then be O(n).
> 
Ah, but the question is: "Does this set of packs contain a superset of
all objects available in packfiles?" The answer to the question for a
certain set is indeed O(n), but the number of sets which have to be tested
are ~ O(e**n). (Where n is the number of non-unique packs.)

/Lukas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* git-pack-redundant returns the most containing pack
@ 2005-11-16 23:09 Alex Riesen
  2005-11-16 23:23 ` Lukas Sandström
  0 siblings, 1 reply; 25+ messages in thread
From: Alex Riesen @ 2005-11-16 23:09 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git

...which very confusing: "git-repack -a -d" leaves the repository with
exactly the same packs as before, by creating a super-pack, and then
happily removing it, because pack-redundant returns the newly created
pack!

So, even if it is logically correct, it's hardly useful in practice.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: git-pack-redundant returns the most containing pack
  2005-11-16 23:09 git-pack-redundant returns the most containing pack Alex Riesen
@ 2005-11-16 23:23 ` Lukas Sandström
  2005-11-17  7:45   ` Alex Riesen
  0 siblings, 1 reply; 25+ messages in thread
From: Lukas Sandström @ 2005-11-16 23:23 UTC (permalink / raw)
  To: git; +Cc: Alex Riesen, Lukas Sandström

Alex Riesen wrote:
> ...which very confusing: "git-repack -a -d" leaves the repository with
> exactly the same packs as before, by creating a super-pack, and then
> happily removing it, because pack-redundant returns the newly created
> pack!
> 
> So, even if it is logically correct, it's hardly useful in practice.
> 

That's bad. Your new pack should contain some objects not present in
the older packfiles and thus it shouldn't be removed, unless there
were no new objects to pack. 

If no new objects were packed, the sum of the old packs might be smaller 
than the new superpack, or the old packs could contain unreachable objects,
which makes git-pack-redundant unable to detect that they should be removed.

Could you try updating to the latest snapshot? There was a bug in a
list handling function which was fixed recently, perhaps your problem
is related.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-16 21:37               ` Lukas Sandström
@ 2005-11-16 23:59                 ` Lukas Sandström
  2005-11-17 16:56                   ` Matthias Urlichs
  2005-11-17  7:08                 ` Fredrik Kuivinen
  1 sibling, 1 reply; 25+ messages in thread
From: Lukas Sandström @ 2005-11-16 23:59 UTC (permalink / raw)
  To: git; +Cc: Lukas Sandström, Linus Torvalds, Alex Riesen, junkio

Lukas Sandström wrote:
> Linus Torvalds wrote:
> 
>>On Tue, 15 Nov 2005, Alex Riesen wrote:
>>
>>
>>>Sorry, it doesn't. The code loops here:
>>>
>>>401             /* find the permutations which contain all missing objects */
>>>402             perm_all = perm = get_all_permutations(non_unique);
>>
>>
>>Looks like the whole thing is exponential.
>>
> 
> Slightly less, but not far from.
> 
After doing some maths; actually it is very exponential. 2**n - 1 to be precise.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-16 21:37               ` Lukas Sandström
  2005-11-16 23:59                 ` Lukas Sandström
@ 2005-11-17  7:08                 ` Fredrik Kuivinen
  1 sibling, 0 replies; 25+ messages in thread
From: Fredrik Kuivinen @ 2005-11-17  7:08 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, Linus Torvalds, Alex Riesen, junkio

On Wed, Nov 16, 2005 at 10:37:42PM +0100, Lukas Sandström wrote:
> Linus Torvalds wrote:
> > 
> > On Tue, 15 Nov 2005, Alex Riesen wrote:
> > 
> >>Sorry, it doesn't. The code loops here:
> >>
> >>401             /* find the permutations which contain all missing objects */
> >>402             perm_all = perm = get_all_permutations(non_unique);
> > 
> > 
> > Looks like the whole thing is exponential.
> > 
> Slightly less, but not far from.
> > A good way to do sane pack handling is to keep a _sorted_ list of all 
> > objects each pack has. At that point it becomes much easier to see which 
> > objects only exist in one particular pack.
> > 
> git-pack-redundant already does this.
> > The sorting itself is O(nlogn), and the "does this pack have any unique 
> > objects" (or "is this pack a superset of all other packs") question should 
> > then be O(n).
> > 
> Ah, but the question is: "Does this set of packs contain a superset of
> all objects available in packfiles?" The answer to the question for a
> certain set is indeed O(n), but the number of sets which have to be tested
> are ~ O(e**n). (Where n is the number of non-unique packs.)
> 

If haven't looked closely at neither the packing nor the pack
redundant code, but the problem looks very similar to Minimum Set
Cover (see http://www.nada.kth.se/~viggo/wwwcompendium/node146.html
for more information), which is NP-hard. So, if Minimum Set Cover is
the problem we are trying to solve we can't hope to find a polynomial
time algorithm. However, there is a quite simple greedy approximation
algorithm in [1] which is referenced from the URL above.


- Fredrik

[1]: Johnson, D. S. (1974), "Approximation algorithms for
     combinatorial problems", J. Comput. System Sci. 9, 256-278.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: git-pack-redundant returns the most containing pack
  2005-11-16 23:23 ` Lukas Sandström
@ 2005-11-17  7:45   ` Alex Riesen
  2005-11-15 15:49     ` fix git-pack-redundant crashing sometimes Alex Riesen
  0 siblings, 1 reply; 25+ messages in thread
From: Alex Riesen @ 2005-11-17  7:45 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git

On 11/17/05, Lukas Sandström <lukass@etek.chalmers.se> wrote:
> > ...which very confusing: "git-repack -a -d" leaves the repository with
> > exactly the same packs as before, by creating a super-pack, and then
> > happily removing it, because pack-redundant returns the newly created
> > pack!
> >
> > So, even if it is logically correct, it's hardly useful in practice.
>
> That's bad. Your new pack should contain some objects not present in
> the older packfiles and thus it shouldn't be removed, unless there
> were no new objects to pack.

there weren't: "ls .git/objects" showed only pack/ and info/

> If no new objects were packed, the sum of the old packs might be smaller
> than the new superpack, or the old packs could contain unreachable objects,
> which makes git-pack-redundant unable to detect that they should be removed.

that _could_ be the case. I run git-fsck-objects --full in that
repository and saw some unreferenced tags.

> Could you try updating to the latest snapshot? There was a bug in a
> list handling function which was fixed recently, perhaps your problem
> is related.

will try, but I didn't realize yesterday that it might be a good idea
to keep the old repository around. The lot of packs was automatically
created by incremental repacking after every pull. Sorry...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs
  2005-11-15 22:33           ` Alex Riesen
  2005-11-15 23:13             ` Lukas Sandström
  2005-11-15 23:58             ` Linus Torvalds
@ 2005-11-17 13:11             ` Lukas Sandström
  2005-11-17 20:39               ` Alex Riesen
                                 ` (2 more replies)
  2 siblings, 3 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-17 13:11 UTC (permalink / raw)
  To: git; +Cc: Alex Riesen, junkio, Lukas Sandström

Change the smallest-set detection algortithm so that when
we have found a good set, we don't check any larger sets.

Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>

---
This change might make git-pack-redundant return non-optimal
pack-sets in some cases. OTOH, the case which Alex reported
completes in 0.6 sec instead of several minutes.

 pack-redundant.c |   62 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 49 insertions(+), 13 deletions(-)

applies-to: 1e3fcf60526c196a46433e6947c9104ca236f230
498cc056d758b9e1cd21e984be281d55e24f1117
diff --git a/pack-redundant.c b/pack-redundant.c
index fcb36ff..51d7341 100644
--- a/pack-redundant.c
+++ b/pack-redundant.c
@@ -33,6 +33,7 @@ struct pack_list {
 struct pll {
 	struct pll *next;
 	struct pack_list *pl;
+	size_t pl_size;
 };
 
 inline void llist_free(struct llist *list)
@@ -249,18 +250,45 @@ void cmp_two_packs(struct pack_list *p1,
 	}
 }
 
+void pll_insert(struct pll **pll, struct pll **hint_table)
+{
+	struct pll *prev;
+	int i = (*pll)->pl_size - 1;
+
+	if (hint_table[i] == NULL) {
+		hint_table[i--] = *pll;
+		for (; i >= 0; --i) {
+			if (hint_table[i] != NULL)
+				break;
+		}
+		if (hint_table[i] == NULL) /* no elements in list */
+			die("Why did this happen?");
+	}
+
+	prev = hint_table[i];
+	while (prev->next && prev->next->pl_size < (*pll)->pl_size)
+		prev = prev->next;
+
+	(*pll)->next = prev->next;
+	prev->next = *pll;
+}
+
 /* all the permutations have to be free()d at the same time,
  * since they refer to each other
  */
 struct pll * get_all_permutations(struct pack_list *list)
 {
 	struct pll *subset, *pll, *new_pll = NULL; /*silence warning*/
-
+	static struct pll **hint = NULL;
+	if (hint == NULL)
+		hint = xcalloc(pack_list_size(list), sizeof(struct pll *));
+		
 	if (list == NULL)
 		return NULL;
 
 	if (list->next == NULL) {
 		new_pll = xmalloc(sizeof(struct pll));
+		hint[0] = new_pll;
 		new_pll->next = NULL;
 		new_pll->pl = list;
 		return new_pll;
@@ -268,24 +296,30 @@ struct pll * get_all_permutations(struct
 
 	pll = subset = get_all_permutations(list->next);
 	while (pll) {
+		if (pll->pl->pack == list->pack) {
+			pll = pll->next;
+			continue;
+		}
 		new_pll = xmalloc(sizeof(struct pll));
-		new_pll->next = pll->next;
-		pll->next = new_pll;
 
 		new_pll->pl = xmalloc(sizeof(struct pack_list));
 		memcpy(new_pll->pl, list, sizeof(struct pack_list));
 		new_pll->pl->next = pll->pl;
+		new_pll->pl_size = pll->pl_size + 1;
+		
+		pll_insert(&new_pll, hint);
+
+		pll = pll->next;
+	}
+	/* add ourself */
+	new_pll = xmalloc(sizeof(struct pll));
+	new_pll->pl = xmalloc(sizeof(struct pack_list));
+	memcpy(new_pll->pl, list, sizeof(struct pack_list));
+	new_pll->pl->next = NULL;
+	new_pll->pl_size = 1;
+	pll_insert(&new_pll, hint);
 
-		pll = new_pll->next;
-	}
-	/* add ourself to the end */
-	new_pll->next = xmalloc(sizeof(struct pll));
-	new_pll->next->pl = xmalloc(sizeof(struct pack_list));
-	new_pll->next->next = NULL;
-	memcpy(new_pll->next->pl, list, sizeof(struct pack_list));
-	new_pll->next->pl->next = NULL;
-
-	return subset;
+	return hint[0];
 }
 
 int is_superset(struct pack_list *pl, struct llist *list)
@@ -401,6 +435,8 @@ void minimize(struct pack_list **min)
 	/* find the permutations which contain all missing objects */
 	perm_all = perm = get_all_permutations(non_unique);
 	while (perm) {
+		if (perm_ok && perm->pl_size > perm_ok->pl_size)
+			break; /* ignore all larger permutations */
 		if (is_superset(perm->pl, missing)) {
 			new_perm = xmalloc(sizeof(struct pll));
 			new_perm->pl = perm->pl;
---
0.99.9.GIT

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: fix git-pack-redundant crashing sometimes
  2005-11-16 23:59                 ` Lukas Sandström
@ 2005-11-17 16:56                   ` Matthias Urlichs
  0 siblings, 0 replies; 25+ messages in thread
From: Matthias Urlichs @ 2005-11-17 16:56 UTC (permalink / raw)
  To: git

Hi, Lukas Sandström wrote:

> After doing some maths; actually it is very exponential. 2**n - 1 to be precise.

In the general case, yes. However, you can just sort the packs by date
and sequentially drop all that can be dropped. That's not optimal in the
general case, but for us it should be close enough.

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
When was the last time you were drunk?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs
  2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
@ 2005-11-17 20:39               ` Alex Riesen
  2005-11-18 16:30               ` [PATCH] Fix bug introduced by the latest changes to git-pack-redundant Lukas Sandström
  2005-11-18 21:53               ` [PATCH] Fix a bug in get_all_permutations Lukas Sandström
  2 siblings, 0 replies; 25+ messages in thread
From: Alex Riesen @ 2005-11-17 20:39 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, junkio

Lukas Sandström, Thu, Nov 17, 2005 14:11:56 +0100:
> Change the smallest-set detection algortithm so that when
> we have found a good set, we don't check any larger sets.

I used latest git (a575603af20a795584d79b32c30bda10fcae5d3f Merge
branch 'tojunio' of http://locke.catalyst.net.nz/git/git-martinlanghoff)
and the patch on top.

Alex Riesen, Thu, Nov 17, 2005 08:45:46 +0100:
> > Could you try updating to the latest snapshot? There was a bug in a
> > list handling function which was fixed recently, perhaps your problem
> > is related.
> 
> will try, but I didn't realize yesterday that it might be a good idea
> to keep the old repository around. The lot of packs was automatically
> created by incremental repacking after every pull. Sorry...

ok, pack-redundant --all in a repository with 25 packs completed in
1.346 sec. That's definitely an improvement.

But, the repository (a clone of Linus' kernel) didn't have a
superpack. I created one (git repack -a) and started
"pack-redundant --all". And almost oomed the system (it has 1Gb RAM +
512Mb(historically) swap, BTW): oom-killer killed git-pack-redundant.
Peak memory usage was around 1.3Gb, than everything stopped. The
repository had 140718 objects.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] Fix bug introduced by the latest changes to git-pack-redundant
  2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
  2005-11-17 20:39               ` Alex Riesen
@ 2005-11-18 16:30               ` Lukas Sandström
  2005-11-18 21:53               ` [PATCH] Fix a bug in get_all_permutations Lukas Sandström
  2 siblings, 0 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-18 16:30 UTC (permalink / raw)
  To: git; +Cc: Lukas Sandström, Alex Riesen, junkio

I forgot to initialize part of the pll struct when copying it.
Found by valgrind.

Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>
---

 pack-redundant.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

applies-to: 500482a7b3ebb2ebac182136696cb124332feba8
b51be59ab5b323ef148f833976a9c324d6a27404
diff --git a/pack-redundant.c b/pack-redundant.c
index 2d7183e..3123f45 100644
--- a/pack-redundant.c
+++ b/pack-redundant.c
@@ -442,7 +442,7 @@ void minimize(struct pack_list **min)
 			break; /* ignore all larger permutations */
 		if (is_superset(perm->pl, missing)) {
 			new_perm = xmalloc(sizeof(struct pll));
-			new_perm->pl = perm->pl;
+			memcpy(new_perm, perm, sizeof(struct pll));
 			new_perm->next = perm_ok;
 			perm_ok = new_perm;
 		}
---
0.99.9.GIT

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] Fix a bug in get_all_permutations.
  2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
  2005-11-17 20:39               ` Alex Riesen
  2005-11-18 16:30               ` [PATCH] Fix bug introduced by the latest changes to git-pack-redundant Lukas Sandström
@ 2005-11-18 21:53               ` Lukas Sandström
  2 siblings, 0 replies; 25+ messages in thread
From: Lukas Sandström @ 2005-11-18 21:53 UTC (permalink / raw)
  To: junkio; +Cc: Alex Riesen, git, Lukas Sandström

This line was missing in the previous patch for some reason.

Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>

---

 pack-redundant.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

applies-to: 5b820842ce0afb836ddbeded3a3f9e8d0ba223f9
e65d77b8ece9f6e463de0ceeaf6812a2df6d7d96
diff --git a/pack-redundant.c b/pack-redundant.c
index 3655609..3e3f33a 100644
--- a/pack-redundant.c
+++ b/pack-redundant.c
@@ -291,6 +291,7 @@ struct pll * get_all_permutations(struct
 		hint[0] = new_pll;
 		new_pll->next = NULL;
 		new_pll->pl = list;
+		new_pll->pl_size = 1;
 		return new_pll;
 	}
 
---
0.99.9.GIT

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2005-11-18 21:53 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-16 23:09 git-pack-redundant returns the most containing pack Alex Riesen
2005-11-16 23:23 ` Lukas Sandström
2005-11-17  7:45   ` Alex Riesen
2005-11-15 15:49     ` fix git-pack-redundant crashing sometimes Alex Riesen
2005-11-15 16:08       ` Timo Hirvonen
2005-11-15 16:11         ` Alex Riesen
2005-11-15 17:28           ` Linus Torvalds
2005-11-15 21:38         ` Lukas Sandström
2005-11-15 21:24       ` [PATCH] Fix llist_sorted_difference_inplace in git-pack-redundant Lukas Sandström
2005-11-15 21:34       ` fix git-pack-redundant crashing sometimes Alex Riesen
2005-11-15 21:41         ` Lukas Sandström
2005-11-15 22:33           ` Alex Riesen
2005-11-15 23:13             ` Lukas Sandström
2005-11-16  7:01               ` Alex Riesen
2005-11-16 21:11               ` Alex Riesen
2005-11-15 23:58             ` Linus Torvalds
2005-11-16 20:13               ` Junio C Hamano
2005-11-16 21:37               ` Lukas Sandström
2005-11-16 23:59                 ` Lukas Sandström
2005-11-17 16:56                   ` Matthias Urlichs
2005-11-17  7:08                 ` Fredrik Kuivinen
2005-11-17 13:11             ` [PATCH] Make git-pack-redundant non-horribly slow on large sets of packs Lukas Sandström
2005-11-17 20:39               ` Alex Riesen
2005-11-18 16:30               ` [PATCH] Fix bug introduced by the latest changes to git-pack-redundant Lukas Sandström
2005-11-18 21:53               ` [PATCH] Fix a bug in get_all_permutations Lukas Sandström

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).