* with reuse-delta patches, fetching with bitmaps segfaults due to possibly incomplete bitmap traverse
@ 2014-03-22  2:58 Siddharth Agarwal
  2014-03-22 12:56 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Siddharth Agarwal @ 2014-03-22  2:58 UTC (permalink / raw)
  To: git; +Cc: bmaurer, Jeff King, Aaron Kushner

Hi all,

At Facebook we've found that fetch speed is a bottleneck for our Git 
repos, so we've been looking to deploy bitmaps to speed up fetches. 
We've been trying out git-next with the top two patches from 
https://github.com/peff/git/commits/jk/bitmap-reuse-delta, but the 
following is reproducible with tip of that branch, currently 81cdec2.

We're finding that git fetches with bitmaps enabled are malfunctioning 
under some circumstances. For at least one of our repositories, a 
particular git fetch that an engineer ran segfaulted in the remote git 
pack-objects because it wrote to an array out of bounds. This is 
consistently reproducible with the particular pair of (remote, local) 
repositories.

The error looks like:

(1) $ git fetch
remote: Counting objects: 201614, done.
remote: Compressing objects: 100% (36262/36262), done.
error: pack-objects died of signal 11
error: git upload-pack: git-pack-objects died with error.
fatal: git upload-pack: aborting due to possible repository corruption 
on the remote side.
remote: aborting due to possible repository corruption on the remote side.
fatal: protocol error: bad pack header

In contrast, when I disable bitmaps by removing the .bitmap file on the 
server while using the same Git binary, I get

(2) $ git fetch
remote: Counting objects: 203466, done.
remote: Compressing objects: 100% (46014/46104), done.
Receiving objects: 100% (203466/203466), 55.52 MiB | 7.45 MiB/s, done.
remote: Total 203466 (delta 169894), reused 187613 (delta 156621)

Note the differences in the "Counting objects" and "Compressing objects" 
figures between (1) and (2). I'm not sure if they're relevant.

As a baseline, if I run the same test with an older git -- version 
1.8.1, I get

(3) $ git fetch
remote: Counting objects: 235298, done.
remote: Compressing objects: 100% (46104/46104), done.
remote: Total 203466 (delta 169894), reused 187613 (delta 156621)
Receiving objects: 100% (203466/203466), 55.52 MiB | 11.15 MiB/s, done.

As a control, I'm using the same version of Git on the client in all 
three cases above -- git 1.8.1. The client Git doesn't matter -- using 
81cdec2 has the same effect. The transport is ssh in all three cases.

I dug into this a bit and it looks like at this point:

https://github.com/peff/git/blob/81cdec28fa24fdc613ab7c3406c1c67975dbf22f/builtin/pack-objects.c#L700

at some object that add_family_to_write_order is called for, wo_end 
exceeds to_pack.nr_objects by over 1000 objects. More precisely, at the 
point it crashes, wo_end is 218081 while to_pack.nr_objects is 201614. 
(This means wo_end overshot to_pack.nr_objects some time ago.)

I bumped up the malloc at 
https://github.com/peff/git/blob/81cdec28fa24fdc613ab7c3406c1c67975dbf22f/builtin/pack-objects.c#L628 
in order to prevent segfaulting, and the remote process dies with:

(4) $ git fetch
remote: Counting objects: 201614, done.
remote: Compressing objects: 100% (36262/36262), done.
remote: fatal: ordered 226227 objects, expected 201614
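
For reference, "bumping up the malloc" here means padding the write-order 
array so the stray writes land in owned memory -- roughly the following 
(the padding amount is arbitrary and purely a debugging hack). With that 
in place, it is the sanity check at the end of compute_write_order that 
produces the message in (4):

        /* was: xmalloc(to_pack.nr_objects * sizeof(*wo)) */
        struct object_entry **wo =
                xmalloc((to_pack.nr_objects + 65536) * sizeof(*wo));
        ...
        /* at the end of compute_write_order() */
        if (wo_end != to_pack.nr_objects)
                die("ordered %u objects, expected %"PRIu32,
                    wo_end, to_pack.nr_objects);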

In contrast, in case (2) above, at the end of compute_write_order, wo_end 
and to_pack.nr_objects are both 235298. I found it interesting that this 
number does not appear anywhere in the output of (2) but matches the 
"Counting objects" figure in (3). I'm not sure if this is a red herring 
or not.

I suspect that the traverse_bitmap_commit_list call at 
https://github.com/peff/git/blob/81cdec28fa24fdc613ab7c3406c1c67975dbf22f/builtin/pack-objects.c#L2476 
is somehow skipping objects.

Other notes:
- I cannot reproduce this with a plain old 'git clone <remote>' with 
bitmaps enabled and used on the remote. There's something particular 
about either thin packs or the client repository that's triggering this.
- There is exactly one pack containing slightly over 3.5 million 
objects, and three loose objects in the remote repo.
- The remote repo is isolated -- there are no concurrent writes going on.
- While generating the bitmap I did not reuse any existing bitmaps: I 
removed the existing bitmap and ran git repack -adb.

Unfortunately I do not have a reproducible test case with a repository 
that's open source, or with one that I can share. However, I'm happy to 
provide any other information about the structure of the repository, and 
to set up a debugging session over IRC or by other means.


* Re: with reuse-delta patches, fetching with bitmaps segfaults due to possibly incomplete bitmap traverse
  2014-03-22  2:58 with reuse-delta patches, fetching with bitmaps segfaults due to possibly incomplete bitmap traverse Siddharth Agarwal
@ 2014-03-22 12:56 ` Jeff King
  2014-03-24  0:01   ` Siddharth Agarwal
  2014-03-24 20:30   ` Junio C Hamano
  0 siblings, 2 replies; 4+ messages in thread
From: Jeff King @ 2014-03-22 12:56 UTC (permalink / raw)
  To: Siddharth Agarwal; +Cc: git, bmaurer, Aaron Kushner

On Fri, Mar 21, 2014 at 07:58:55PM -0700, Siddharth Agarwal wrote:

> At Facebook we've found that fetch speed is a bottleneck for our Git repos,
> so we've been looking to deploy bitmaps to speed up fetches. We've been
> trying out git-next with the top two patches from
> https://github.com/peff/git/commits/jk/bitmap-reuse-delta, but the following
> is reproducible with tip of that branch, currently 81cdec2.

Is it also reproducible just with the tip of "next"? Note that the
patches in jk/bitmap-reuse-delta have not been widely deployed (in
particular, we are not yet using them at GitHub, and we track segfaults
on our servers closely and have not seen any related to this).

Those patches allocate extra "fake" entries in the entry->delta fields,
which are not accounted for in to_pack.nr_objects. It's entirely
possible that those entries are related to the bug you are seeing.

> I dug into this a bit and it looks like at this point:
> 
> https://github.com/peff/git/blob/81cdec28fa24fdc613ab7c3406c1c67975dbf22f/builtin/pack-objects.c#L700
> 
> at some object that add_family_to_write_order is called for, wo_end exceeds
> to_pack.nr_objects by over 1000 objects. More precisely, at the point it
> crashes, wo_end is 218081 while to_pack.nr_objects is 201614. (This means
> wo_end overshot to_pack.nr_objects some time ago.)

Hmm, yeah, that confirms my suspicion. In the earlier loops, we call
add_to_write_order, which only adds the object in question, and can
never exceed to_pack.nr_objects. In this final loop, we call
add_family_to_write_order, which is going to add any deltas that were
not already included.
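
To make the mechanism concrete, here is a stripped-down illustration (the 
structs are simplified stand-ins, not the real pack-objects structures): a 
"fake" delta base that is reachable through entry->delta but not counted 
in nr_objects gets pulled in by the family walk and pushes the write-order 
cursor past the end of the array.

#include <assert.h>
#include <stdlib.h>

/* simplified stand-ins for the pack-objects structures */
struct object_entry {
        struct object_entry *delta;     /* delta base; may be a "fake" entry */
        unsigned filled : 1;
        unsigned preferred_base : 1;
};

/* append once and mark as filled -- the shape of add_to_write_order() */
static void add_to_write_order(struct object_entry **wo, unsigned int *endp,
                               unsigned int nr_objects, struct object_entry *e)
{
        if (e->filled)
                return;
        assert(*endp < nr_objects);     /* the bound that gets blown */
        wo[(*endp)++] = e;
        e->filled = 1;
}

/* walk the delta chain, adding anything not yet filled -- roughly what
 * add_family_to_write_order() does through its helpers */
static void add_family_to_write_order(struct object_entry **wo,
                                      unsigned int *endp,
                                      unsigned int nr_objects,
                                      struct object_entry *e)
{
        for (; e; e = e->delta)
                add_to_write_order(wo, endp, nr_objects, e);
}

int main(void)
{
        unsigned int nr_objects = 1, wo_end = 0;
        struct object_entry real = { 0 };
        struct object_entry fake = { 0 };  /* the extra entry, not in nr_objects */
        struct object_entry **wo = calloc(nr_objects, sizeof(*wo));

        fake.preferred_base = 1;        /* marked preferred, but not filled */
        real.delta = &fake;

        /* adds "real" (wo_end becomes 1), then trips the assert on "fake";
         * marking the fake entry as filled would make the walk skip it */
        add_family_to_write_order(wo, &wo_end, nr_objects, &real);

        free(wo);
        return 0;
}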

The patch below may fix your problem, but I have a feeling it is not the
right thing to do. The point of 81cdec28 is to try to point to a delta
entry as if it were a "preferred base" (i.e., something we know that the
other side has already). We perhaps want to add these entries to the
actual packing list, and skip them as we do with normal preferred_base
objects.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9fc5321..ca1b0f7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1437,6 +1437,7 @@ static void check_object(struct object_entry *entry)
 			entry->delta = xcalloc(1, sizeof(*entry->delta));
 			hashcpy(entry->delta->idx.sha1, base_ref);
 			entry->delta->preferred_base = 1;
+			entry->delta->filled = 1;
 			unuse_pack(&w_curs);
 			return;
 		}

-Peff


* Re: with reuse-delta patches, fetching with bitmaps segfaults due to possibly incomplete bitmap traverse
  2014-03-22 12:56 ` Jeff King
@ 2014-03-24  0:01   ` Siddharth Agarwal
  2014-03-24 20:30   ` Junio C Hamano
  1 sibling, 0 replies; 4+ messages in thread
From: Siddharth Agarwal @ 2014-03-24  0:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git, bmaurer, Aaron Kushner

On 03/22/2014 05:56 AM, Jeff King wrote:
> On Fri, Mar 21, 2014 at 07:58:55PM -0700, Siddharth Agarwal wrote:
>
> Is it also reproducible just with the tip of "next"? Note that the
> patches in jk/bitmap-reuse-delta have not been widely deployed (in
> particular, we are not yet using them at GitHub, and we track segfaults
> on our servers closely and have not seen any related to this).

I cannot reproduce this with the tip of next (tested with 4443bfd). 
That's also -- unsurprisingly -- significantly slower in the 
compression phase and sends much more data (3x for the pair of repos in 
the OP) over the wire than a Git that doesn't use bitmaps.

> Those patches allocate extra "fake" entries in the entry->delta fields,
> which are not accounted for in to_pack.nr_objects. It's entirely
> possible that those entries are related to the bug you are seeing.

That sounds like it could be the problem, yes.

> Hmm, yeah, that confirms my suspicion. In the earlier loops, we call
> add_to_write_order, which only adds the object in question, and can
> never exceed to_pack.nr_objects. In this final loop, we call
> add_family_to_write_order, which is going to add any deltas that were
> not already included.
>
> The patch below may fix your problem, but I have a feeling it is not the
> right thing to do. The point of 81cdec28 is to try to point to a delta
> entry as if it were a "preferred base" (i.e., something we know that the
> other side has already). We perhaps want to add these entries to the
> actual packing list, and skip them as we do with normal preferred_base
> objects.

The patch does stop Git from segfaulting. I know too little to judge its 
correctness, though.

>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 9fc5321..ca1b0f7 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1437,6 +1437,7 @@ static void check_object(struct object_entry *entry)
>   			entry->delta = xcalloc(1, sizeof(*entry->delta));
>   			hashcpy(entry->delta->idx.sha1, base_ref);
>   			entry->delta->preferred_base = 1;
> +			entry->delta->filled = 1;
>   			unuse_pack(&w_curs);
>   			return;
>   		}
>
> -Peff


* Re: with reuse-delta patches, fetching with bitmaps segfaults due to possibly incomplete bitmap traverse
  2014-03-22 12:56 ` Jeff King
  2014-03-24  0:01   ` Siddharth Agarwal
@ 2014-03-24 20:30   ` Junio C Hamano
  1 sibling, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2014-03-24 20:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Siddharth Agarwal, git, bmaurer, Aaron Kushner

Jeff King <peff@peff.net> writes:

> On Fri, Mar 21, 2014 at 07:58:55PM -0700, Siddharth Agarwal wrote:
>
>> At Facebook we've found that fetch speed is a bottleneck for our Git repos,
>> so we've been looking to deploy bitmaps to speed up fetches. We've been
>> trying out git-next with the top two patches from
>> https://github.com/peff/git/commits/jk/bitmap-reuse-delta, but the following
>> is reproducible with tip of that branch, currently 81cdec2.
>
> Is it also reproducible just with the tip of "next"? Note that the
> patches in jk/bitmap-reuse-delta have not been widely deployed (in
> particular, we are not yet using them at GitHub, and we track segfaults
> on our servers closely and have not seen any related to this).

Nice to hear.  I was worried for a short while that I had merged
something that was not cooked well, before I realized that Siddharth
is on a codebase that is more bleeding edge than I use myself ;-)


