* Compressing packed-refs
@ 2020-07-16 22:10 Konstantin Ryabitsev
  2020-07-16 22:27 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-07-16 22:10 UTC (permalink / raw)
  To: git

Hi, all:

I know repos with too many refs is a corner-case for most people, but 
it's looming large in my world, so I'm wondering if it makes sense to 
compress the packed-refs file when "git pack-refs" is performed?

What would the implications be, other than minor performance degradation 
when reading it?

-K


* Re: Compressing packed-refs
  2020-07-16 22:10 Compressing packed-refs Konstantin Ryabitsev
@ 2020-07-16 22:27 ` Junio C Hamano
  2020-07-16 22:54   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-07-16 22:27 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

> I know repos with too many refs is a corner-case for most people, but 
> it's looming large in my world, so I'm wondering if it makes sense to 
> compress the packed-refs file when "git pack-refs" is performed?

I think the reftable is the longer term direction, but let's see if
there is easy enough optimization opportunity that we can afford the
development and maintenance cost for the short term.

My .git/packed-refs file begins like so:

    # pack-refs with: peeled fully-peeled sorted 
    c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
    3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
    584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
    2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
    8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
    ...

A few observations that can lead to easy design elements are

 - Typically more than half of each record is consumed by the
   object name, which is hard to "compress".

 - The file is sorted, so it could use prefix compression like
   we do in the v4 index files.

So perhaps a new format could be

 - The header "# pack-refs with: " lists a new trait, "compressed";

 - Object names will be expressed in binary, saving 20 bytes per
   record;

 - Prefix compression of the refnames similar to v4 index would save
   a bit more.

Storing binary object names would actually be favourable for
performance, as the in-core data structure we use to store the
result of parsing the file uses binary.
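
For a rough idea of what the prefix compression alone would buy, a
back-of-the-envelope measurement (not a format proposal) against an
existing packed-refs file could be something like:

    # bytes each refname shares with the refname on the previous line;
    # the "sorted" trait guarantees that neighbouring records share
    # their prefixes
    grep -v '^[#^]' .git/packed-refs | cut -d' ' -f2- |
    awk '{
        n = 0
        while (n < length($0) && substr($0, n + 1, 1) == substr(prev, n + 1, 1))
            n++
        saved += n
        prev = $0
    }
    END { print saved " refname bytes shared with the previous record" }'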




* Re: Compressing packed-refs
  2020-07-16 22:27 ` Junio C Hamano
@ 2020-07-16 22:54   ` Konstantin Ryabitsev
  2020-07-17  6:27     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-07-16 22:54 UTC (permalink / raw)
  To: Junio C Hamano, git

On Thu, Jul 16, 2020 at 03:27:15PM -0700, Junio C Hamano wrote:
> I think the reftable is the longer term direction, but let's see if
> there is easy enough optimization opportunity that we can afford the
> development and maintenance cost for the short term.
> 
> My .git/packed-refs file begins like so:
> 
>     # pack-refs with: peeled fully-peeled sorted 
>     c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
>     3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
>     584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
>     2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
>     8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
>     ...

Let me offer a more special-case (but not crazy) example from 
git.kernel.org. The newer version of grokmirror that I'm working on is 
built to take advantage of the pack-islands feature that was added a 
while back. We fetch all linux forks into a single "object storage" 
repo, with each fork going into its own 
refs/virtual/[uniquename]/(heads|tags) place. This means there's lots of 
duplicates in packed-refs, as all the tags from torvalds/linux.git will 
end up duplicated in almost every fork.

So, after running git pack-refs --all, the packed-refs file is 50-ish MB
in size, with a lot of the same stuff, like:

5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00018460b026/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a
...
5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00bcef8138af/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a

etc, duplicated 600 times with each fork. It compresses decently well 
with gzip -9, and *amazingly* well with xz -9:

$ ls -ahl packed-refs
-rw-r--r--. 1 mirror mirror 46M Jul 16 22:37 packed-refs
$ ls -ahl packed-refs.gz
-rw-r--r--. 1 mirror mirror 19M Jul 16 22:47 packed-refs.gz
$ ls -ahl packed-refs.xz
-rw-r--r--. 1 mirror mirror 2.3M Jul 16 22:47 packed-refs.xz
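
A crude way to see the same thing without compressing anything is to
collapse the per-fork part of the refnames (assuming everything lives
under refs/virtual/<uniquename>/ as described above) and count what is
left:

$ wc -l packed-refs
$ sed 's|refs/virtual/[^/]*/|refs/virtual/*/|' packed-refs | sort -u | wc -l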

Which really just indicates how much duplicated data is in that file. If 
reftables will eventually replace refs entirely, then we probably 
shouldn't expend too much effort super-optimizing it, especially if I'm 
one of the very few people who would benefit from it. However, I'm 
curious if a different sorting strategy would help remove most of the 
duplication without requiring too much engineering time.

-K


* Re: Compressing packed-refs
  2020-07-16 22:54   ` Konstantin Ryabitsev
@ 2020-07-17  6:27     ` Jeff King
  2020-07-18 18:26       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-07-17  6:27 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Junio C Hamano, git

On Thu, Jul 16, 2020 at 06:54:29PM -0400, Konstantin Ryabitsev wrote:

> Let me offer a more special-case (but not crazy) example from 
> git.kernel.org. The newer version of grokmirror that I'm working on is 
> built to take advantage of the pack-islands feature that was added a 
> while back. We fetch all linux forks into a single "object storage" 
> repo, with each fork going into its own 
> refs/virtual/[uniquename]/(heads|tags) place. This means there's lots of 
> duplicates in packed-refs, as all the tags from torvalds/linux.git will 
> end up duplicated in almost every fork.
> 
> So, after running git pack-refs --all, the packed-refs file is 50-ish MB
> in size, with a lot of the same stuff, like:

We do the same thing at GitHub. 50MB is on the easy side. We have cases
in the gigabytes.

> etc, duplicated 600 times with each fork. It compresses decently well 
> with gzip -9, and *amazingly* well with xz -9:
> 
> $ ls -ahl packed-refs
> -rw-r--r--. 1 mirror mirror 46M Jul 16 22:37 packed-refs
> $ ls -ahl packed-refs.gz
> -rw-r--r--. 1 mirror mirror 19M Jul 16 22:47 packed-refs.gz
> $ ls -ahl packed-refs.xz
> -rw-r--r--. 1 mirror mirror 2.3M Jul 16 22:47 packed-refs.xz

Yes, it does compress well. Just gzipping like that would have worked
once upon a time, when accessing it meant reading the whole thing
linearly. These days, though, we mmap() the file and binary-search it.
That lets us examine a subset of the refs quickly (this is from our
torvalds/linux.git fork network):


  $ wc -c packed-refs 
  2394623761 packed-refs

  $ time git for-each-ref | wc -l
  19552978
  
  real	1m12.297s
  user	1m2.441s
  sys	0m10.235s

  $ time git for-each-ref refs/remotes/2325298 | wc -l
  2590
  
  real	0m0.077s
  user	0m0.025s
  sys	0m0.055s

> Which really just indicates how much duplicated data is in that file. If 
> reftables will eventually replace refs entirely, then we probably 
> shouldn't expend too much effort super-optimizing it, especially if I'm 
> one of the very few people who would benefit from it. However, I'm 
> curious if a different sorting strategy would help remove most of the 
> duplication without requiring too much engineering time.

You definitely could store it in a more efficient way. Reftables will
have most of the things you'd want: prefix compression, binary oids,
etc.  I wouldn't be opposed to a tweak to packed-refs in the meantime if
it was simple to implement. But definitely we'd want to retain the
ability to find a subset of refs in sub-linear time. That might get
tricky and push it from "simple" to "let's just invest in reftable".

You might also consider whether you need all of those refs at all in the
object storage repo. The main uses are:

  - determining reachability during repacks; but you could generate this
    on the fly from the refs in the individual forks (de-duplicating as
    you go). We don't do this at GitHub, because the information in the
    duplicates is useful to our delta-islands config.

  - getting new objects into the object store. It sounds like you might
    do this with "git fetch", which does need up-to-date refs. We used
    to do that, too, but it can be quite slow. These days we migrate the
    objects directly via hardlinks, and then use "update-ref --stdin" to
    sync the refs into the shared storage repo.

  - advertising alternate ref tips in receive-pack (i.e., saying "we
    already know about object X" if it's in somebody else's fork, which
    means people pulling from Linus and then pushing to their fork don't
    have to send the objects again). You probably don't want to
    advertise all of them (just sifting the duplicates is too
    expensive). We use core.alternateRefsCommand to pick out just the
    ones from the parent fork. We _do_ still use the copy of the refs in
    our shared storage, not the ones in the actual fork. But that's
    because we migrate objects to shared storage asynchronously (so it's
    possible for one fork to have refs pointing to objects that aren't
    yet available to the other forks).

So it's definitely not a no-brainer, but possibly something to explore.

-Peff


* Re: Compressing packed-refs
  2020-07-17  6:27     ` Jeff King
@ 2020-07-18 18:26       ` Konstantin Ryabitsev
  2020-07-20 17:32         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-07-18 18:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Fri, Jul 17, 2020 at 02:27:23AM -0400, Jeff King wrote:
> > Which really just indicates how much duplicated data is in that 
> > file. If reftables will eventually replace refs entirely, then we 
> > probably shouldn't expend too much effort super-optimizing it, 
> > especially if I'm one of the very few people who would benefit from 
> > it. However, I'm curious if a different sorting strategy would help 
> > remove most of the duplication without requiring too much 
> > engineering time.
> 
> You definitely could store it in a more efficient way. Reftables will
> have most of the things you'd want: prefix compression, binary oids,
> etc.  I wouldn't be opposed to a tweak to packed-refs in the meantime if
> it was simple to implement. But definitely we'd want to retain the
> ability to find a subset of refs in sub-linear time. That might get
> tricky and push it from "simple" to "let's just invest in reftable".

I'm fine either way, but the better approach seems to be waiting for
reftables to land.

> You might also consider whether you need all of those refs at all in the
> object storage repo. The main uses are:
> 
>   - determining reachability during repacks; but you could generate this
>     on the fly from the refs in the individual forks (de-duplicating as
>     you go). We don't do this at GitHub, because the information in the
>     duplicates is useful to our delta-islands config.
>   - getting new objects into the object store. It sounds like you might
>     do this with "git fetch", which does need up-to-date refs. We used
>     to do that, too, but it can be quite slow. These days we migrate the
>     objects directly via hardlinks, and then use "update-ref --stdin" to
>     sync the refs into the shared storage repo.

This is definitely interesting to me, as "git fetch" runs into objstore
repos do take a long time, even after I move them into a "lazy" thread.
It's not so much a problem for git.kernel.org where pushes come in 
sporadically, but for CAF, their automation usually pushes several 
hundred repo updates at the same time, and the subsequent fetch into 
objstore takes several hours to complete.

Can you elaborate on the details of that operation, if it's not secret 
sauce? Say, I have two repos:

repoA/objects/
repoS/objects/

does this properly describe the operation:

1. locate all pack/* and XX/* files in repoA/objects (what about the 
   info/packs file, or do you loosen all packs first?)
2. hardlink them into the same location in repoS/objects
3. use git-show-ref from repoA to generate stdin for git-update-ref in 
   repoS
4. Subsequent runs of repack in repoA should unreference the hardlinked
   files in repoA/objects and leave only their copy in repoS

I'm not sure I'm quite comfortable doing this kind of spinal surgery on 
git repos yet, but I'm willing to wet my feet in some safe environments.  
:)

>   - advertising alternate ref tips in receive-pack (i.e., saying "we
>     already know about object X" if it's in somebody else's fork, which
>     means people pulling from Linus and then pushing to their fork don't
>     have to send the objects again). You probably don't want to
>     advertise all of them (just sifting the duplicates is too
>     expensive). We use core.alternateRefsCommand to pick out just the
>     ones from the parent fork. We _do_ still use the copy of the refs in
>     our shared storage, not the ones in the actual fork. But that's
>     because we migrate objects to shared storage asynchronously (so it's
>     possible for one fork to have refs pointing to objects that aren't
>     yet available to the other forks).

Yes, I did ponder using this, especially when dealing with objstore 
repos with hundreds of thousands of refs -- thanks for another nudge in 
this direction. I am planning to add a concept of indicating "baseline" 
repos to grokmirror, which allows us to:

1. set them as islandCore in objstore repositories
2. return only their refs via alternateRefsCommand

This one seems fairly straightforward and I will probably add that in 
next week.
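
For reference, the delta-island side of that would presumably look
something like this in the objstore repo config (patterned on the
fork-network example in the delta-islands documentation; the islandCore
value is only a placeholder for whatever the capture group yields for
the baseline fork):

[pack]
	island = refs/virtual/([^/]+)/heads/
	island = refs/virtual/([^/]+)/tags/
	# the island to pack first -- the baseline fork's unique name
	islandCore = <uniquename-of-baseline-fork>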

> So it's definitely not a no-brainer, but possibly something to 
> explore.

Indeed -- at this point I'm more comfortable letting git itself do all 
object moving, but as I get more familiar with how object-storage repos 
work, I can optimize various aspects of it.

Thanks for your help!

-K


* Re: Compressing packed-refs
  2020-07-18 18:26       ` Konstantin Ryabitsev
@ 2020-07-20 17:32         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2020-07-20 17:32 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Junio C Hamano, git

On Sat, Jul 18, 2020 at 02:26:18PM -0400, Konstantin Ryabitsev wrote:

> >   - getting new objects into the object store. It sounds like you might
> >     do this with "git fetch", which does need up-to-date refs. We used
> >     to do that, too, but it can be quite slow. These days we migrate the
> >     objects directly via hardlinks, and then use "update-ref --stdin" to
> >     sync the refs into the shared storage repo.
> [...]
> Can you elaborate on the details of that operation, if it's not secret 
> sauce? Say, I have two repos:

No secret sauce, but it's pretty much what you wrote out. A few
comments:

> 1. locate all pack/* and XX/* files in repoA/objects (what about the 
>    info/packs file, or do you loosen all packs first?)

We only copy the pack and loose object files. We don't generate
info/packs at all, since we don't allow dumb-http access. Nor do we copy
any commit-graph files over (those are generated only in the shared
storage repo, and then every fork gets to use them).

Definitely don't loosen packs. It's very expensive. :)

> 2. hardlink them into the same location in repoS/objects

Yep. And now they're available atomically in both places.
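
Roughly like this (a sketch with made-up paths, run from the directory
holding both bare repos; the real thing has more error handling, and of
course hardlinking only works if both repos are on the same
filesystem):

  (
    cd repoA/objects &&
    {
      find pack -type f \( -name '*.pack' -o -name '*.idx' \)
      find ?? -type f 2>/dev/null   # loose objects, if any
    }
  ) |
  while read -r f; do
    mkdir -p "repoS/objects/${f%/*}"
    ln -f "repoA/objects/$f" "repoS/objects/$f"
  done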

> 3. use git-show-ref from repoA to generate stdin for git-update-ref in 
>    repoS

Use for-each-ref for this. It's received more optimizations over the
years (especially around looking at the minimum of packed-refs when it
can). Don't forget to delete refs that have gone away. We do something
like (typed in email, so watch out for errors):

  id=123
  git --git-dir=repoA for-each-ref \
    --format="%(objectname) refs/remotes/$id/%(refname)" >want
  git --git-dir=repoS for-each-ref \
    --format="%(objectname) %(refname)" refs/remotes/$id/ >have

and then compare the results (our code is in ruby using hashes, but you
could do it with comm or similar). You should end up with a set of
updates and deletions, which you can feed to "git update-ref --stdin"
(which is smart enough to do deletions before additions to save you from
directory/file conflicts in the namespace).
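
A comm version of that comparison might look something like (same
caveat, typed in email):

  cut -d' ' -f2 want | sort >want.refs
  cut -d' ' -f2 have | sort >have.refs
  {
    # refs we have but no longer want get deleted...
    comm -13 want.refs have.refs | sed 's/^/delete /'
    # ...and everything we want gets created or updated
    awk '{ print "update", $2, $1 }' want
  } | git --git-dir=repoS update-ref --stdin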

(There's no particular reason you need to use refs/remotes/ in the
shared repo; for us it's just historical, since we really did configure
remotes for each fork many many years ago).

> 4. Subsequent runs of repack in repoA should unreference the hardlinked
>    files in repoA/objects and leave only their copy in repoS

Yeah, I think it would do so, but we just unlink them immediately.

> I'm not sure I'm quite comfortable doing this kind of spinal surgery on 
> git repos yet, but I'm willing to wet my feet in some safe environments.  
> :)

We resisted it for a long time, too, because I didn't want to violate
any of Git's assumptions. But the cost of fetches was just getting too
high (especially because we queue a sync job after every push, and some
users like to push a _lot_).

> Yes, I did ponder using this, especially when dealing with objstore 
> repos with hundreds of thousands of refs -- thanks for another nudge in 
> this direction. I am planning to add a concept of indicating "baseline" 
> repos to grokmirror, which allows us to:
> 
> 1. set them as islandCore in objstore repositories
> 2. return only their refs via alternateRefsCommand
> 
> This one seems fairly straightforward and I will probably add that in 
> next week.

Yeah, it is. Our alternateRefsCommand is a script that basically does:

  # receive parent id info out-of-band in environment; if it's not
  # there, then show no alternate tips
  test -z "$parent_repo_id" && exit 0

  git --git-dir="$1" for-each-ref \
    --format='%(objectname)' refs/remotes/$parent_repo_id/heads/
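
Wiring that up is then just a matter of pointing each fork at the
script (path made up):

  git config core.alternateRefsCommand /usr/local/bin/advertise-parent-refs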

Note that we only advertise "heads/" from the fork, and ignore the tags.
I don't know that we did a very rigorous study, but our general finding
was that tags don't often help much, and do clutter up the response for
some repos (again, some users think that 50,000 tags is reasonable).

-Peff


Thread overview: 6+ messages
2020-07-16 22:10 Compressing packed-refs Konstantin Ryabitsev
2020-07-16 22:27 ` Junio C Hamano
2020-07-16 22:54   ` Konstantin Ryabitsev
2020-07-17  6:27     ` Jeff King
2020-07-18 18:26       ` Konstantin Ryabitsev
2020-07-20 17:32         ` Jeff King
