* is there a fast web-interface to git for huge repos?
@ 2013-06-07  1:35 Constantine A. Murenin
  2013-06-07  6:33 ` Fredrik Gustafsson
  0 siblings, 1 reply; 8+ messages in thread
From: Constantine A. Murenin @ 2013-06-07  1:35 UTC (permalink / raw)
  To: git

Hi,

On a relatively-empty Intel Core i7 975 @ 3.33GHz (quad-core):

Cns# cd DragonFly/

Cns# time git log sys/sys/sockbuf.h >/dev/null
0.540u 0.140s 0:04.30 15.8%     0+0k 2754+55io 6484pf+0w
Cns# time git log sys/sys/sockbuf.h > /dev/null
0.000u 0.030s 0:00.52 5.7%      0+0k 0+0io 0pf+0w
Cns# time git log sys/sys/sockbuf.h > /dev/null
0.180u 0.020s 0:00.52 38.4%     0+0k 0+2io 0pf+0w
Cns# time git log sys/sys/sockbuf.h > /dev/null
0.420u 0.020s 0:00.52 84.6%     0+0k 0+0io 0pf+0w

And, right away, a semi-cold git-blame:

Cns# time git blame sys/sys/sockbuf.h >/dev/null
0.340u 0.040s 0:01.91 19.8%     0+0k 769+45io 2078pf+0w
Cns# time git blame sys/sys/sockbuf.h > /dev/null
0.340u 0.010s 0:00.36 97.2%     0+0k 0+2io 0pf+0w
Cns# time git blame sys/sys/sockbuf.h > /dev/null
0.310u 0.040s 0:00.36 97.2%     0+0k 0+0io 0pf+0w
Cns# time git blame sys/sys/sockbuf.h > /dev/null
0.310u 0.050s 0:00.36 100.0%    0+0k 0+0io 0pf+0w


I'm interested in running a web interface to this and other similar
git repositories (FreeBSD and NetBSD git repositories are even much,
much bigger).

Software-wise, is there no way to make cold access for git-log and
git-blame take orders of magnitude less than ~5s, and warm access
less than ~0.5s?

C.


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07  1:35 is there a fast web-interface to git for huge repos? Constantine A. Murenin
@ 2013-06-07  6:33 ` Fredrik Gustafsson
  2013-06-07 17:05   ` Constantine A. Murenin
  0 siblings, 1 reply; 8+ messages in thread
From: Fredrik Gustafsson @ 2013-06-07  6:33 UTC (permalink / raw)
  To: Constantine A. Murenin; +Cc: git

On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> I'm interested in running a web interface to this and other similar
> git repositories (FreeBSD and NetBSD git repositories are even much,
> much bigger).
> 
> Software-wise, is there no way to make cold access for git-log and
> git-blame take orders of magnitude less than ~5s, and warm access
> less than ~0.5s?

The obvious way would be to cache the results. You can even put an
update hook on the git repositories to keep the cache always up to
date.
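
Something along these lines in hooks/post-receive would be a starting
point (untested sketch; "regen-cache" is just a placeholder for whatever
script actually renders your log/blame pages, and the paths are made up):

#!/bin/sh
# hooks/post-receive -- refresh the page cache after every push
# (sketch; regen-cache and both paths are placeholders)
regen-cache /srv/git/dragonfly.git /srv/cache/dragonfly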

There are some dynamic web frontends like cgit and gitweb out there, but
there are also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
) that might be more of an option for you.

-- 
Best regards
Fredrik Gustafsson

tel: 0733-608274
email: iveqy@iveqy.com


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07  6:33 ` Fredrik Gustafsson
@ 2013-06-07 17:05   ` Constantine A. Murenin
  2013-06-07 17:57     ` Fredrik Gustafsson
  0 siblings, 1 reply; 8+ messages in thread
From: Constantine A. Murenin @ 2013-06-07 17:05 UTC (permalink / raw)
  To: Fredrik Gustafsson; +Cc: git

On 6 June 2013 23:33, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> I'm interested in running a web interface to this and other similar
>> git repositories (FreeBSD and NetBSD git repositories are even much,
>> much bigger).
>>
>> Software-wise, is there no way to make cold access for git-log and
>> git-blame take orders of magnitude less than ~5s, and warm access
>> less than ~0.5s?
>
> The obvious way would be to cache the results. You can even put an

That would do nothing to prevent slowness of the cold requests, which
already run for 5s when completely cold.

In fact, unless done right, it would actually slow things down, as
lines would not necessarily show up as they're ready.

> update hook on the git repositories to keep the cache always up to
> date.

That's entirely inefficient.  It'll probably take hours or days to
pre-cache all the html pages with a naive wget and the list of all the
files.  Not a solution at all.

(0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
for blame/log)

> There are some dynamic web frontends like cgit and gitweb out there, but
> there are also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
> ) that might be more of an option for you.

The concept for git-arr looks interesting, but it has neither blame
nor log, so, it's kinda pointless, because the whole thing that's slow
is exactly blame and log.

There has to be some way to improve these matters.  No one wants to
wait 5 seconds until a page is generated; we're not running enterprise
software here, latency is important!

C.


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07 17:05   ` Constantine A. Murenin
@ 2013-06-07 17:57     ` Fredrik Gustafsson
  2013-06-07 19:02       ` Constantine A. Murenin
  0 siblings, 1 reply; 8+ messages in thread
From: Fredrik Gustafsson @ 2013-06-07 17:57 UTC (permalink / raw)
  To: Constantine A. Murenin; +Cc: Fredrik Gustafsson, git

On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
> On 6 June 2013 23:33, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
> >> I'm interested in running a web interface to this and other similar
> >> git repositories (FreeBSD and NetBSD git repositories are even much,
> >> much bigger).
> >>
> >> Software-wise, is there no way to make cold access for git-log and
> >> git-blame take orders of magnitude less than ~5s, and warm access
> >> less than ~0.5s?
> >
> > The obvious way would be to cache the results. You can even put an
> 
> That would do nothing to prevent slowness of the cold requests, which
> already run for 5s when completely cold.
> 
> In fact, unless done right, it would actually slow things down, as
> lines would not necessarily show up as they're ready.

You need to cache this _before_ the web request. Don't let the
web request trigger a cache update; let a git push to the repository do it.

> 
> > update hook on the git repositories to keep the cache always up to
> > date.
> 
> That's entirely inefficient.  It'll probably take hours or days to
> pre-cache all the html pages with a naive wget and the list of all the
> files.  Not a solution at all.
> 
> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
> for blame/log)

That's a one-time penalty. Why would that be a problem? And why is wget
even mentioned? Did we misunderstand each other?

> 
> > There are some dynamic web frontends like cgit and gitweb out there, but
> > there are also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
> > ) that might be more of an option for you.
> 
> The concept for git-arr looks interesting, but it has neither blame
> nor log, so, it's kinda pointless, because the whole thing that's slow
> is exactly blame and log.
> 
> There has to be some way to improve these matters.  No one wants to
> wait 5 seconds until a page is generated; we're not running enterprise
> software here, latency is important!
> 
> C.

Git's internal structures make blame alone pretty expensive. There's
nothing you can really do about it algorithm-wise (as far as I know; if
there were, people would already have improved it).

The solution here is to have a "hot" repository to speed up things.

There are of course small things you can do. I imagine that using git
repack in a sane way could probably speed things up, as could git gc.
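
E.g. something along these lines (the window/depth numbers are just
guesses and would need tuning for the repository and the available RAM):

# one-off: repack everything into a single, well-compressed pack
git repack -a -d -f --window=250 --depth=50
# routine housekeeping afterwards
git gc --auto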

-- 
Best regards
Fredrik Gustafsson

tel: 0733-608274
email: iveqy@iveqy.com


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07 17:57     ` Fredrik Gustafsson
@ 2013-06-07 19:02       ` Constantine A. Murenin
  2013-06-07 20:13         ` Charles McGarvey
  0 siblings, 1 reply; 8+ messages in thread
From: Constantine A. Murenin @ 2013-06-07 19:02 UTC (permalink / raw)
  To: Fredrik Gustafsson; +Cc: git

On 7 June 2013 10:57, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> On Fri, Jun 07, 2013 at 10:05:37AM -0700, Constantine A. Murenin wrote:
>> On 6 June 2013 23:33, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
>> > On Thu, Jun 06, 2013 at 06:35:43PM -0700, Constantine A. Murenin wrote:
>> >> I'm interested in running a web interface to this and other similar
>> >> git repositories (FreeBSD and NetBSD git repositories are even much,
>> >> much bigger).
>> >>
>> >> Software-wise, is there no way to make cold access for git-log and
>> >> git-blame take orders of magnitude less than ~5s, and warm access
>> >> less than ~0.5s?
>> >
>> > The obvious way would be to cache the results. You can even put an
>>
>> That would do nothing to prevent slowness of the cold requests, which
>> already run for 5s when completely cold.
>>
>> In fact, unless done right, it would actually slow things down, as
>> lines would not necessarily show up as they're ready.
>
> You need to cache this _before_ the web request. Don't let the
> web request trigger a cache update; let a git push to the repository do it.
>
>>
>> > update hook on the git repositories to keep the cache always up to
>> > date.
>>
>> That's entirely inefficient.  It'll probably take hours or days to
>> pre-cache all the html pages with a naive wget and the list of all the
>> files.  Not a solution at all.
>>
>> (0.5s x 35k files = 5 hours for log/blame, plus another 5h of cpu time
>> for blame/log)
>
> That's a one-time penalty. Why would that be a problem? And why is wget
> even mentioned? Did we misunderstand each other?

`wget` or `curl --head` would be used to trigger the caching.

I don't understand how it's a one-time penalty.  No one wants to look
at an old copy of the repository, so, pretty much, if, say, I want to
have a gitweb of all 4 BSDs, updated daily, then, even
with lots of RAM (e.g. to eliminate the cold-case 5s penalty, and
reduce each page to 0.5s), on a quad-core box, I'd be kinda lucky
to complete a generation of all the pages within 12h or so, obviously
using the machine at, or above, 50% capacity just for the caching.  Or
several days or even a couple of weeks on an Intel Atom or VIA Nano
with 2GB of RAM or so.  Obviously not acceptable; there has to be a
better solution.

One could, I guess, only regenerate the pages which have changed, but
it still sounds like an ugly solution, where you'd have to be
generating a list of files that have changed between one generation and
the next, and you'd still have very high CPU, cache, and storage
requirements.
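
Roughly something like this, I suppose, with "regen-page" standing in
for whatever actually renders the html, and assuming the repository is
a --mirror clone so that a plain fetch updates refs/heads/* directly:

# daily cron job (sketch); regen-page and the path are placeholders
cd /srv/git/dragonfly.git
old=$(git rev-parse master)
git fetch --quiet
git diff --name-only "$old" master |
while read f
do
        regen-page log   "$f"
        regen-page blame "$f"
done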

C.

>> > There are some dynamic web frontends like cgit and gitweb out there, but
>> > there are also static ones like git-arr ( http://blitiri.com.ar/p/git-arr/
>> > ) that might be more of an option for you.
>>
>> The concept for git-arr looks interesting, but it has neither blame
>> nor log, so, it's kinda pointless, because the whole thing that's slow
>> is exactly blame and log.
>>
>> There has to be some way to improve these matters.  No one wants to
>> wait 5 seconds until a page is generated; we're not running enterprise
>> software here, latency is important!
>>
>> C.
>
> Git's internal structures make blame alone pretty expensive. There's
> nothing you can really do about it algorithm-wise (as far as I know; if
> there were, people would already have improved it).
>
> The solution here is to have a "hot" repository to speed up things.
>
> There are of course small things you can do. I imagine that using git
> repack in a sane way could probably speed things up, as could git gc.
>
> --
> Best regards
> Fredrik Gustafsson
>
> tel: 0733-608274
> email: iveqy@iveqy.com


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07 19:02       ` Constantine A. Murenin
@ 2013-06-07 20:13         ` Charles McGarvey
  2013-06-07 20:21           ` Constantine A. Murenin
  0 siblings, 1 reply; 8+ messages in thread
From: Charles McGarvey @ 2013-06-07 20:13 UTC (permalink / raw)
  To: Constantine A. Murenin; +Cc: Fredrik Gustafsson, git


On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>> That's a one-time penalty. Why would that be a problem? And why is wget
>> even mentioned? Did we misunderstand each other?
> 
> `wget` or `curl --head` would be used to trigger the caching.
> 
> I don't understand how it's a one-time penalty.  No one wants to look
> at an old copy of the repository, so, pretty much, if, say, I want to
> have a gitweb of all 4 BSDs, updated daily, then, even
> with lots of RAM (e.g. to eliminate the cold-case 5s penalty, and
> reduce each page to 0.5s), on a quad-core box, I'd be kinda lucky
> to complete a generation of all the pages within 12h or so, obviously
> using the machine at, or above, 50% capacity just for the caching.  Or
> several days or even a couple of weeks on an Intel Atom or VIA Nano
> with 2GB of RAM or so.  Obviously not acceptable; there has to be a
> better solution.
> 
> One could, I guess, only regenerate the pages which have changed, but
> it still sounds like an ugly solution, where you'd have to be
> generating a list of files that have changed between one generation and
> the next, and you'd still have very high CPU, cache, and storage
> requirements.

Have you already ruled out caching on a proxy?  Pages would only be generated
on demand, so the first visitor would still experience the delay but the rest
would be fast until the page expires.  Even expiring pages as often as five
minutes or less would probably provide significant processing savings
(depending on how many users you have), and that level of staleness and the
occasional delays may be acceptable to your users.

As you say, generating the entire cache upfront and continuously is wasteful
and probably unrealistic, but any type of caching, by definition, is going to
involve users seeing stale content, and I don't see that you have any other
option but some type of caching.  Well, you could reproduce what git does in a
bunch of distributed algorithms and run your app on a farm--which, I guess, is
probably what GitHub is doing--but throwing up a caching reverse proxy is a
lot quicker if you can accept the caveats.
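
Even something as small as putting Varnish in front of gitweb might do
(sketch only; the addresses, cache size, and five-minute TTL are made up,
and a real setup would still want a VCL to deal with cookies, cache keys
and so on):

# one-line caching reverse proxy in front of a gitweb on 127.0.0.1:8080
varnishd -a :80 -b 127.0.0.1:8080 -s file,/var/cache/varnish.bin,20G -t 300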

-- 
Charles McGarvey




* Re: is there a fast web-interface to git for huge repos?
  2013-06-07 20:13         ` Charles McGarvey
@ 2013-06-07 20:21           ` Constantine A. Murenin
  2013-06-14 10:55             ` Holger Hellmuth (IKS)
  0 siblings, 1 reply; 8+ messages in thread
From: Constantine A. Murenin @ 2013-06-07 20:21 UTC (permalink / raw)
  To: Charles McGarvey; +Cc: Fredrik Gustafsson, git

On 7 June 2013 13:13, Charles McGarvey <chazmcgarvey@brokenzipper.com> wrote:
> On 06/07/2013 01:02 PM, Constantine A. Murenin wrote:
>>> That's a one-time penalty. Why would that be a problem? And why is wget
>>> even mentioned? Did we misunderstand each other?
>>
>> `wget` or `curl --head` would be used to trigger the caching.
>>
>> I don't understand how it's a one-time penalty.  No one wants to look
>> at an old copy of the repository, so, pretty much, if, say, I want to
>> have a gitweb of all 4 BSDs, updated daily, then, even
>> with lots of RAM (e.g. to eliminate the cold-case 5s penalty, and
>> reduce each page to 0.5s), on a quad-core box, I'd be kinda lucky
>> to complete a generation of all the pages within 12h or so, obviously
>> using the machine at, or above, 50% capacity just for the caching.  Or
>> several days or even a couple of weeks on an Intel Atom or VIA Nano
>> with 2GB of RAM or so.  Obviously not acceptable; there has to be a
>> better solution.
>>
>> One could, I guess, only regenerate the pages which have changed, but
>> it still sounds like an ugly solution, where you'd have to be
>> generating a list of files that have changed between one generation and
>> the next, and you'd still have very high CPU, cache, and storage
>> requirements.
>
> Have you already ruled out caching on a proxy?  Pages would only be generated
> on demand, so the first visitor would still experience the delay but the rest
> would be fast until the page expires.  Even expiring pages as often as five
> minutes or less would probably provide significant processing savings
> (depending on how many users you have), and that level of staleness and the
> occasional delays may be acceptable to your users.
>
> As you say, generating the entire cache upfront and continuously is wasteful
> and probably unrealistic, but any type of caching, by definition, is going to
> involve users seeing stale content, and I don't see that you have any other
> option but some type of caching.  Well, you could reproduce what git does in a
> bunch of distributed algorithms and run your app on a farm--which, I guess, is
> probably what GitHub is doing--but throwing up a caching reverse proxy is a
> lot quicker if you can accept the caveats.

I don't think GitHub / Gitorious / whatever have solved this problem
at all.  They're terribly slow on big repos; some pages don't even
generate the first time you click on the link.

I'm totally fine with daily updates; but I think there still has to be
some better way of doing this than wasting 0.5s of CPU time and 5s of
HDD time (if completely cold) on each blame / log, even if it costs
more storage, some pre-caching, and (in my use-case, daily)
fine-grained incremental updates.

C.


* Re: is there a fast web-interface to git for huge repos?
  2013-06-07 20:21           ` Constantine A. Murenin
@ 2013-06-14 10:55             ` Holger Hellmuth (IKS)
  0 siblings, 0 replies; 8+ messages in thread
From: Holger Hellmuth (IKS) @ 2013-06-14 10:55 UTC (permalink / raw)
  To: Constantine A. Murenin; +Cc: Charles McGarvey, Fredrik Gustafsson, git

On 07.06.2013 22:21, Constantine A. Murenin wrote:
> I'm totally fine with daily updates; but I think there still has to be
> some better way of doing this than wasting 0.5s of CPU time and 5s of
> HDD time (if completely cold) on each blame / log, even if it costs
> more storage, some pre-caching, and (in my use-case, daily)
> fine-grained incremental updates.

To get a feel for the numbers: I would guess 'git blame' is mostly run
against the newest version and the release version of a file, right? I
couldn't find the number of files in BSD, so let's take Linux instead:
that is 25k files for version 2.6.27. Let's say 35k files altogether for
both release and newer versions of the files.

A typical page of git blame output on GitHub seems to be in the vicinity
of 500 kbytes, but that seems to include lots of overhead for convenience
features. At least that means it is a good upper bound.

35k files times 500k gives 17.5 Gbytes, a trivial value for a static
*disk*-based cache. It is also a manageable value for affordable SSDs.


end of thread, other threads:[~2013-06-14 10:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-07  1:35 is there a fast web-interface to git for huge repos? Constantine A. Murenin
2013-06-07  6:33 ` Fredrik Gustafsson
2013-06-07 17:05   ` Constantine A. Murenin
2013-06-07 17:57     ` Fredrik Gustafsson
2013-06-07 19:02       ` Constantine A. Murenin
2013-06-07 20:13         ` Charles McGarvey
2013-06-07 20:21           ` Constantine A. Murenin
2013-06-14 10:55             ` Holger Hellmuth (IKS)
