From: Linus Torvalds <torvalds@osdl.org>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>,
Martin Langhoff <martin.langhoff@gmail.com>,
Git Mailing List <git@vger.kernel.org>,
"H. Peter Anvin" <hpa@zytor.com>,
Rogan Dawes <discard@dawes.za.net>,
Kernel Org Admin <ftpadmin@kernel.org>
Subject: Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
Date: Sun, 10 Dec 2006 11:50:15 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0612101129190.12500@woody.osdl.org>
In-Reply-To: <200612102011.52589.jnareb@gmail.com>
On Sun, 10 Dec 2006, Jakub Narebski wrote:
> >> If-Modified-Since:, If-Match:, If-None-Match: do you?
>
> And in the CGI standard there is a way to access additional HTTP header
> info from a CGI script: each header is exposed as an HTTP_<HEADER>
> environment variable. For example, if the browser sent an
> If-Modified-Since: header, its value can be found in the
> HTTP_IF_MODIFIED_SINCE environment variable.
Guys, you're missing something fairly fundamental.
It helps almost _nothing_ to support client-side caching with all these
fancy "If-Modified-Since:" etc crap.
That's not the _problem_.
It's usually not one client asking for the gitweb pages: the load comes
from just lots of people independently asking for it. So client-side
caching may help a tiny tiny bit, but it's not actually fixing the
fundamental problem at all.
So forget about "If-Modified-Since:" etc. It may help in benchmarks when
you try it yourself, and use "refresh" on the client side. But the basic
problem is all about lots of clients that do NOT have things cached,
because the client caches are all filled up with pr0n, not with gitweb
data from yesterday.
So the thing to help is server-side caching with good access patterns, so
that the server won't have to seek all over the disk when clients that
_don't_ have things in their caches want to see the "git projects" summary
overview (that currently lists something like 200+ projects).
So to get that list of 200+ projects, right now gitweb will literally walk
them all, look at their refs, their descriptions, their ages (which
requires looking up the refs, and the objects behind the refs), and if
they aren't cached, you're going to have several disk seeks for each
project.
At 200+ projects, the thing that makes it slow is those disk seeks. Even
with a fast disk and RAID array, the seeks are all basically going to be
interdependent, so there's no room for disk arm movement optimization, and
in the absence of any other load it's still going to be several seconds
just for the seeks. Say 10ms per seek and four or five seeks per project:
at 200 projects that's 8 to 10 seconds _just_ for the seeks to generate the
top-level summary page, and quite frankly, five seeks is probably optimistic.
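To put numbers on that estimate, here is a back-of-the-envelope sketch. The inputs are the rough figures above (200 projects, 10ms per seek, a handful of seeks per project), not measurements, and the function name is made up for illustration.

```python
# Rough seek-cost estimate for generating the project summary page.
# All inputs are guesses taken from the discussion, not measured values.

def seek_time_seconds(projects, seeks_per_project, ms_per_seek=10.0):
    """Time spent in dependent (serialized) disk seeks, in seconds."""
    return projects * seeks_per_project * ms_per_seek / 1000.0


if __name__ == "__main__":
    for seeks in (4, 5, 10):
        total = seek_time_seconds(200, seeks)
        print(f"{seeks} seeks/project: {total:.1f} s")
```

Because the seeks depend on each other, they cannot be overlapped or reordered, so the total scales linearly with both the project count and the seeks per project.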
Now, hopefully some of it will be in the disk cache, but when the
mirroring happens, it will basically blow the disk caches away totally
(when using the "--checksum" option), and then you literally have tens of
seconds to generate that one top-level page.
And when mirroring is blowing out the disk caches, the thing will be doing
other things _too_ to the disk, of course.
So what you want is server-side caching, and you basically _never_ want to
re-generate that data synchronously (because even if the server can take
the load, having the clients wait for half a minute or more for the data
is just NOT FRIENDLY). This is why I suggested the grace-period where we
fill the cache on the server side in the background _while_at_the_same_time_
actually feeding the clients the old cached contents.
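A minimal sketch of that grace-period idea, in Python rather than gitweb's Perl. This is not gitweb code: the class name GraceCache, the generate callback, and the max_age parameter are all hypothetical, and a real deployment would cache on disk and across processes rather than in one process's memory.

```python
import threading
import time


class GraceCache:
    """Serve the cached page at once; refresh stale entries in the background."""

    def __init__(self, generate, max_age=300.0, clock=time.monotonic):
        self._generate = generate    # slow function that builds the page
        self._max_age = max_age      # seconds before an entry counts as stale
        self._clock = clock
        self._lock = threading.Lock()
        self._page = None
        self._stamp = None
        self._refreshing = False

    def get(self):
        with self._lock:
            if self._page is not None:
                stale = self._clock() - self._stamp > self._max_age
                if stale and not self._refreshing:
                    # Start exactly one background regeneration.
                    self._refreshing = True
                    threading.Thread(target=self._refresh, daemon=True).start()
                return self._page    # possibly old, but returned immediately
        # Cold cache: nothing to serve yet, so this caller has to wait.
        # (Concurrent cold callers may each regenerate; a real server
        # would want to serialize that as well.)
        return self._refresh()

    def _refresh(self):
        try:
            page = self._generate()  # the slow part runs outside the lock
            with self._lock:
                self._page = page
                self._stamp = self._clock()
            return page
        finally:
            with self._lock:
                self._refreshing = False
```

The point of the design is that once the cache is warm, no client ever waits for a regeneration: a stale hit returns the old page and at most one background thread rebuilds it.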
Because what matters most to _clients_ is not getting the most recent
up-to-date data within the last few minutes - people who go to the
overview page want to just get a list of projects, and they want to get
them in a second or two, not half a minute later.
And btw, all those "If-Modified-Since:" things are irrelevant, since quite
often, the top-level page really technically _has_ been modified in the
last few minutes, because with the kernel and git projects, _somebody_ has
usually pushed out one of the projects within the last hour.
And no, people don't just sit there refreshing their browser page all the
time. I bet even "active" git users do it at most once or twice a day,
which means that their client cache will _never_ be up-to-date.
But if you do it with server-side caches and grace-periods, you can
generally say "we have something that is at most five minutes old", and
most importantly, you can hopefully do it without a lot of disk seeks
(because you just cache the _one_ page as _one_ object), so hopefully you
can do it in a few hundred ms even if the thing is on disk and even if
there's a lot of other load going on.
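Caching the page as one object could look something like the sketch below, assuming a plain file on local disk. The function names store_page and load_page and the cache path are hypothetical, not anything gitweb provides. The new copy is written to a temporary file and renamed into place, so a reader gets either the old page or the new one in a single sequential read.

```python
import os
import tempfile


def store_page(cache_path, html):
    """Write the rendered page as one file, replacing the old copy atomically."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # Rename over the old file: readers never see a half-written page.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_page(cache_path):
    """Return the cached page, or None if nothing has been cached yet."""
    try:
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

The temporary file is created in the same directory as the target so that the rename stays on one filesystem.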
I bet the top-level "all projects" summary page and the individual
project summary pages are the important things to cache. That's what
probably most people look at, and they are the ones that have lots of
server-side cache locality. Individual commits and diffs probably don't
get the same kind of "lots of people looking at them" and thus don't get
the same kind of benefit from caching.
(Individual commits hopefully also need fewer disk seeks, at least with
packed repositories. So even if you have to re-generate them from scratch,
they won't have the seek times themselves taking up tens of seconds,
unless the project is entirely unpacked and diffing just generates total
disk seek hell)