* [RFC PATCHv2 00/10] gitweb: Simple file based output caching
@ 2010-02-09 10:30 Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 01/10] gitweb: href(..., -path_info => 0|1) Jakub Narebski
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

This 10-patch series is intended as a preliminary version of splitting
the large 'gitweb: File based caching layer (from git.kernel.org)'
mega-patch by John 'Warthog9' Hawley aka J.H. into smaller pieces,
starting small and adding features piece by piece.

This is the second version (second release) of this series; the previous
one is available at http://repo.or.cz/w/git/jnareb-git.git in the
'gitweb/cache-kernel' branch.  It was sent as:
* [RFC PATCH 00/10] gitweb: Simple file based output caching
  Message-Id: <cover.1264198194.git.jnareb@gmail.com>
  http://thread.gmane.org/gmane.comp.version-control.git/136913/focus=136917
(note however that the v1 series of emails is missing one of the patches,
because it was over the VGER anti-spam size limit for messages).

This version tries to do without the
  gitweb: Print to explicit filehandle (preparing for caching)
patch, by capturing output using either PerlIO layers manipulated via
PerlIO::Util (if that module is available), or direct manipulation of
*STDOUT (if PerlIO::Util isn't available).  One of the goals of this
series is therefore to decide whether it is worth having an explicit
filehandle in print statements in gitweb or not, i.e. whether the added
complexity of capturing is worth not having to deal with a
straightforward but quite intrusive (and large) patch.

As the earlier version was inspired by the file-based caching in
Cache::Cache, this one is inspired by the file-based caching in the more
modern CHI (unified cache interface).

It still lacks POD for gitweb/cache.pm (would it be needed, or would
comments be enough?), and gitweb/cache.pm still ties rather heavily
into gitweb (still following what was in the original J.H. (mega)patch).

It *does* have quite detailed commit messages, as opposed to v1 of
this series, where some commits were described only in the comment
section of the emails containing them.  It is also very configurable
(Pasky, this would probably be of interest to you, as you didn't want
to have "Generating..." pages enabled), even more so than the original
patch by J.H.


NOTE: there are quite a few _API_ tests, but I have not tested gitweb
output with caching enabled extensively (that's how the bug in
"Generating..." slipped through - for details see comments in the last
patch).  I have tested that caching works around the 4th patch in the
series, in that it doesn't cause errors and displays the page (here the
lack of error handling is decidedly unhelpful), and that it displays the
time when the page was generated.  As I have installed PerlIO::Util using
local::lib, i.e. locally in ~/perl5, I think that what I have been
testing was the "*STDOUT munging" method of capturing gitweb output.
(See the "Capturing gitweb output" section in PATCHv2 04/10.)


This series is based on commit 8424981934c415bd20643de9cc932bd348dfb115:
(in the 'master' branch of git.git repository)
  Jeff King (1):
        Fix invalid read in quote_c_style_counted

and is available in the git repository at:

  git://repo.or.cz/git/jnareb-git.git gitweb/cache-kernel-v2

Jakub Narebski (10):
  gitweb: href(..., -path_info => 0|1)
  gitweb/cache.pm - Very simple file based caching
  gitweb/cache.pm - Stat-based cache expiration
  gitweb: Use Cache::Cache compatible (get, set) output caching
  gitweb/cache.pm - Adaptive cache expiration time
  gitweb: Use CHI compatible (compute method) caching
  gitweb/cache.pm - Use locking to avoid 'cache miss stampede' problem
  gitweb/cache.pm - Serve stale data when waiting for filling cache
  gitweb/cache.pm - Regenerate (refresh) cache in background
  gitweb: Show appropriate "Generating..." page when regenerating cache

Note that compared to the previous version of this series, this version
lacks the initial commit
  gitweb: Print to explicit filehandle (preparing for caching)
This is a bit of an experiment to see if we can do caching without a
large upfront patch to gitweb, and to decide whether the tradeoff (more
complicated capturing) is worth it.

Also, one of the commits:
  gitweb/cache.pm - Serve stale data when waiting for filling cache (WIP)
was split into two separate commits:
  gitweb/cache.pm - Serve stale data when waiting for filling cache
  gitweb/cache.pm - Regenerate (refresh) cache in background
one serving stale data (in processes waiting for the cache to be filled,
aka readers), and one adding background cache regeneration.

After the previous series I sent an additional (PATCH 11/10) patch:
  gitweb: Ajax-y "Generating..." page when regenerating cache (WIP)
This patch would require rework to apply to this new series.

Diffstat:
~~~~~~~~~
 gitweb/README                          |   70 +++++
 gitweb/cache.pm                        |  530 ++++++++++++++++++++++++++++++++
 gitweb/gitweb.perl                     |  305 +++++++++++++++++-
 t/gitweb-lib.sh                        |    2 +
 t/t9500-gitweb-standalone-no-errors.sh |   19 ++
 t/t9503-gitweb-caching.sh              |   32 ++
 t/t9503/test_cache_interface.pl        |  380 +++++++++++++++++++++++
 t/test-lib.sh                          |    3 +
 8 files changed, 1325 insertions(+), 16 deletions(-)
 create mode 100644 gitweb/cache.pm
 create mode 100755 t/t9503-gitweb-caching.sh
 create mode 100755 t/t9503/test_cache_interface.pl


* [RFC PATCHv2 01/10] gitweb: href(..., -path_info => 0|1)
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 02/10] gitweb/cache.pm - Very simple file based caching Jakub Narebski
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

If named boolean option -path_info is passed to href() subroutine, use
its value to decide whether to generate path_info URL form.  If this
option is not passed, href() queries 'pathinfo' feature to check
whether to generate path_info URL (if generating path_info link is
possible at all).

href(-replay=>1, -path_info=>0) is meant to be used to generate a key
for caching gitweb output; an alternative solution would be to use freeze()
from Storable (a core module) on the %input_params hash (or its reference),
e.g.:
  $key = freeze \%input_params;
or some other serialization technique.

While at it, document the extra options/flags to href().

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
There is no change to the patch itself in v2; only the explanation in the
commit message got changed a bit.


The reason behind -path_info=>0 is that we want to have the same cache
entry (the same cache file) regardless of whether we use a path_info
URL or a non-path_info URL.

J.H. recommends using href(..., -full=>1), to be able to have two gitweb
installations (or one gitweb installation but with $projects_root
depending on the virtual server used) share a cache without worrying
that they would stomp over each other's cache entries.  On the other
hand this means that moving to another server [name], or changing the
[virtual] location of gitweb, means that you need to re-cache
everything.  The discussion can be found in:
  http://thread.gmane.org/gmane.comp.version-control.git/136913/focus=137061

Note that in the caching patch by J.H. from the "Gitweb caching v5" thread
(and the top commit in git://git.kernel.org/pub/scm/git/warthog9/gitweb.git,
gitweb-ml-v5 branch) the key was generated as "$my_url?".$ENV{'QUERY_STRING'},
which wouldn't work with path_info URLs; on the other hand gitweb
at git.kernel.org doesn't use path_info URLs (and perhaps doesn't even
support them).

Using href(-replay=>1, -full=>1, -path_info=>0) has the additional
advantage over using $cgi->self_url() that it also does not depend on
the ordering of parameters in handcrafted URLs.
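
To make the comparison concrete, here is a rough sketch of the two key
generation alternatives (the Storable variant is hypothetical and not part
of this series; $Storable::canonical would be needed there to get a stable
key):

  # key based on href(), as used for output caching later in this series
  my $key = href(-replay => 1, -full => 1, -path_info => 0);

  # hypothetical alternative: serialize the parsed input parameters
  use Storable qw(freeze);
  local $Storable::canonical = 1;  # sort hash keys for a reproducible key
  my $frozen_key = freeze(\%input_params);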

 gitweb/gitweb.perl |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 1f6978a..97ea3ec 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -970,6 +970,10 @@ exit;
 ## ======================================================================
 ## action links
 
+# possible values of extra options
+# -full => 0|1      - use absolute/full URL ($my_uri/$my_url as base)
+# -replay => 1      - start from a current view (replay with modifications)
+# -path_info => 0|1 - don't use/use path_info URL (if possible)
 sub href {
 	my %params = @_;
 	# default is to use -absolute url() i.e. $my_uri
@@ -986,7 +990,8 @@ sub href {
 	}
 
 	my $use_pathinfo = gitweb_check_feature('pathinfo');
-	if ($use_pathinfo and defined $params{'project'}) {
+	if (defined $params{'project'} &&
+	    (exists $params{-path_info} ? $params{-path_info} : $use_pathinfo)) {
 		# try to put as many parameters as possible in PATH_INFO:
 		#   - project name
 		#   - action
-- 
1.6.6.1


* [RFC PATCHv2 02/10] gitweb/cache.pm - Very simple file based caching
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 01/10] gitweb: href(..., -path_info => 0|1) Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 03/10] gitweb/cache.pm - Stat-based cache expiration Jakub Narebski
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

This is the first step towards implementing a file-based output (response)
caching layer like the one used on such large sites as kernel.org.

This patch introduces the GitwebCache::SimpleFileCache package, which
follows the Cache::Cache / CHI interface, although it does not implement
it fully.  The intent of following an established convention is to be able
in the future to replace our simple file based cache, e.g. by one using
memcached.

Like in the original patch by John 'Warthog9' Hawley (J.H.) (the one this
commit intends to be an incremental step towards), the data is stored in
the cache as-is, without adding metadata (like an expiration date), and
without serialization (which means only scalar data can be stored).

To be implemented (from original patch by J.H.):
* cache expiration (based on file stats, current time and global
  expiration time); currently elements in cache do not expire
* actually using this cache in gitweb, except error pages
* adaptive cache expiration, based on average system load
* optional locking interface, where only one process can update cache
  (using flock)
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Possible extensions (beyond what was in original patch):
* (optionally) show information about cache utilization
* AJAX (JavaScript-based) progress indicator
* JavaScript code to update relative dates in cached output
* make cache size-aware (try to not exceed specified maximum size)
* utilize X-Sendfile header (or equivalent) to show cached data
  (optional, as it makes sense only if web server supports sendfile
  feature and has it enabled)
* variable expiration feature from CHI, allowing items to expire a bit
  earlier than the stated expiration time to prevent cache miss
  stampedes (although locking, if available, should take care of
  this).

The code of the GitwebCache::SimpleFileCache package in gitweb/cache.pm
was heavily based on the file-based cache in the Cache::Cache package, i.e.
on Cache::FileCache, Cache::FileBackend and Cache::BaseCache, and on the
file-based cache in CHI, i.e. on CHI::Driver::File and CHI::Driver
(including implementing atomic writes, something that the original patch
lacks).

This patch does not yet enable output caching in gitweb (it doesn't
have all the required features yet); on the other hand it includes tests,
currently testing only the cache's Perl API.

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
The previous version of this patch had large parts of its code based
_heavily_ on the Cache::FileCache implementation (including Cache::FileBackend
and Cache::BaseCache) from the Cache::Cache distribution (which is dual-licensed
under the (Perl) Artistic License and the GNU General Public License, like
Perl itself).

This version is based on the file-based caching from CHI (Unified Caching
Interface), which is a more modern package, and I think better engineered
(with a better design).  Nevertheless gitweb/cache.pm is meant to use minimal
dependencies, like gitweb itself, so it won't use Moose / Mouse for OOP
(Object Oriented Programming) like CHI does.  CHI is also licensed "under
the same terms as Perl itself", which means dual Artistic/GPL licensing.

Note that GitwebCache::SimpleFileCache implements neither the CHI interface
nor the Cache::Cache interface fully; it is meant to be a very simple, but
hopefully fast, cache.
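
For reference, a rough sketch of how the cache introduced by this patch is
meant to be used (it mirrors what t/t9503/test_cache_interface.pl below
exercises; generate_page() is just a stand-in for whatever produces the
data):

  my $cache = GitwebCache::SimpleFileCache->new({
      'cache_root'  => 'cache',  # or 'root_dir' (CHI-style alias)
      'cache_depth' => 2,        # or 'depth'
      'namespace'   => '',
  });

  my ($key, $data) = ('Test Key', 'Test Value');
  $cache->set($key, $data);               # store scalar data as-is
  my $cached = $cache->get($key);         # undef if not present
  my $page = $cache->compute($key, sub {  # get, or generate and set
      return generate_page();
  });
  $cache->remove($key);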

Differences from v1:
* Allow for alternate CHI-compatible names for parameters passed to
  GitwebCache::SimpleFileCache constructor: 'cache_root' / 'root_dir',
  'cache_depth' / 'depth' (not sure if it is worth it).
* Accessors (getters and setters, i.e. ->get_depth() / ->set_depth($depth)
  methods) are generated, instead of using a lot of boilerplate code.  The
  names of the fields changed a bit, too.
* Change the design of the "engine" part from one inspired by the older
  Cache::Cache to one inspired by the newer and more modern CHI.  Among
  other things this means that there are no 'private' subroutines (with
  '_'-prefixed names), and that inner (engine) methods no longer pass the
  namespace as a parameter - for example in place of
  ->restore($namespace, $key) there is ->fetch($key).
* ->fetch() and ->store() methods use fast slurp (from CHI::Driver::File,
  which in turn has it adapted from File::Slurp), which uses sys{read,write}.
* ->fetch() and ->store() uses ':raw' (binary) mode to read and write data.
* ->set() has protection against writing undefined data (which would fail).
* Does not use new make_path, but older mkpath interface from File::Path;
  this way we do not require File::Path version >= 2.0.

Differences from relevant parts of J.H. patch:
* Does not use make_path, but mkpath from File::Path (see above)
* Uses catfile/catdir from File::Spec (see below)
* Writes to cache file atomically, by writing to unique temporary file
  and then renaming/moving this file to the cache file.  This is to make
  caching more robust, by preventing readers from getting partial data.
  The issue this prevents should be very rare, as we write whole data
  (whole page) at once, after it is generated in full, so this situation
  can theoretically happen only in the case of heavy load, many clients,
  and very large page over the size of (file)system cache.
* Currently it does not do any error reporting, as opposed to J.H. patch,
  in which errors are reported via die_error from gitweb (entangling
  cache.pm and gitweb.perl even more).  This is planned to be fixed in the
  final version of this patch series.
* The depth of the cache hierarchy is configurable, while the J.H. patch
  uses a hardcoded depth of 1 subdirectory.  It uses unpack rather than
  substr: for a 06b90e786e... key digest it uses 06/b90e786e... (with
  cache_depth = 1) rather than 06/06b90e786e... as in the J.H. patch (see
  the key-to-path sketch after this list).  It does not keep binary and
  non-binary cache data separately; it does not use a '.bin' suffix
  (extension) for binary-data cache files.
* GitwebCache::SimpleFileCache uses standard ->get, ->set, ->compute
  methods, which should allow easy replacing it with CHI or Cache::Cache
  cache (see later commit, adding caching support to gitweb).
* Tests of caching API in t/t9503-gitweb-caching.sh
* Better Perl style, more consistent with what gitweb itself uses.
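
To make the key-to-path mapping mentioned above concrete, a small sketch of
what path_to_key() in this patch does (the digest value is just an example):

  use Digest::MD5 qw(md5_hex);
  my $digest = md5_hex($key);   # e.g. '06b90e786e...'
  # with namespace '' and cache_depth = 2 (the default):
  #   <cache_root>/06/b9/0e786e...
  # with cache_depth = 1 (closest to the J.H. layout discussed above):
  #   <cache_root>/06/b90e786e...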

Possible improvements:
* CHI uses fast_catpath from CHI::Util, which for Unix reduces to simple
  join("/", @_).  GitwebCache::SimpleFileCache uses catpath (remainder of
  Cache::FileCache-based code), but gitweb itself uses just "$root/$path",
  so this over-portability just wastes performance.
* We use File::Temp to generate temporary file to write data.  CHI uses
  instead unique ids (Data::UUID to generate initial unique id, then suffix
  it with serial number, in hexadecimal) for efficiency.
  Note that with locking (introduced later in series) we could write to
  temporary file with pre-defined (not randomized/incremental) name.
* GitwebCache::SimpleFileCache over-uses accessors in its methods, where
  referring directly to object fields would (usually) be enough.  Also
  the fields should probably be named just 'field_key' rather than
  '_Field_key' (again a leftover of the old-style Cache::FileCache-based
  code).
* The change to t/test-lib.sh, which exports TEST_DIRECTORY and
  TRASH_DIRECTORY in test_external to allow their use in external tests,
  should perhaps be split out into a separate commit (it is required for
  t9503 to find cache.pm).
* I should probably update the copyright statement in cache.pm (just noticed).

 gitweb/cache.pm                 |  269 +++++++++++++++++++++++++++++++++++++++
 t/t9503-gitweb-caching.sh       |   32 +++++
 t/t9503/test_cache_interface.pl |   84 ++++++++++++
 t/test-lib.sh                   |    3 +
 4 files changed, 388 insertions(+), 0 deletions(-)
 create mode 100644 gitweb/cache.pm
 create mode 100755 t/t9503-gitweb-caching.sh
 create mode 100755 t/t9503/test_cache_interface.pl

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
new file mode 100644
index 0000000..aebab01
--- /dev/null
+++ b/gitweb/cache.pm
@@ -0,0 +1,269 @@
+# gitweb - simple web interface to track changes in git repositories
+#
+# (C) 2006, John 'Warthog9' Hawley <warthog19@eaglescrag.net>
+#
+# This program is licensed under the GPLv2
+
+#
+# Gitweb caching engine
+#
+
+{
+# Minimalistic cache that stores data in the filesystem, without serialization
+# and currently without any kind of cache expiration (all keys last forever until
+# they get explicitly removed)
+#
+# It follows Cache::Cache and CHI interface (but does not implement it fully)
+
+package GitwebCache::SimpleFileCache;
+
+use strict;
+use warnings;
+
+use File::Path qw(mkpath);
+use File::Spec;
+use File::Temp;
+use Digest::MD5 qw(md5_hex);
+
+# by default, the cache nests all entries on the filesystem two
+# directories deep
+
+our $DEFAULT_CACHE_DEPTH = 2;
+
+# by default, the root of the cache is located in 'cache'.
+
+our $DEFAULT_CACHE_ROOT = "cache";
+
+# ......................................................................
+# constructor
+
+# The options are set by passing in a reference to a hash containing
+# any of the following keys:
+#  * 'namespace'
+#    The namespace associated with this cache.  This allows easy separation of
+#    multiple, distinct caches without worrying about key collision.  Defaults
+#    to '' (which does not allow for simple implementation of clear() method).
+#  * 'cache_root' (Cache::FileCache compatible),
+#    'root_dir' (CHI::Driver::File compatible),
+#    The location in the filesystem that will hold the root of the cache.
+#    Defaults to 'cache', relative to gitweb.cgi directory.
+#  * 'cache_depth' (Cache::FileCache compatible),
+#    'depth' (CHI::Driver::File compatible),
+#    The number of subdirectories deep to cache object item.  This should be
+#    large enough that no cache directory has more than a few hundred objects.
+#    Defaults to $DEFAULT_CACHE_DEPTH (2) unless explicitly set.
+sub new {
+	my ($proto, $p_options_hash_ref) = @_;
+
+	my $class = ref($proto) || $proto;
+	my $self  = {};
+	$self = bless($self, $class);
+
+	my ($root, $depth, $ns);
+	if (defined $p_options_hash_ref) {
+		$root  =
+			$p_options_hash_ref->{'cache_root'} ||
+			$p_options_hash_ref->{'root_dir'};
+		$depth =
+			$p_options_hash_ref->{'cache_depth'} ||
+			$p_options_hash_ref->{'depth'};
+		$ns    = $p_options_hash_ref->{'namespace'};
+	}
+	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
+	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
+	$ns    = '' unless defined($ns);
+
+	$self->set_root($root);
+	$self->set_depth($depth);
+	$self->set_namespace($ns);
+
+	return $self;
+}
+
+# ......................................................................
+# accessors
+
+# http://perldesignpatterns.com/perldesignpatterns.html#AccessorPattern
+
+foreach my $i (qw(depth root namespace)) {
+	my $field = $i;
+	no strict 'refs';
+	*{"get_$field"} = sub {
+		my $self = shift;
+		return $self->{'_'.ucfirst($field)};
+	};
+	*{"set_$field"} = sub {
+		my ($self, $value) = @_;
+		$self->{'_'.ucfirst($field)} = $value;
+	};
+}
+
+# ----------------------------------------------------------------------
+# utility functions and methods
+
+# Return root dir for namespace (lazily built, cached)
+sub path_to_namespace {
+	my ($self) = @_;
+
+	if (!exists $self->{'_Path_to_namespace'}) {
+		$self->{'_Path_to_namespace'} = File::Spec->catfile(
+			$self->get_root(),
+			$self->get_namespace()
+		);
+	}
+	return $self->{'_Path_to_namespace'};
+}
+
+# Take a human readable key, and return the file path
+sub path_to_key {
+	my ($self, $key, $dir_ref) = @_;
+
+	my @paths = ( $self->path_to_namespace() );
+
+	# Create a unique (hashed) key from human readable key
+	my $filename = md5_hex($key); # or $digester->add($key)->hexdigest()
+
+	# Split filename so that it has DEPTH subdirectories,
+	# where each subdirectory has a two-letter name
+	push @paths, unpack("(a2)".($self->{'_Depth'} || 1)." a*", $filename);
+	$filename = pop @paths;
+
+	# Join paths together, computing dir separately if $dir_ref was passed.
+	my $filepath;
+	if (defined $dir_ref && ref($dir_ref)) {
+		my $dir = File::Spec->catdir(@paths);
+		$filepath = File::Spec->catfile($dir, $filename);
+		$$dir_ref = $dir;
+	} else {
+		$filepath = File::Spec->catfile(@paths, $filename);
+	}
+
+	return $filepath;
+}
+
+# ----------------------------------------------------------------------
+# worker methods
+
+sub fetch {
+	my ($self, $key) = @_;
+
+	my $file = $self->path_to_key($key);
+	return undef unless (defined $file && -f $file);
+
+	# Fast slurp, adapted from File::Slurp::read, with unnecessary options removed
+	# via CHI::Driver::File (from CHI-0.33)
+	my $buf = '';
+	open my $read_fh, '<', $file
+		or return undef;
+	binmode $read_fh, ':raw';
+
+	my $size_left = -s $read_fh;
+
+	while ($size_left > 0) {
+		my $read_cnt = sysread($read_fh, $buf, $size_left, length($buf));
+		return undef unless defined $read_cnt;
+
+		last if $read_cnt == 0;
+		$size_left -= $read_cnt;
+		#last if $size_left <= 0;
+	}
+
+	return $buf;
+}
+
+sub store {
+	my ($self, $key, $data) = @_;
+
+	my $dir;
+	my $file = $self->path_to_key($key, \$dir);
+	return undef unless (defined $file && defined $dir);
+
+	# ensure that directory leading to cache file exists
+	mkpath($dir, 0, 0777) if !-d $dir;
+
+	# generate a temporary file
+	my $temp = File::Temp->new(
+		#DIR => $dir,
+		TEMPLATE => "${file}_XXXXX",
+		SUFFIX => ".tmp"
+	);
+
+	# Fast spew, adapted from File::Slurp::write, with unnecessary options removed
+	# via CHI::Driver::File (from CHI-0.33)
+	my $write_fh = $temp;
+	binmode $write_fh, ':raw';
+
+	my $size_left = length($data);
+	my $offset = 0;
+
+	while ($size_left > 0) {
+		my $write_cnt = syswrite($write_fh, $data, $size_left, $offset);
+		return undef unless defined $write_cnt;
+		
+		$size_left -= $write_cnt;
+		$offset += $write_cnt; # == length($data);
+	}
+
+	close($temp);
+	rename($temp, $file);
+}
+
+sub get_size {
+	my ($self, $key) = @_;
+
+	my $path = $self->path_to_key($key)
+		or return undef;
+	if (-f $path) {
+		return -s $path;
+	}
+	return 0;
+}
+
+# ......................................................................
+# interface methods
+
+# Removing and expiring
+
+sub remove {
+	my ($self, $key) = @_;
+
+	my $file = $self->path_to_key($key)
+		or return undef;
+	unlink($file);
+}
+
+# Getting and setting
+
+sub set {
+	my ($self, $key, $data) = @_;
+
+	return unless (defined $key && defined $data);
+
+	$self->store($key, $data);
+}
+
+sub get {
+	my ($self, $key) = @_;
+
+	my $data = $self->fetch($key)
+		or return undef;
+
+	return $data;
+}
+
+sub compute {
+	my ($self, $p_key, $p_coderef) = @_;
+
+	my $data = $self->get($p_key);
+	if (!defined $data) {
+		$data = $p_coderef->($self, $p_key);
+		$self->set($p_key, $data);
+	}
+
+	return $data;
+}
+
+1;
+} # end of package GitwebCache::SimpleFileCache;
+
+1;
diff --git a/t/t9503-gitweb-caching.sh b/t/t9503-gitweb-caching.sh
new file mode 100755
index 0000000..768080c
--- /dev/null
+++ b/t/t9503-gitweb-caching.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+#
+# Copyright (c) 2010 Jakub Narebski
+#
+
+test_description='caching interface to be used in gitweb'
+#test_description='caching interface used in gitweb, gitweb caching
+#
+#This test checks cache (interface) used in gitweb caching, caching
+#infrastructure and gitweb response (output) caching (the last by
+#running gitweb as CGI script from commandline).'
+
+# for now we are running only cache interface tests
+. ./test-lib.sh
+
+# this check is also present in gitweb-lib.sh
+if ! test_have_prereq PERL; then
+	say 'perl not available, skipping test'
+	test_done
+fi
+
+"$PERL_PATH" -MTest::More -e 0 >/dev/null 2>&1 || {
+	say 'perl module Test::More unavailable, skipping test'
+	test_done
+}
+
+# ----------------------------------------------------------------------
+
+test_external 'GitwebCache::* Perl API (in gitweb/cache.pm)' \
+	"$PERL_PATH" "$TEST_DIRECTORY"/t9503/test_cache_interface.pl
+
+test_done
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
new file mode 100755
index 0000000..9714f72
--- /dev/null
+++ b/t/t9503/test_cache_interface.pl
@@ -0,0 +1,84 @@
+#!/usr/bin/perl
+use lib (split(/:/, $ENV{GITPERLLIB}));
+
+use warnings;
+use strict;
+
+use Test::More;
+use Data::Dumper;
+
+# test source version; there is no installation target for gitweb
+my $cache_pm = "$ENV{TEST_DIRECTORY}/../gitweb/cache.pm";
+
+unless (-f "$cache_pm") {
+	plan skip_all => "$cache_pm not found";
+}
+
+# it is currently not a proper Perl module, so we use 'do FILE'
+# instead of: BEGIN { use_ok('GitwebCache::SimpleFileCache') }
+my $return = do "$cache_pm";
+ok(!$@,              "parse gitweb/cache.pm")
+	or diag("parse error:\n", $@);
+ok(defined $return,  "do    gitweb/cache.pm");
+ok($return,          "run   gitweb/cache.pm");
+
+
+# Test creating a cache
+#
+my $cache = new_ok('GitwebCache::SimpleFileCache',
+	[ { 'cache_root' => 'cache', 'cache_depth' => 2 } ]);
+
+# Test that default values are defined
+#
+ok(defined $GitwebCache::SimpleFileCache::DEFAULT_CACHE_ROOT,
+	'$DEFAULT_CACHE_ROOT defined');
+ok(defined $GitwebCache::SimpleFileCache::DEFAULT_CACHE_DEPTH,
+	'$DEFAULT_CACHE_DEPTH defined');
+
+# Test accessors and default values for cache
+#
+SKIP: {
+	skip 'default values not defined', 3
+		unless ($GitwebCache::SimpleFileCache::DEFAULT_CACHE_ROOT &&
+		        $GitwebCache::SimpleFileCache::DEFAULT_CACHE_DEPTH);
+
+	is($cache->get_namespace(), '', "default namespace is ''");
+	is($cache->get_root(), $GitwebCache::SimpleFileCache::DEFAULT_CACHE_ROOT,
+		"default cache root is '$GitwebCache::SimpleFileCache::DEFAULT_CACHE_ROOT'");
+	cmp_ok($cache->get_depth(), '==', $GitwebCache::SimpleFileCache::DEFAULT_CACHE_DEPTH,
+		"default cache depth is $GitwebCache::SimpleFileCache::DEFAULT_CACHE_DEPTH");
+}
+
+# Test the getting, setting, and removal of a cached value
+# (Cache::Cache interface)
+#
+my $key = 'Test Key';
+my $value = 'Test Value';
+can_ok($cache, qw(get set remove));
+#ok(!defined($cache->get($key)),        'get before set')
+#	or diag("get returned '", $cache->get($key), "' for $key");
+$cache->set($key, $value);
+is($cache->get($key), $value,          'get after set, returns cached value');
+cmp_ok($cache->get_size($key), '>', 0, 'get_size after set, is greater than 0');
+$cache->remove($key);
+ok(!defined($cache->get($key)),        'get after remove, is undefined');
+eval { $cache->remove('Not-Existent Key'); };
+ok(!$@,                                'remove on non-existent key doesn\'t die');
+
+# Test the getting and setting of a cached value
+# (CHI interface)
+#
+my $call_count = 0;
+sub get_value {
+	$call_count++;
+	return $value;
+}
+can_ok($cache, qw(compute));
+is($cache->compute($key, \&get_value), $value, 'compute 1st time (set)');
+is($cache->compute($key, \&get_value), $value, 'compute 2nd time (get)');
+is($cache->compute($key, \&get_value), $value, 'compute 3rd time (get)');
+cmp_ok($call_count, '==', 1, 'get_value() is called once from compute');
+
+done_testing();
+
+print Dumper($cache);
diff --git a/t/test-lib.sh b/t/test-lib.sh
index afd3053..a13bdd5 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -453,6 +453,9 @@ test_external () {
 		# Announce the script to reduce confusion about the
 		# test output that follows.
 		say_color "" " run $test_count: $descr ($*)"
+		# Export TEST_DIRECTORY and TRASH_DIRECTORY
+		# to be able to use them in script
+		export TEST_DIRECTORY TRASH_DIRECTORY
 		# Run command; redirect its stderr to &4 as in
 		# test_run_, but keep its stdout on our stdout even in
 		# non-verbose mode.
-- 
1.6.6.1


* [RFC PATCHv2 03/10] gitweb/cache.pm - Stat-based cache expiration
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 01/10] gitweb: href(..., -path_info => 0|1) Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 02/10] gitweb/cache.pm - Very simple file based caching Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatible (get, set) output caching Jakub Narebski
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

Add stat-based cache expiration to the file-based GitwebCache::SimpleFileCache.
Contrary to the way other caching interfaces such as Cache::Cache and CHI
do it, the time after which a cache element expires is a _global_ value
associated with the cache instance, and not a local property of the cache
entry.  (Currently a cache entry does not store any metadata associated
with the entry... which means that there is no need for serialization /
marshaling / freezing and thawing.)  The default expire time is -1, which
means never expire.

To check whether a cache entry is expired, GitwebCache::SimpleFileCache
compares the difference between the mtime (last modification time) of the
cache file and the current time with the (global) time to expire.  This is
done in the CHI-compatible is_valid() method.

Add some tests checking that expiring works correctly (on the level of API).

To be implemented (from original patch by J.H.):
* actually using this cache in gitweb, except error pages
* adaptive cache expiration, based on average system load
* optional locking interface, where only one process can update cache
  (using flock)
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Now that the caching engine supports cache expiration, we can add caching
support to gitweb (which the next patch in the series does).

Differences from v1:
* Showing a diagnostic if there were parse errors in gitweb/cache.pm
  was moved to the previous commit.
* The ->get_expires_in() and ->set_expires_in($duration) accessors are
  generated rather than written by hand (reducing repetition of very
  similar code).
* Test that a value is not expired with an expiration time of 1 day,
  rather than only with 'never expire' ('expires_in' == -1).

Differences from relevant parts of J.H. patch:
* It simply uses stat on last accessed file (checked for existence),
  instead of opening file for reading (without error detection!), running
  stat on it, and then closing it.
* One can use expire time of -1 (or to be more exact less than 0) to set
  expire time to never (cache is considered fresh forever).
* There are some tests of cache expiration in t9503 (one of them assumes
  that an entry with an expire time of one day is not yet expired when it is
  fetched right after being set).
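
To illustrate the new API (this mirrors the t9503 additions below):

  $cache->set_expires_in(60*60*24);  # entries are valid for one day
  $cache->set($key, $value);
  $cache->get($key);                 # returns $value (not expired yet)

  $cache->set_expires_in(0);         # every entry is immediately stale
  $cache->get($key);                 # returns undef

  $cache->set_expires_in(-1);        # negative value means never expire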

 gitweb/cache.pm                 |   34 ++++++++++++++++++++++++++++++++--
 t/t9503/test_cache_interface.pl |   10 ++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index aebab01..b59509f 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -52,6 +52,10 @@ our $DEFAULT_CACHE_ROOT = "cache";
 #    The number of subdirectories deep to cache object item.  This should be
 #    large enough that no cache directory has more than a few hundred objects.
 #    Defaults to $DEFAULT_CACHE_DEPTH (2) unless explicitly set.
+#  * 'default_expires_in' (Cache::Cache compatible),
+#    'expires_in' (CHI compatible) [seconds]
+#    The expiration time for objects placed in the cache.
+#    Defaults to -1 (never expire) if not explicitly set.
 sub new {
 	my ($proto, $p_options_hash_ref) = @_;
 
@@ -59,7 +63,7 @@ sub new {
 	my $self  = {};
 	$self = bless($self, $class);
 
-	my ($root, $depth, $ns);
+	my ($root, $depth, $ns, $expires_in);
 	if (defined $p_options_hash_ref) {
 		$root  =
 			$p_options_hash_ref->{'cache_root'} ||
@@ -68,14 +72,19 @@ sub new {
 			$p_options_hash_ref->{'cache_depth'} ||
 			$p_options_hash_ref->{'depth'};
 		$ns    = $p_options_hash_ref->{'namespace'};
+		$expires_in =
+			$p_options_hash_ref->{'default_expires_in'} ||
+			$p_options_hash_ref->{'expires_in'};
 	}
 	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
 	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
 	$ns    = '' unless defined($ns);
+	$expires_in = -1 unless defined($expires_in); # <0 means never
 
 	$self->set_root($root);
 	$self->set_depth($depth);
 	$self->set_namespace($ns);
+	$self->set_expires_in($expires_in);
 
 	return $self;
 }
@@ -85,7 +94,7 @@ sub new {
 
 # http://perldesignpatterns.com/perldesignpatterns.html#AccessorPattern
 
-foreach my $i (qw(depth root namespace)) {
+foreach my $i (qw(depth root namespace expires_in)) {
 	my $field = $i;
 	no strict 'refs';
 	*{"get_$field"} = sub {
@@ -232,6 +241,26 @@ sub remove {
 	unlink($file);
 }
 
+# exists in cache and is not expired
+sub is_valid {
+	my ($self, $key) = @_;
+
+	my $path = $self->path_to_key($key);
+
+	# does the file exist in the cache?
+	return 0 unless -f $path;
+
+	# expire time can be set to never
+	my $expires_in = $self->get_expires_in();
+	return 1 unless (defined $expires_in && $expires_in >= 0);
+
+	# is file expired?
+	my $mtime = (stat(_))[9];
+	my $now = time();
+
+	return (($now - $mtime) < $expires_in);
+}
+
 # Getting and setting
 
 sub set {
@@ -245,6 +274,7 @@ sub set {
 sub get {
 	my ($self, $key) = @_;
 
+	return undef unless $self->is_valid($key);
 	my $data = $self->fetch($key)
 		or return undef;
 
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 9714f72..8700b71 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -79,6 +79,16 @@ is($cache->compute($key, \&get_value), $value, 'compute 2nd time (get)');
 is($cache->compute($key, \&get_value), $value, 'compute 3rd time (get)');
 cmp_ok($call_count, '==', 1, 'get_value() is called once from compute');
 
+# Test cache expiration (first with 1 day expire time, then 'expire now')
+#
+$cache->set_expires_in(60*60*24); # set expire time to 1 day
+cmp_ok($cache->get_expires_in(), '>', 0, '"expires in" is greater than 0');
+is($cache->get($key), $value,            'get returns cached value (not expired)');
+$cache->set_expires_in(0);
+is($cache->get_expires_in(), 0,          '"expires in" is set to now (0)');
+$cache->set($key, $value);
+ok(!defined($cache->get($key)),          'cache is expired');
+
 done_testing();
 
 print Dumper($cache);
-- 
1.6.6.1


* [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatible (get, set) output caching
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (2 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 03/10] gitweb/cache.pm - Stat-based cache expiration Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-10  1:12   ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 05/10] gitweb/cache.pm - Adaptive cache expiration time Jakub Narebski
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

This commit actually adds output caching to gitweb, as we now have the
minimal features required for it in GitwebCache::SimpleFileCache
(a 'dumb' but fast file-based cache engine).  To enable the cache you need
to at least set $caching_enabled to true in the gitweb config, and copy
cache.pm from gitweb/ alongside gitweb.cgi - this is described in more
detail in the new "Gitweb caching" section in gitweb/README.

Currently the cache support subroutines in cache.pm (the ones outside
the GitwebCache::SimpleFileCache package) are not well separated from the
gitweb script itself; cache.pm lacks encapsulation.  cache.pm assumes
that there are an href() subroutine and an %actions variable, that
$actions{$action} exists (where $action is the parameter passed to
cache_fetch), and that it is a code reference (see also comments in
t/t9503/test_cache_interface.pl).  This is a remaining artifact from the
original patch by J.H. (which also had a cache_fetch() subroutine).

Gitweb itself directly uses only cache_fetch, to get a page from the cache
or to generate a page and save it to the cache, and cache_stop, which is
used in the die_error subroutine, as currently error pages are not cached.
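
Roughly, the gitweb side then boils down to something like the following
(a sketch only; the exact call site is in the gitweb.perl part of this
patch):

  # instead of invoking the action handler directly
  if ($caching_enabled) {
      cache_fetch($cache, $action);  # print cached or freshly generated page
  } else {
      $actions{$action}->();
  }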

The cache_fetch subroutine captures output (from STDOUT only, as
STDERR is usually logged) using either ->push_layer()/->pop_layer()
from the PerlIO::Util module (if it is available), or by setting and
restoring *STDOUT.  Note that only the former could be verified to work
reliably in the t9503 test!

Enabling caching causes the following additional changes to gitweb
output:
* Disables content-type negotiation (choosing between the 'text/html'
  mimetype and 'application/xhtml+xml') when caching, as no content-type
  negotiation is done when retrieving a page from the cache.  The
  'text/html' mimetype, which all browsers can handle, is used instead.
* Disables timing info (how much time it took to generate the original
  page, and how many git commands it took), and in its place shows when
  the page was originally generated (in the GMT / UTC timezone).

Add basic tests of caching support to the t9500-gitweb-standalone-no-errors
test: set $caching_enabled to true and check for errors on the first
run (generating the cache) and the second run (retrieving from the cache)
for a single view - the summary view of a project.

If PerlIO::Util is available (see comments), test that cache_fetch
behaves correctly, namely that it saves and restores the action's output in
the cache, and that it prints either the generated output or the cached
output.


To be implemented (from original patch by J.H.):
* adaptive cache expiration, based on average system load
* optional locking interface, where only one process can update cache
  (using flock)
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Differences from v1:
* Commit message is present

* When caching is enabled gitweb outputs the time when the page was
  generated in place of the timing info (the time it took to generate the
  page).  In the previous version the 'timed' feature wasn't supported (it
  was disabled) when caching was turned on.

  The 'timed' feature is turned off when caching is enabled for 'blame_data'
  (used by the 'blame_incremental' view).

* Capturing gitweb output is done without the need to modify gitweb to
  always print explicitly to an $out filehandle.  Turning caching off in
  die_error() is now done using the cache_stop() subroutine from cache.pm.
  See also the description of capturing output below.

* Description of $caching_enabled variable in "Gitweb config file variables"
  section in gitweb/README and adding "Gitweb caching" section.

* No need for changing $/, _input_ record separator, when printing (output).

* $cache can now be set not only to a caching engine object (an initialized
  instance), but also to a class name.

  Pass Cache::Cache ('default_expires_in') and CHI ('root_dir', 'depth')
  compatible aliases for cache options to $cache->new(), which is
  invoked when $cache is undefined (GitwebCache::SimpleFileCache is used)
  or when it is a class name.

* You can override cache options in $GITWEB_CONFIG by modifying
  %cache_options hash.  All cache options are described in detail in
  comments.

* Gitweb dies with an appropriate error message when caching is enabled but
  $cache_pm cannot be found.  (Note that if the change that automatically
  escapes error messages in die_error goes in, this would have to be changed
  to not escape $cache_pm in the error message in the caller.)

* t/t9500-gitweb-standalone-no-errors.sh also tests the case when data
  is retrieved from the cache, not only when it is saved to the cache (for
  that, the expire time is set to never expire).

* t9503 also tests the cache_fetch() subroutine, using the same capturing
  mechanism as cache.pm / cache_fetch() itself to check whether it produces
  correct output.  As noted in the comments, I could only make the tests
  work for the situation where both the tests and cache.pm use the
  PerlIO::Util module (not among the Perl core modules).

  The need for a mockup of gitweb's href() and %actions shows that
  cache_fetch() is not properly encapsulated / separated from the inner
  workings of the gitweb code.


Differences from relevant parts of J.H. patch:
* The cache_fetch() subroutine is much, much simpler.  Well, it lacks most
  of the features of the original cache_fetch() by J.H.: it doesn't have
  adaptive cache lifetime, nor locking to prevent the 'stampeding herd'
  problem, nor serving stale data when waiting for cache regeneration, nor
  background data generation, nor an activity indicator... but cache_fetch()
  itself doesn't change much as those features are added, since they are
  added mainly via methods and subroutines that cache_fetch() calls.

* Capturing gitweb output is done without the need to modify gitweb to
  either save generated output into an $output variable and print it or save
  it to the cache after it is generated in full (the original J.H. patch in
  "Gitweb caching v2"), or change all print statements to print to an
  explicit filehandle which points to STDOUT if caching is disabled and to
  an in-memory file if caching is enabled (the modified J.H. patch in
  "Gitweb caching v5").  See also below.

* It introduces the $cache_pm variable, to be able to test caching in
  t/t9500-gitweb-standalone-no-errors.sh, but also to be able to install
  cache.pm in some place other than alongside gitweb.cgi.  There would be no
  such problems if we used 'require GitwebCache' or somesuch, in place of
  'do "cache.pm"' like in the original patch by J.H.

  Currently however cache.pm is not independent of gitweb.perl; it requires
  the href() subroutine and an appropriate %actions hash for dispatch.
  Therefore we must use 'do $cache_pm;' instead of e.g. 'use cache;'.

* Instead of using the "binary" (sic!) valued $cache_enable (which means a
  0 or 1 valued $cache_enable), a set of two variables is used.  The $cache
  variable can be used to select an alternate caching engine / caching
  class.  The $caching_enabled variable is used to actually enable/disable
  the cache.

* The information about the time when the page was generated is shown only
  if the 'timed' feature is enabled in the gitweb config, and it is shown in
  place of the usual time it took to generate the page (shown when caching
  is not enabled).  This means the change to gitweb/gitweb.css was not
  needed.

* cache_fetch() is run only when $caching_enabled is true.  Some of the
  cache initialization is done in gitweb.perl, and not at the beginning of
  cache_fetch().

* Cache options are contained in %cache_options hash, instead of individual
  global variables (which were using non-Perlish camelCase notation).

* There is information about caching in gitweb in gitweb/README

* There are tests of gitweb caching in t9500, and caching API in t9503


Capturing gitweb output
=======================
When output (response) caching is enabled, the caching mechanism has to
capture gitweb output (perhaps while printing it to standard output) to
save it to the cache, unless the data is available in the cache and not
expired.

Because die_error uses 'exit', and because it uses git_header_html and
other printing subroutines (whose output has to be captured in other
situations), it needs to disable caching, unless we are "tee"-ing.
Otherwise we would get no output from die_error (because it is captured),
but also we would not get data to be saved in the cache and printed,
because the 'exit' in die_error would exit the capture, too.  This
restricts the methods and modules that can be used to capture output.

Below are the various considered and used ways of capturing the output,
or "tee"-ing it (capturing while printing), with their advantages and
disadvantages.


Capturing output (capture)
~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Modify gitweb to save its output to a variable before printing it, which
   means changing 'print <sth>' to '$output .= <sth>', and 'printf <sth>'
   to '$output .= sprintf <sth>'.

   This was the approach taken in the last patch in the "Gitweb caching v2"
   series by J.H.  It has the disadvantage of requiring a large patch to
   gitweb; it also makes gitweb output data only after it is generated in
   full, even if caching is disabled (currently data is streamed).  It also
   requires taking care of printing in die_error and after running the
   action, and using the correct mode (':utf8' or ':raw' aka binary mode)
   both when printing to the cache file, and when printing to STDOUT in
   die_error.

   This approach got superseded by the next method.


2. Modify gitweb to print to an explicit filehandle (instead of implicitly
   to STDOUT), which means changing 'print <sth>' to 'print $out <sth>' (or
   to 'print {$out} <sth>'), 'printf <sth>' to 'printf $out <sth>', and
   lastly 'binmode STDOUT, <mode>' to 'binmode $out, <mode>'.

   This was the approach taken in the previous version of this series, and
   also in the last patch in "Gitweb caching v5" by J.H.  It has the
   disadvantage of requiring a large patch to gitweb, but on the other hand
   the base change itself is just a simple mechanical change (plus an
   optional whitespace-only reindent change, and optional re-wrapping of
   too-long lines).  It is very easy to capture output: just set $out to an
   in-memory file opened for writing (or even point $out directly to the
   cache file / a temporary file).  It is also easy to stop capturing: just
   set $out to \*STDOUT (or *STDOUT).

   Should we choose this method (now that it is shown that you can do
   without changes to gitweb, and what then needs to be done to capture
   output)?


3. The recommended module for capturing output in Perl is Capture::Tiny.  It
   has a nice and simple API: to capture output you use 'capture \&code',
   which is prototyped to allow it to be called in block form:

     ($stdout, $stderr) = capture {
          # your code here ...
     };

   It is a minor annoyance that it captures all output also in scalar
   context, i.e. when called as '$stdout = capture \&code;', so one would
   need to 'print STDERR $stderr;' after capture.  But what is more
   important is the fact that it doesn't offer any mechanism to turn off
   capturing inside called code... which is required for die_error, at least
   the way it works now (see above).

   Nevertheless the API is clean and I think worth emulating, which I did
   with the cache_capture(&) subroutine in cache.pm.  Also the problems
   vanish in "tee" mode, i.e. for '$stdout = tee \&code;', see the next
   subsection.
   
   Capture::Tiny distribution is only 28K packed.  There exist
   perl-Capture-Tiny RPM for Fedora:
     http://spreadsheets.google.com/pub?key=tVoSlaYU1SGovwwjV0fqt9Q&output=html


4. I found PerlIO::Util as a way of capturing output while searching for a
   way to tee output: the PerlIO::Util distribution contains the PerlIO::tee
   module.  Actually a 'scalar' layer to write to an in-memory file is
   implemented in PerlIO, which is a core Perl module.  Unfortunately PerlIO
   lacks an API for adding a layer which needs parameters.  You can open an
   in-memory file using

     open my $data_fh, '>', \$data

   and you can request using a given output layer with e.g. '>:utf8' as the
   mode, or add a layer with binmode, e.g. 'binmode $out, ":utf8";', but you
   can't add a layer which requires an extra parameter that is not a string
   (like in the case of the ':encoding(<encoding>)' layer).  So you can't
   just add a scalar layer to STDOUT to redirect it to an in-memory file
   (to a scalar variable).  For that you need PerlIO::Util.

   The PerlIO manpage talks about ':pop' pseudo layer, and how "a more
   elegant (and safer) interface is needed." - PerlIO::Util is such
   interface.

   To start capturing output to a scalar variable (in-memory file) with
   PerlIO::Util, you just need to

     *STDOUT->push_layer(scalar => \$data);

   To stop capturing you do

     *STDOUT->pop_layer();

   Pushing yet another layer works as it should, therefore my tests of the
   cache_fetch() subroutine in t9503 require PerlIO::Util (and do not run if
   it is not present) - I couldn't get the test to work with a capture
   within a capture using the method described below.  (There is a
   Test::Output module that could have been used in the tests, or I could
   have used Capture::Tiny.)

   I haven't found an RPM package for PerlIO::Util (but nowadays I install
   modules from CPAN using local::lib and the cpan client).


5. Without the 'print <sth>' to 'print $out <sth>' change to gitweb, one
   can try manipulating *STDOUT directly, first saving *STDOUT or \*STDOUT
   to some variable, then setting *STDOUT = $data_fh, where $data_fh is a
   filehandle opened to an in-memory file.

   This seems to work, does not require a large patch to gitweb, and does
   not require an extra (non-core) Perl module.  Nevertheless it seems
   fragile with respect to restoring STDOUT, and even though basic tests
   (not included) of this mechanism produced the expected result, I wasn't
   able to write working tests when using this method.

   We could probably examine how Capture::Tiny does it, and reuse (copy) the
   relevant parts of its code, perhaps duplicating STDOUT, closing it and
   then reopening it as an in-memory filehandle.

   YMMV (Your Mileage May Vary).
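
   A minimal sketch of this approach (roughly what cache_start() and
   cache_stop() in the cache.pm part of this patch do when PerlIO::Util
   is not available):

     our $saved_stdout = *STDOUT;   # remember the real STDOUT
     my $data;
     open my $data_fh, '>', \$data
         or die "Can't open memory file: $!";
     binmode $data_fh, ':utf8';     # match "binmode STDOUT, ':utf8'"
     *STDOUT = $data_fh;            # start capturing

     # ... generate gitweb output ...

     close $data_fh;
     *STDOUT = $saved_stdout;       # stop capturing; $data holds the output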


Saving output while printing (tee)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of capturing gitweb output (except for die_error, which displays an
error and exits), and then saving it to the cache and displaying it, we can
capture (save) output while it is being displayed, somewhat similar to what
the tee(1) UNIX (coreutils) command does.

This has the advantage that, for the client that generates the data (see the
later commit in this series about using locking to avoid the cache miss
stampede problem), it serves as an activity indicator by itself.  On the
other hand, achieving this effect in the other processes that wait for the
data to be generated would be, I think, quite complicated (a `tail -f' like
behavior).

Because the original patch by J.H. (both versions) did capturing then
displaying, "tee"-ing data was not added to this series.  Nevertheless this
is something to consider for the future.

1. While using 'capture' from Capture::Tiny would have difficulties (if it
   was at all possible) for gitweb as it is now, using 'tee' from the same
   module would be quite simple:

     $data = tee {
          # generate gitweb output
     };
     # save $data to cache

   die_error would not be a problem in this situation: data would be
   displayed, even if not saved to cache.

2. One of the easiest ways of adding output (response) caching to a CGI
   script in Perl is, in my opinion, using the CGI::Cache module.  Modifying
   a CGI.pm script (like gitweb) to make use of CGI::Cache requires adding
   only a few lines to the script source.

   While relying on a non-core module (which I have not found an RPM for)
   makes this solution less feasible, its use of the slow
   Cache::SizeAwareFileCache (from Cache::Cache) makes it not a good fit for
   gitweb.  Also CGI::Cache requires Tie::Restore to be able to restore ties
   to the filehandles it watches... something that gitweb doesn't need.

   Nevertheless we can borrow some code from CGI::Cache, replacing the cache
   engine with something more efficient.  CGI::Cache is GPL licensed.

   Internally, the CGI::Cache module ties the output file descriptor
   (usually STDOUT) to an internal variable to which all output is saved.
   And that leads us to the next method.

3. tie the filehandle (*STDOUT or $out) so that when we print to the
   filehandle, it also saves the data to a scalar variable (or duplicates it
   to another filehandle).  See perltie(1) and Tie::Handle; a rough sketch
   is shown after this list.

4. Multiplexing output stream like tee(1) with PerlIO::tee from PerlIO::Util
   is very simple: to start saving output while it is being printed, just do

     *STDOUT->push_layer(tee => \$data);

   The destination, besides a scalar reference as above, may be a file
   name, or a filehandle.  To stop multiplexing, just do

     *STDOUT->pop_layer();

   I haven't found RPM module for PerlIO::Util (which includes PerlIO::tee).
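
A rough sketch of the tie-based approach from point 3 above (untested; a
real implementation would also need PRINTF, WRITE, BINMODE etc., see
perltie(1); the package name 'GitwebTee' is made up for this example):

     package GitwebTee;
     sub TIEHANDLE {
         my ($class, $fh, $data_ref) = @_;
         return bless { 'fh' => $fh, 'data' => $data_ref }, $class;
     }
     sub PRINT {
         my $self = shift;
         ${$self->{'data'}} .= join('', @_);  # save a copy of the output...
         print {$self->{'fh'}} @_;            # ...while passing it through
     }

     package main;
     my $data = '';
     open my $real_stdout, '>&', \*STDOUT or die "Can't dup STDOUT: $!";
     tie *STDOUT, 'GitwebTee', $real_stdout, \$data;
     # ... generate output; then untie *STDOUT and save $data to the cache ...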


 gitweb/README                          |   70 +++++++++++++++++++++
 gitweb/cache.pm                        |   78 +++++++++++++++++++++++
 gitweb/gitweb.perl                     |  106 ++++++++++++++++++++++++++++----
 t/gitweb-lib.sh                        |    2 +
 t/t9500-gitweb-standalone-no-errors.sh |   19 ++++++
 t/t9503/test_cache_interface.pl        |   93 ++++++++++++++++++++++++++++
 6 files changed, 355 insertions(+), 13 deletions(-)

diff --git a/gitweb/README b/gitweb/README
index 6c2c8e1..53759fc 100644
--- a/gitweb/README
+++ b/gitweb/README
@@ -233,6 +233,11 @@ not include variables usually directly set during build):
    If server load exceed this value then return "503 Service Unavaliable" error.
    Server load is taken to be 0 if gitweb cannot determine its value.  Set it to
    undefined value to turn it off.  The default is 300.
+ * $caching_enabled
+   If true, gitweb will use caching to speed up generating responses.
+   Currently only output (response) caching is supported.  See the "Gitweb
+   caching" section below for details on how to configure and customize
+   caching.  The default is false (caching is disabled).
 
 
 Projects list file format
@@ -305,6 +310,71 @@ You can use the following files in repository:
    descriptions.
 
 
+Gitweb caching
+~~~~~~~~~~~~~~
+
+Currently gitweb supports only output (HTTP response) caching, similar
+to the one used on git.kernel.org.  To turn it on, set $caching_enabled
+variable to true value in gitweb config file, i.e.:
+
+   our $caching_enabled = 1;
+
+You can choose which caching engine gitweb should use by setting the $cache
+variable either to an _initialized_ instance of a cache interface, e.g.:
+
+   use CHI;
+   our $cache = CHI->new( driver => 'Memcached',
+   	servers => [ "10.0.0.15:11211", "10.0.0.15:11212" ],
+   	l1_cache => { driver => 'FastMmap', root_dir => '/var/cache/gitweb' }
+   );
+
+Alternatively you can set the $cache variable to the name of a cache class,
+in which case the caching engine should support Cache::Cache or CHI names
+for cache config options (see below), and ignore unrecognized options, e.g.:
+
+   use Cache::FileCache;
+   our $cache = 'Cache::FileCache';
+
+Such a caching engine should implement (at least) the ->get($key) and
+->set($key, $data) methods (a Cache::Cache and CHI compatible interface).
+
+If $cache is left unset (if it is left undefined), then gitweb will use
+GitwebCache::SimpleFileCache from cache.pm as the caching engine.  This
+engine is a 'dumb' (but fast) file based caching layer, currently without
+any support for cache size limiting, or even removing expired / grossly
+expired entries.  It therefore has the downside of requiring a huge amount
+of disk space if there are a number of repositories involved.  It is not
+uncommon for git.kernel.org to have on the order of 80G - 120G accumulate
+over the course of a few months.  It is therefore recommended that the
+cache directory be periodically completely deleted; this is safe to perform.
+Suggested mechanism (substitute $cachedir for the actual path to the gitweb cache):
+
+   # mv $cachedir $cachedir.flush && mkdir $cachedir && rm -rf $cachedir.flush
+
+For gitweb to use caching it must find the 'cache.pm' file, which contains
+GitwebCache::SimpleFileCache and cache-related subroutines, of which
+cache_fetch and cache_stop are used in gitweb itself.  The location of the
+'cache.pm' file is given by the $cache_pm variable; if it is a relative path,
+it is relative to the directory gitweb is run from.  The default value of
+$cache_pm assumes that 'cache.pm' is copied to the same directory as
+'gitweb.cgi'.
+
+Currently 'cache.pm' is not a proper Perl module, because it is not
+encapsulated / not separated from the details of gitweb.  That is why it
+is sourced using 'do "$cache_pm"', rather than with the "use" or "require"
+operators.
+
+Site-wide cache options are defined in the %cache_options hash.  Those options
+apply only when $cache is unset (GitwebCache::SimpleFileCache is used), or
+when $cache is the name of a cache class (e.g. $cache = 'Cache::FileCache').
+You can override cache options in the gitweb config, e.g.:
+
+   $cache_options{'expires_in'} = 60; # 60 seconds = 1 minute
+
+Please read the comments for %cache_options entries in gitweb/gitweb.perl for
+a description of the available cache options.
+
+
 Webserver configuration
 -----------------------
 
diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index b59509f..64c333b 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -296,4 +296,82 @@ sub compute {
 1;
 } # end of package GitwebCache::SimpleFileCache;
 
+# human readable key identifying gitweb output
+sub gitweb_output_key {
+	return href(-replay => 1, -full => 1, -path_info => 0);
+}
+
+
+our $perlio_util = eval { require PerlIO::Util; 1 };
+our $STDOUT = *STDOUT; #our $STDOUTref = \*STDOUT;
+our $data_fh;
+
+# Start caching data printed to STDOUT
+sub cache_start {
+	my $data_ref = shift;
+
+	if ($perlio_util) {
+		*STDOUT->push_layer(scalar => $data_ref);
+
+	} else {
+		open $data_fh, '>', $data_ref
+			or die "Can't open memory file: $!";
+		# matches "binmode STDOUT, ':utf8'" at beginning
+		binmode $data_fh, ':utf8';
+		*STDOUT = $data_fh;
+
+	}
+}
+
+# Stop caching data (required for die_error)
+sub cache_stop {
+
+	if ($perlio_util) {
+		*STDOUT->pop_layer()
+			if ((*STDOUT->get_layers())[-1] eq 'scalar');
+
+	} else {
+		close $data_fh
+			or die "Error closing memory file: $!";
+		*STDOUT = $STDOUT;
+
+	}
+}
+
+# Wrap caching data; capture only STDOUT
+sub cache_capture (&) {
+	my $code = shift;
+	my $data;
+
+	cache_start(\$data);
+	$code->();
+	cache_stop();
+
+	return $data;
+}
+
+sub cache_fetch {
+	my ($cache, $action) = @_;
+
+	my $key = gitweb_output_key();
+	my $data = $cache->get($key);
+
+	if (defined $data) {
+		# print cached data
+		binmode STDOUT, ':raw';
+		print STDOUT $data;
+
+	} else {
+		$data = cache_capture {
+			$actions{$action}->();
+		};
+
+		if (defined $data) {
+			$cache->set($key, $data);
+			binmode STDOUT, ':raw';
+			print STDOUT $data;
+		}
+	}
+}
+
 1;
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 97ea3ec..f02ead9 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -227,6 +227,44 @@ our %avatar_size = (
 # Leave it undefined (or set to 'undef') to turn off load checking.
 our $maxload = 300;
 
+# This enables/disables the caching layer in gitweb.  Currently supported
+# is only output (response) caching, similar to the one used on git.kernel.org.
+our $caching_enabled = 0;
+# Set to _initialized_ instance of cache interface implementing (at least)
+# get($key) and set($key, $data) methods (Cache::Cache and CHI interfaces).
+# If unset, GitwebCache::SimpleFileCache would be used, which is 'dumb'
+# (but fast) file based caching layer, currently without any support for
+# cache size limiting.  It is therefore recommended that the cache directory
+# be periodically completely deleted; this operation is safe to perform.
+# Suggested mechanism:
+# mv $cachedir $cachedir.flush && mkdir $cachedir && rm -rf $cachedir.flush
+our $cache;
+# Location of the 'cache.pm' file; if it is a relative path, it is relative to
+# the directory gitweb is run from
+our $cache_pm = 'cache.pm';
+# You define site-wide cache options defaults here; override them with
+# $GITWEB_CONFIG as necessary.
+our %cache_options = (
+	# The location in the filesystem that will hold the root of the cache.
+	# This directory will be created as needed (if possible) on the first
+	# cache set.  Note that either this directory must exist and the web server
+	# must have write permissions to it, or the web server must be able to
+	# create this directory.
+	# Possible values:
+	# * 'cache' (relative to gitweb),
+	# * File::Spec->catdir(File::Spec->tmpdir(), 'gitweb-cache'),
+	# * '/var/cache/gitweb' (FHS compliant, requires being set up),
+	'cache_root' => 'cache',
+	# The number of subdirectories deep to cache object item.  This should be
+	# large enough that no cache directory has more than a few hundred
+	# objects.  Each non-leaf directory contains up to 256 subdirectories
+	# (00-ff).  Must be larger than 0.
+	'cache_depth' => 1,
+	# The (global) expiration time for objects placed in the cache, in seconds.
+	'expires_in' => 20,
+);
+
+
 # You define site-wide feature defaults here; override them with
 # $GITWEB_CONFIG as necessary.
 our %feature = (
@@ -964,7 +1002,34 @@ if ($action !~ m/^(?:opml|project_list|project_index)$/ &&
     !$project) {
 	die_error(400, "Project needed");
 }
-$actions{$action}->();
+
+if ($caching_enabled) {
+	die_error(500, 'Caching enabled and "'.esc_path($cache_pm).'" not found')
+		unless -f $cache_pm;
+	do $cache_pm;
+	die $@ if $@;
+
+	# $cache might be initialized (instantiated) cache, i.e. cache object,
+	# or it might be name of class, or it might be undefined
+	unless (defined $cache && ref($cache)) {
+		$cache ||= 'GitwebCache::SimpleFileCache';
+		$cache = $cache->new({
+			%cache_options,
+			#'cache_root' => '/tmp/cache',
+			#'cache_depth' => 2,
+			#'expires_in' => 20, # in seconds (CHI compatible)
+			# (Cache::Cache compatible initialization)
+			'default_expires_in' => $cache_options{'expires_in'},
+			# (CHI compatible initialization)
+			'root_dir' => $cache_options{'cache_root'},
+			'depth' => $cache_options{'cache_depth'},
+		});
+	}
+	cache_fetch($cache, $action);
+} else {
+	$actions{$action}->();
+}
+
 exit;
 
 ## ======================================================================
@@ -3169,7 +3234,9 @@ sub git_header_html {
 	# 'application/xhtml+xml', otherwise send it as plain old 'text/html'.
 	# we have to do this because MSIE sometimes globs '*/*', pretending to
 	# support xhtml+xml but choking when it gets what it asked for.
-	if (defined $cgi->http('HTTP_ACCEPT') &&
+	# Disable content-type negotiation when caching (use mimetype good for all).
+	if (!$caching_enabled &&
+	    defined $cgi->http('HTTP_ACCEPT') &&
 	    $cgi->http('HTTP_ACCEPT') =~ m/(,|;|\s|^)application\/xhtml\+xml(,|;|\s|$)/ &&
 	    $cgi->Accept('application/xhtml+xml') != 0) {
 		$content_type = 'application/xhtml+xml';
@@ -3342,17 +3409,25 @@ sub git_footer_html {
 	}
 	print "</div>\n"; # class="page_footer"
 
-	if (defined $t0 && gitweb_check_feature('timed')) {
+	# timing info doesn't make much sense with output (response) caching,
+	# so when caching is enabled gitweb prints the time of page generation
+	if ((defined $t0 || $caching_enabled) &&
+	    gitweb_check_feature('timed')) {
 		print "<div id=\"generating_info\">\n";
-		print 'This page took '.
-		      '<span id="generating_time" class="time_span">'.
-		      Time::HiRes::tv_interval($t0, [Time::HiRes::gettimeofday()]).
-		      ' seconds </span>'.
-		      ' and '.
-		      '<span id="generating_cmd">'.
-		      $number_of_git_cmds.
-		      '</span> git commands '.
-		      " to generate.\n";
+		if ($caching_enabled) {
+			print 'This page was generated at '.
+			      gmtime( time() )." GMT\n";
+		} else {
+			print 'This page took '.
+			      '<span id="generating_time" class="time_span">'.
+			      Time::HiRes::tv_interval($t0, [Time::HiRes::gettimeofday()]).
+			      ' seconds </span>'.
+			      ' and '.
+			      '<span id="generating_cmd">'.
+			      $number_of_git_cmds.
+			      '</span> git commands '.
+			      " to generate.\n";
+		}
 		print "</div>\n"; # class="page_footer"
 	}
 
@@ -3402,6 +3477,10 @@ sub die_error {
 		500 => '500 Internal Server Error',
 		503 => '503 Service Unavailable',
 	);
+
+	# Do not cache error pages (die_error() uses 'exit')
+	cache_stop() if ($caching_enabled);
+
 	git_header_html($http_responses{$status});
 	print <<EOF;
 <div class="page_body">
@@ -5050,7 +5129,8 @@ sub git_blame_common {
 			or print "ERROR $!\n";
 
 		print 'END';
-		if (defined $t0 && gitweb_check_feature('timed')) {
+		if (!$caching_enabled &&
+		    defined $t0 && gitweb_check_feature('timed')) {
 			print ' '.
 			      Time::HiRes::tv_interval($t0, [Time::HiRes::gettimeofday()]).
 			      ' '.$number_of_git_cmds;
diff --git a/t/gitweb-lib.sh b/t/gitweb-lib.sh
index 5a734b1..b7c2937 100644
--- a/t/gitweb-lib.sh
+++ b/t/gitweb-lib.sh
@@ -27,6 +27,8 @@ our \$export_ok = '';
 our \$strict_export = '';
 our \$maxload = undef;
 
+our \$cache_pm = '$TEST_DIRECTORY/../gitweb/cache.pm';
+
 EOF
 
 	cat >.git/description <<EOF
diff --git a/t/t9500-gitweb-standalone-no-errors.sh b/t/t9500-gitweb-standalone-no-errors.sh
index 2fc7fdb..41c1119 100755
--- a/t/t9500-gitweb-standalone-no-errors.sh
+++ b/t/t9500-gitweb-standalone-no-errors.sh
@@ -639,4 +639,23 @@ test_expect_success \
 	 gitweb_run "p=.git;a=summary"'
 test_debug 'cat gitweb.log'
 
+# ----------------------------------------------------------------------
+# caching
+
+cat >>gitweb_config.perl <<\EOF
+$caching_enabled = 1;
+$cache_options{'expires_in'} = -1; # never expire cache for tests
+EOF
+rm -rf cache
+
+test_expect_success \
+	'caching enabled (project summary, first run)' \
+	'gitweb_run "p=.git;a=summary"'
+test_debug 'cat gitweb.log'
+
+test_expect_success \
+	'caching enabled (project summary, second run)' \
+	'gitweb_run "p=.git;a=summary"'
+test_debug 'cat gitweb.log'
+
 test_done
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 8700b71..42c49e9 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -7,9 +7,36 @@ use strict;
 use Test::More;
 use Data::Dumper;
 
+# Modules that could have been used to capture output for testing cache_fetch
+#use Capture::Tiny;
+#use Test::Output qw(:stdout);
+# Modules that could have been used in $cache_pm for cache_fetch
+#use IO::Capture
+
 # test source version; there is no installation target for gitweb
 my $cache_pm = "$ENV{TEST_DIRECTORY}/../gitweb/cache.pm";
 
+# ......................................................................
+
+# Set up a mockup of gitweb's subroutines and variables used in $cache_pm;
+# this must be done before loading $cache_pm.  (This probably means that
+# gitweb.perl and cache.pm are too tightly coupled.)
+sub href {
+	return 'href';
+}
+my $action = 'action';
+my $fake_action_output = <<'EOF';
+# This is data to be cached and shown
+EOF
+sub fake_action {
+	print $fake_action_output;
+}
+our %actions = (
+	$action => \&fake_action,
+);
+
+# ......................................................................
+
 unless (-f "$cache_pm") {
 	plan skip_all => "$cache_pm not found";
 }
@@ -22,6 +49,7 @@ ok(!$@,              "parse gitweb/cache.pm")
 ok(defined $return,  "do    gitweb/cache.pm");
 ok($return,          "run   gitweb/cache.pm");
 
+# ......................................................................
 
 # Test creating a cache
 #
@@ -88,6 +116,71 @@ $cache->set_expires_in(0);
 is($cache->get_expires_in(), 0,          '"expires in" is set to now (0)');
 $cache->set($key, $value);
 ok(!defined($cache->get($key)),          'cache is expired');
+$cache->set_expires_in(-1);
+
+# ......................................................................
+
+# Prepare for testing cache_fetch from $cache_pm
+my $test_perlio_util = eval { require PerlIO::Util; 1 };
+my $cached_output = <<"EOF";
+$fake_action_output# (version recovered from cache)
+EOF
+$key = gitweb_output_key();
+
+# Catch output printed by cache_fetch
+# Test all ways of capturing output in cache_fetch
+our ($perlio_util, $STDOUT);
+my ($test_data, $test_data_fh, $test_STDOUT);
+sub capture_cache_fetch_output {
+	$test_data = '' if defined $test_data;
+
+	if ($perlio_util) { # or $test_perlio_util
+		*STDOUT->push_layer(scalar => \$test_data);
+
+		cache_fetch($cache, $action);
+
+		*STDOUT->pop_layer();
+
+	} else {
+		diag("PerlIO::Util not available, not all tests run");
+		$test_STDOUT = *STDOUT;
+		open $test_data_fh, '>', \$test_data;
+		$STDOUT = *STDOUT = $test_data_fh; # $STDOUT is from $cache_pm
+
+		cache_fetch($cache, $action);
+
+		*STDOUT = $test_STDOUT;
+
+	}
+}
+
+
+# Due to some bad interaction with double capturing, both when the second
+# capture (for this test) is done using PerlIO layers (via PerlIO::Util)
+# and when it is done using *STDOUT manipulation, the tests below do not work
+# if $perlio_util is false, i.e. if cache_fetch() uses *STDOUT manipulation.
+# Earlier manual testing showed that cache_fetch()'s *STDOUT manipulation seems
+# to work all right... but this test would fail if 'if' were replaced by the
+# (currently commented out) 'for'.
+
+#for my $use_perlio_util (0..$test_perlio_util) {
+if ((my $use_perlio_util = $test_perlio_util)) {
+	$perlio_util = $use_perlio_util;
+	diag(($perlio_util ? "Use" : "Don't use")." PerlIO::Util");
+
+	# clean state
+	$cache->remove($key);
+
+	# first time (if there is no cache) generates cache entry
+	capture_cache_fetch_output();
+	is($test_data, $fake_action_output,        'action output is printed (generated)');
+	is($cache->get($key), $fake_action_output, 'action output is in cache (generated)');
+
+	# second time (if cache is set/valid) reads from cache
+	$cache->set($key, $cached_output);
+	capture_cache_fetch_output();
+	is($test_data, $cached_output,             'action output is printed (from cache)');
+}
 
 done_testing();
 
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 05/10] gitweb/cache.pm - Adaptive cache expiration time
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (3 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 06/10] gitweb: Use CHI compatibile (compute method) caching Jakub Narebski
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

Add to GitwebCache::SimpleFileCache support for adaptive lifetime
(cache expiration) control.  Cache lifetime can be increased or
decreased based on any factor, e.g. load average, by defining
the 'check_load' callback.

Note that using ->set_expires_in, or unsetting 'check_load' via
->set_check_load(undef) turns off adaptive caching.

Make gitweb automatically adjust cache lifetime by load, using
get_loadavg() function.  Define and describe default parameters for
dynamic (adaptive) cache expiration time control.
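
For example, one might put in the gitweb config file something like the
following sketch (the option names are those from %cache_options; the
values are only an illustration):

    $cache_options{'expires_min'}    = 60;             # at least 1 minute
    $cache_options{'expires_max'}    = 10*60;          # at most 10 minutes
    $cache_options{'expires_factor'} = 30;             # 30s of lifetime per 1.0 load
    $cache_options{'check_load'}     = \&get_loadavg;  # gitweb's default callback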

There are some very basic tests of dynamic expiration time in t9503,
namely checking if dynamic expire time is within given upper and lower
bounds.

To be implemented (from original patch by J.H.):
* optional locking interface, where only one process can update cache
  (using flock)
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Differences from v1:
* Proper commit message
* Adaptive cache expiration can be configured in the gitweb config file by
  changing 'expires_min', 'expires_max', 'expires_factor' in
  %cache_options.  Those parameters are described in detail.
* You can turn off adaptive cache expiration by setting 'check_load' to undef.
* By default adaptive cache expiration is turned off; there are extra checks
  for when to run the 'check_load' subroutine.
* Instead of removing ->set_expires_in(), it just sets 'expires_min' and
  'expires_max' to the same value, which turns off adaptive expiration.
* Test assertions for adaptive cache expiration:
    'expires_min' <= $cache->get_expires_in() <= 'expires_max'

Differences from relevant parts of J.H. patch:
* 'increase_factor' is configurable rather than hardcoded
* 'check_load' is passed via a constructor parameter; gitweb by default sets
  it to \&get_loadavg.  This means that cache.pm is not entangled with
  gitweb, at least for this feature.
* options are stored in %cache_options, e.g. as 'expires_max' (compatible
  with Cache::Adaptive), and not as global variables with un-Perlish
  camelCase names like $maxCacheTime.
* API tests

Possible improvements:
* Turn off adaptive cache lifetime by unsetting check_load, and not only by
  using ->set_expires_in()

 gitweb/cache.pm                 |   55 +++++++++++++++++++++++++++++++++++---
 gitweb/gitweb.perl              |   27 +++++++++++++++++-
 t/t9503/test_cache_interface.pl |   22 +++++++++++++++
 3 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index 64c333b..d81f6e8 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -63,7 +63,8 @@ sub new {
 	my $self  = {};
 	$self = bless($self, $class);
 
-	my ($root, $depth, $ns, $expires_in);
+	my ($root, $depth, $ns);
+	my ($expires_min, $expires_max, $increase_factor, $check_load);
 	if (defined $p_options_hash_ref) {
 		$root  =
 			$p_options_hash_ref->{'cache_root'} ||
@@ -72,19 +73,31 @@ sub new {
 			$p_options_hash_ref->{'cache_depth'} ||
 			$p_options_hash_ref->{'depth'};
 		$ns    = $p_options_hash_ref->{'namespace'};
-		$expires_in =
+		$expires_min =
+			$p_options_hash_ref->{'expires_min'} ||
 			$p_options_hash_ref->{'default_expires_in'} ||
 			$p_options_hash_ref->{'expires_in'};
+		$expires_max =
+			$p_options_hash_ref->{'expires_max'};
+		$increase_factor = $p_options_hash_ref->{'expires_factor'};
+		$check_load      = $p_options_hash_ref->{'check_load'};
 	}
 	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
 	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
 	$ns    = '' unless defined($ns);
-	$expires_in = -1 unless defined($expires_in); # <0 means never
+	$expires_min =   20 unless defined($expires_min);
+	$expires_max = $expires_min
+		if (!defined($expires_max) && exists $p_options_hash_ref->{'expires_in'});
+	$expires_max = 1200 unless (defined($expires_max));
+	$increase_factor = 60 unless defined($increase_factor);
 
 	$self->set_root($root);
 	$self->set_depth($depth);
 	$self->set_namespace($ns);
-	$self->set_expires_in($expires_in);
+	$self->set_expires_min($expires_min);
+	$self->set_expires_max($expires_max);
+	$self->set_increase_factor($increase_factor);
+	$self->set_check_load($check_load);
 
 	return $self;
 }
@@ -94,7 +107,7 @@ sub new {
 
 # http://perldesignpatterns.com/perldesignpatterns.html#AccessorPattern
 
-foreach my $i (qw(depth root namespace expires_in)) {
+foreach my $i (qw(depth root namespace expires_min expires_max increase_factor check_load)) {
 	my $field = $i;
 	no strict 'refs';
 	*{"get_$field"} = sub {
@@ -107,6 +120,38 @@ foreach my $i (qw(depth root namespace expires_in)) {
 	};
 }
 
+# ......................................................................
+# pseudo-accessors
+
+sub get_expires_in {
+	my ($self) = @_;
+
+	# short-circuit
+	if (!defined $self->get_check_load() ||
+	    $self->get_expires_max() <= $self->get_expires_min()) {
+		return $self->get_expires_min();
+	}
+
+	my $expires_in =
+		#$self->get_expires_min() +
+		$self->get_increase_factor() * $self->get_check_load()->();
+
+	if ($expires_in < $self->get_expires_min()) {
+		return $self->get_expires_min();
+	} elsif ($expires_in > $self->get_expires_max()) {
+		return $self->get_expires_max();
+	}
+
+	return $expires_in;
+}
+
+sub set_expires_in {
+	my ($self, $duration) = @_;
+
+	$self->set_expires_min($duration);
+	$self->set_expires_max($duration);
+}
+
 # ----------------------------------------------------------------------
 # utility functions and methods
 
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index f02ead9..aabed72 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -255,13 +255,36 @@ our %cache_options = (
 	# * File::Spec->catdir(File::Spec->tmpdir(), 'gitweb-cache'),
 	# * '/var/cache/gitweb' (FHS compliant, requires being set up),
 	'cache_root' => 'cache',
+
 	# The number of subdirectories deep to cache object item.  This should be
 	# large enough that no cache directory has more than a few hundred
 	# objects.  Each non-leaf directory contains up to 256 subdirectories
 	# (00-ff).  Must be larger than 0.
 	'cache_depth' => 1,
-	# The (global) expiration time for objects placed in the cache, in seconds.
-	'expires_in' => 20,
+
+	# The (global) minimum expiration time for objects placed in the cache,
+	# in seconds.  If the dynamic adaptive cache expiration time is lower
+	# than this number, we set cache timeout to this minimum.
+	'expires_min' => 20,   # 20 seconds
+
+	# The (global) maximum expiration time for dynamic (adaptive) caching
+	# algorithm, in seconds.  If the adaptive cache lifetime exceeds this
+	# number, we set cache timeout to this maximum.
+	# (If 'expires_min' >= 'expires_max', there is no adaptive cache timeout,
+	# and 'expires_min' is used as expiration time for objects in cache.)
+	'expires_max' => 1200, # 20 minutes
+
+	# Cache lifetime will be increased by applying this factor to the result
+	# from 'check_load' callback (see below).
+	'expires_factor' => 60, # expire time in seconds for 1.0 (100% CPU) load
+
+	# User supplied callback for deciding the cache policy, usually system
+	# load.  Multiplied by 'expires_factor' gives adaptive expiration time,
+	# in seconds, subject to the limits imposed by 'expires_min' and
+	# 'expires_max' bounds.  Set to undef (or delete) to turn off dynamic
+	# lifetime control.
+	# (Compatible with Cache::Adaptive.)
+	'check_load' => \&get_loadavg,
 );
 
 
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 42c49e9..3945adc 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -118,6 +118,28 @@ $cache->set($key, $value);
 ok(!defined($cache->get($key)),          'cache is expired');
 $cache->set_expires_in(-1);
 
+# Test assertions for adaptive cache expiration
+#
+my $load = 0.0;
+sub load { return $load; }
+my $expires_min = 10;
+my $expires_max = 30;
+$cache->set_expires_min($expires_min);
+$cache->set_expires_max($expires_max);
+$cache->set_check_load(\&load);
+cmp_ok($cache->get_expires_min(), '==', $expires_min, '"expires min" set correctly');
+cmp_ok($cache->get_expires_max(), '==', $expires_max, '"expires max" set correctly');
+cmp_ok($cache->get_expires_in(), '>=', $expires_min,
+       '"expires in" bound from down for load=0');
+cmp_ok($cache->get_expires_in(), '<=', $expires_max,
+       '"expires in" bound from up   for load=0');
+$load = 1_000;
+cmp_ok($cache->get_expires_in(), '>=', $expires_min,
+       '"expires in" bound from down for heavy load');
+cmp_ok($cache->get_expires_in(), '<=', $expires_max,
+       '"expires in" bound from up   for heavy load');
+$cache->set_expires_in(-1);
+
 # ......................................................................
 
 # Prepare for testing cache_fetch from $cache_pm
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 06/10] gitweb: Use CHI compatibile (compute method) caching
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (4 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 05/10] gitweb/cache.pm - Adaptive cache expiration time Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 07/10] gitweb/cache.pm - Use locking to avoid 'cache miss stampede' problem Jakub Narebski
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

If $cache provides CHI compatible ->compute($key, $code) method, use it
instead of Cache::Cache compatible ->get($key) and ->set($key, $data).
While at it, refactor regenerating cache into cache_calculate subroutine.

GitwebCache::SimpleFileCache provides a 'compute' method, which currently
simply uses the 'get' and 'set' methods in the prescribed manner.  Nevertheless
the 'compute' method can be more flexible in choosing when to refresh the
cache, and which process is to refresh/(re)generate a cache entry.  This
method will use (advisory) locking to prevent the 'cache miss stampede' (aka
'stampeding herd') problem in the next commit.
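
In its current form the 'compute' method amounts roughly to the following
sketch (simplified, not the exact code from the patch):

    sub compute {
        my ($self, $key, $code) = @_;

        my $data = $self->get($key);
        if (!defined $data) {
            # cache miss (or expired entry): (re)generate the data and store it
            $data = $code->($self, $key);
            $self->set($key, $data);
        }
        return $data;
    }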

The support for a $cache which does not provide a '->compute($key, $code)'
method is left in, just in case we would want to use such an (external)
caching engine.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
This commit could be moved one commit earlier in this patch series for
shortlog to look nicer (although commit message would have to be adjusted to
that fact) ;-)

Differences from v1:
* Proper commit message
* cache_calculate uses cache_capture and is therefore shorter
* Don't set $/ to undef, which applies to input, for printing (output).

This patch doesn't strictly have an equivalent in J.H. patch adding caching
to gitweb.

 gitweb/cache.pm |   39 ++++++++++++++++++++++++++++++++++++---
 1 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index d81f6e8..899a4b4 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -399,6 +399,41 @@ sub cache_fetch {
 	my ($cache, $action) = @_;
 
 	my $key = gitweb_output_key();
+	if ($cache->can('compute')) {
+		cache_fetch_compute($cache, $action, $key);
+	} else {
+		cache_fetch_get_set($cache, $action, $key);
+	}
+}
+
+# calculate data to regenerate cache
+sub cache_calculate {
+	my ($action) = @_;
+
+	my $data = cache_capture {
+		$actions{$action}->();
+	};
+
+	return $data;
+}
+
+# for $cache which can ->compute($key, $code)
+sub cache_fetch_compute {
+	my ($cache, $action, $key) = @_;
+
+	my $data = $cache->compute($key, sub { cache_calculate($action) });
+
+	if (defined $data) {
+		# print cached data
+		binmode STDOUT, ':raw';
+		print STDOUT $data;
+	}
+}
+
+# for $cache which can ->get($key) and ->set($key, $data)
+sub cache_fetch_get_set {
+	my ($cache, $action, $key) = @_;
+
 	my $data = $cache->get($key);
 
 	if (defined $data) {
@@ -407,9 +442,7 @@ sub cache_fetch {
 		print STDOUT $data;
 
 	} else {
-		$data = cache_capture {
-			$actions{$action}->();
-		};
+		$data = cache_calculate($action);
 
 		if (defined $data) {
 			$cache->set($key, $data);
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 07/10] gitweb/cache.pm - Use locking to avoid 'cache miss stampede' problem
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (5 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 06/10] gitweb: Use CHI compatibile (compute method) caching Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 08/10] gitweb/cache.pm - Serve stale data when waiting for filling cache Jakub Narebski
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

In the ->compute($key, $code) method from GitwebCache::SimpleFileCache,
use locking (via flock) to ensure that only one process would generate
data to update/fill-in cache; the rest would wait for the cache to
be (re)generated and would read data from cache.

Currently this feature cannot be disabled (via %cache_options).


A test in t9503 shows that in the case where two clients try to
simultaneously access a non-existent or stale cache entry
(and generating the data takes, artificially, a bit of time), if they are
using the ->compute method the data is (re)generated only once, as opposed
to when those clients are just using the ->get/->set methods.
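
The core of the locking scheme is roughly the following (a simplified
excerpt; error handling and lockfile directory creation are omitted):

    use Fcntl qw(:flock);

    open my $lock_fh, '+>', $lockfile;
    if (flock($lock_fh, LOCK_EX | LOCK_NB)) {
        # we got the exclusive (writers) lock: (re)generate the data
        $data = $code->($self, $key);
        $self->set($key, $data);
    } else {
        # somebody else is generating; wait for them, then read from cache
        flock($lock_fh, LOCK_SH);
        $data = $self->fetch($key);
    }
    close $lock_fh;  # closing the lockfile releases the lock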

To be implemented (from original patch by J.H.):
* background building, and showing stale cache
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Differences from v1:
* Proper commit message
* ->get_lockname($key) replaced ->_lockfile_to_key($namespace, $key)
* mkpath to ensure that path leading to lockfile exists instead of private
  _Make_Path subroutine (this is a bit of code repetition in new code).
* Testing of removing non-existent key moved to earlier commit

Differences from relevant parts of J.H. patch:
* Locking on separate *.lock file, for simpler work and robustness;
  acquiring exclusive lock might require having file opened for possible
  write, following
    "File Locking Tricks and Traps" (http://perl.plover.com/yak/flock/)
  J.H. patch used separate *.bg file for lock only for background caching 
  (to be implemented in one of next commits).
* Single variable $lock_state, in place of $lockingStatus, $lockStatBG,
  $lockStat, $lockStatus in J.H. patch.
* Use indirect filehandles for lockfiles, instead of barewords, i.e. global
  filehandles: 
    open my $lock_fh, '+>', $lockfile;
  rather than equivalent of
    open LOCK, '>:utf8', $lockfile
* Do not use LOCK_UN: closing lockfile would unlock
* In my opinion much cleaner flow control

Possible improvements:
* When using locking, only one process would be able to write to temporary
  file.  Therefore temporary file can now have non-random name; no need for
  File::Temp nor unique_id (performance).  It would probably be best done
  via ->get_tempfile() method, or something like that.
* Run tests which contain 'sleep 1' in them (which is required to test how
  they handle concurrent access, e.g. testing cache miss stampede situation)
  only when test is run with --long.  This would require update to
  t/test-lib.sh to export GIT_TEST_LONG to external program in test_external
  etc.
* ->get_lockname should perhaps be ->get_lockfile which would return opened
  filehandle, or filehandle + filename like File::Temp::tempfile, or objects
  that functions like filehandle but stringifies to filename like
  File::Temp->new().
* Equivalent of old _Make_Path?
* Would ability to turn this feature off make sense?

 gitweb/cache.pm                 |   29 ++++++++++++++++-
 t/t9503/test_cache_interface.pl |   65 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index 899a4b4..5c2de7e 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -24,6 +24,7 @@ use File::Path qw(mkpath);
 use File::Spec;
 use File::Temp;
 use Digest::MD5 qw(md5_hex);
+use Fcntl qw(:flock);
 
 # by default, the cache nests all entries on the filesystem two
 # directories deep
@@ -195,6 +196,12 @@ sub path_to_key {
 	return $filepath;
 }
 
+sub get_lockname {
+	my $self = shift;
+
+	return $self->path_to_key(@_) . '.lock';
+}
+
 # ----------------------------------------------------------------------
 # worker methods
 
@@ -330,17 +337,35 @@ sub compute {
 	my ($self, $p_key, $p_coderef) = @_;
 
 	my $data = $self->get($p_key);
-	if (!defined $data) {
+	return $data if defined $data;
+
+	my $dir;
+	my $lockfile = $self->get_lockname($p_key, \$dir);
+
+	# ensure that directory leading to lockfile exists
+	mkpath($dir, 0, 0777) if !-d $dir;
+
+	open my $lock_fh, '+>', $lockfile;
+	#	or die "Can't open lockfile '$lockfile': $!";
+	if (my $lock_state = flock($lock_fh, LOCK_EX | LOCK_NB)) {
+		# acquired writers lock
 		$data = $p_coderef->($self, $p_key);
 		$self->set($p_key, $data);
+	} else {
+		# get readers lock
+		flock($lock_fh, LOCK_SH);
+		$data = $self->fetch($p_key);
 	}
-
+	# closing lockfile releases lock
+	close $lock_fh;
 	return $data;
 }
 
 1;
 } # end of package GitwebCache::SimpleFileCache;
 
+# ======================================================================
+
 # human readable key identifying gitweb output
 sub gitweb_output_key {
 	return href(-replay => 1, -full => 1, -path_info => 0);
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 3945adc..42fe359 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -107,6 +107,71 @@ is($cache->compute($key, \&get_value), $value, 'compute 2nd time (get)');
 is($cache->compute($key, \&get_value), $value, 'compute 3rd time (get)');
 cmp_ok($call_count, '==', 1, 'get_value() is called once from compute');
 
+# Test 'stampeding herd' / 'cache miss stampede' problem
+# (probably should be run only if GIT_TEST_LONG)
+#
+sub get_value_slow {
+	$call_count++;
+	sleep 1;
+	return $value;
+}
+my ($pid, $kid_fh);
+
+$call_count = 0;
+$cache->remove($key);
+$pid = open $kid_fh, '-|';
+SKIP: {
+	skip "cannot fork: $!", 1
+		unless defined $pid;
+
+	my $data = $cache->get($key);
+	if (!defined $data) {
+		$data = get_value_slow();
+		$cache->set($key, $data);
+	}
+
+	if ($pid) {
+		my $child_count = <$kid_fh>;
+		chomp $child_count;
+
+		waitpid $pid, 0;
+		close $kid_fh;
+
+		$call_count += $child_count;
+	} else {
+		print "$call_count\n";
+		exit 0;
+	}
+
+	cmp_ok($call_count, '==', 2, 'parallel get/set: get_value_slow() called twice');
+}
+
+$call_count = 0;
+$cache->remove($key);
+$pid = open $kid_fh, '-|';
+SKIP: {
+	skip "cannot fork: $!", 1
+		unless defined $pid;
+
+	my $data = $cache->compute($key, \&get_value_slow);
+
+	if ($pid) {
+		my $child_count = <$kid_fh>;
+		chomp $child_count;
+
+		waitpid $pid, 0;
+		close $kid_fh;
+
+		$call_count += $child_count;
+	} else {
+		print "$call_count\n";
+		exit 0;
+	}
+
+	cmp_ok($call_count, '==', 1, 'parallel compute: get_value_slow() called once');
+}
+
+
 # Test cache expiration for 'expire now'
 #
 $cache->set_expires_in(60*60*24); # set expire time to 1 day
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 08/10] gitweb/cache.pm - Serve stale data when waiting for filling cache
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (6 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 07/10] gitweb/cache.pm - Use locking to avoid 'cache miss stampede' problem Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache Jakub Narebski
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

When a process fails to acquire the exclusive (writers) lock, then instead
of waiting for the other process to (re)generate and fill the cache, serve
stale (expired) data from the cache.  This is of course possible only if
there is some stale data in the cache for the given key.

This feature of GitwebCache::SimpleFileCache is used only by the
->compute($key, $code) method.  It is controlled by the 'max_lifetime'
cache parameter; you can set it to -1 to always serve stale data
if it exists, and you can set it to 0 (or any value smaller than
'expires_min') to turn this feature off.
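
For example, in the gitweb config file (a sketch; the values are only an
illustration):

    # serve stale data while refreshing, as long as it is less than 1 hour old
    $cache_options{'max_lifetime'} = 60*60;

    # or: never serve stale data, always wait for the cache to be refreshed
    #$cache_options{'max_lifetime'} = 0;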

This feature, as currently implemented, makes the ->compute() method a
bit asymmetric with respect to the process that acquired the writers lock
and those processes that didn't, which can be seen in the new test in t9503.
The process that is to regenerate (refresh) data in the cache must wait for
the data to be generated in full before showing anything to the client, while
the other processes show stale (expired) data immediately.  In order to
remove or reduce this asymmetry gitweb would need to employ one of two
alternate solutions.  Either the data should be (re)generated in the
background, so that the process that acquired the writers lock would generate
data in the background while serving stale data, or alternatively the process
that generates the data should pass its output to the original STDOUT while
capturing it ("tee" output).

When developing this feature, the ->is_valid() method acquired an additional
optional parameter, with which one can pass an expire time instead of using
the cache-wide global expire time.

To be implemented (from original patch by J.H.):
* background building,
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Differences from v1:
* Proper commit messages
* Both serving stale data (this patch) and background generation (next
  patch) were together in a single patch in previous version of this
  series.
* There is a limit (like in J.H. patch) on how stale a cache entry can be to
  be eligible to be served while waiting for the cache to be refreshed.  You
  can set 'max_lifetime' to -1 to always serve stale data if it exists, no
  matter how long ago it was generated (this is the default, like in v1).
* The above led to modifying ->is_valid() to accept optional parameter
  specifying max age.  Note that this max age is not influenced by
  'check_load' (is not adaptive).
* Because background cache generation was split into a separate patch, for now
  serving stale data is used only for readers (processes which would otherwise
  have to wait for the cache to be refreshed).
* This feature is turned off in tests, and turned on only for testing this
  feature.  There is an extra test that this feature can be turned off, which
  led to refactoring out the common code.

Differences from relevant parts of J.H. patch:
* Instead of using $maxCacheLife, use 'max_lifetime' option (with
  'max_cache_lifetime' as optional spelling in GitwebCache::SimpleFileCache
  constructor).
* Use 5*60*60 seconds for 5 hours, rather than 18000 (or 18_000) seconds.
* Serving stale data was enabled only when background cache regeneration was
  enabled; here it is enabled for readers regardless of whether background
  generation (introduced in the next commit) is turned on or not.

Possible improvements:
* Run long tests only with --long (well, they are not _that_ long).

 gitweb/cache.pm                 |   23 ++++++++++----
 gitweb/gitweb.perl              |    8 +++++
 t/t9503/test_cache_interface.pl |   63 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index 5c2de7e..e3bbe57 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -66,6 +66,7 @@ sub new {
 
 	my ($root, $depth, $ns);
 	my ($expires_min, $expires_max, $increase_factor, $check_load);
+	my ($max_lifetime);
 	if (defined $p_options_hash_ref) {
 		$root  =
 			$p_options_hash_ref->{'cache_root'} ||
@@ -82,6 +83,9 @@ sub new {
 			$p_options_hash_ref->{'expires_max'};
 		$increase_factor = $p_options_hash_ref->{'expires_factor'};
 		$check_load      = $p_options_hash_ref->{'check_load'};
+		$max_lifetime =
+			$p_options_hash_ref->{'max_lifetime'} ||
+			$p_options_hash_ref->{'max_cache_lifetime'};
 	}
 	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
 	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
@@ -91,12 +95,14 @@ sub new {
 		if (!defined($expires_max) && exists $p_options_hash_ref->{'expires_in'});
 	$expires_max = 1200 unless (defined($expires_max));
 	$increase_factor = 60 unless defined($increase_factor);
+	$max_lifetime = -1 unless defined($max_lifetime);
 
 	$self->set_root($root);
 	$self->set_depth($depth);
 	$self->set_namespace($ns);
 	$self->set_expires_min($expires_min);
 	$self->set_expires_max($expires_max);
+	$self->set_max_lifetime($max_lifetime);
 	$self->set_increase_factor($increase_factor);
 	$self->set_check_load($check_load);
 
@@ -108,7 +114,7 @@ sub new {
 
 # http://perldesignpatterns.com/perldesignpatterns.html#AccessorPattern
 
-foreach my $i (qw(depth root namespace expires_min expires_max increase_factor check_load)) {
+foreach my $i (qw(depth root namespace expires_min expires_max increase_factor check_load max_lifetime)) {
 	my $field = $i;
 	no strict 'refs';
 	*{"get_$field"} = sub {
@@ -295,7 +301,7 @@ sub remove {
 
 # exists in cache and is not expired
 sub is_valid {
-	my ($self, $key) = @_;
+	my ($self, $key, $expires_in) = @_;
 
 	my $path = $self->path_to_key($key);
 
@@ -303,7 +309,7 @@ sub is_valid {
 	return 0 unless -f $path;
 
 	# expire time can be set to never
-	my $expires_in = $self->get_expires_in();
+	$expires_in = defined $expires_in ? $expires_in : $self->get_expires_in();
 	return 1 unless (defined $expires_in && $expires_in >= 0);
 
 	# is file expired?
@@ -352,9 +358,14 @@ sub compute {
 		$data = $p_coderef->($self, $p_key);
 		$self->set($p_key, $data);
 	} else {
-		# get readers lock
-		flock($lock_fh, LOCK_SH);
-		$data = $self->fetch($p_key);
+		# try to retrieve stale data
+		$data = $self->fetch($p_key)
+			if $self->is_valid($p_key, $self->get_max_lifetime());
+		if (!defined $data) {
+			# get readers lock if there is no stale data to serve
+			flock($lock_fh, LOCK_SH);
+			$data = $self->fetch($p_key);
+		}
 	}
 	# closing lockfile releases lock
 	close $lock_fh;
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index aabed72..ea0ad25 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -285,6 +285,14 @@ our %cache_options = (
 	# lifetime control.
 	# (Compatibile with Cache::Adaptive.)
 	'check_load' => \&get_loadavg,
+
+	# Maximum cache file life, in seconds.  If a cache entry's lifetime exceeds
+	# this value, it won't be served as being too stale while waiting for the
+	# cache to be regenerated/refreshed, instead of trying to display the
+	# existing cache data.
+	# Set it to -1 to always serve existing data if it exists,
+	# set it to 0 to turn off serving stale data - always wait.
+	'max_lifetime' => 5*60*60, # 5 hours
 );
 
 
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 42fe359..922c7cd 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -53,8 +53,11 @@ ok($return,          "run   gitweb/cache.pm");
 
 # Test creating a cache
 #
-my $cache = new_ok('GitwebCache::SimpleFileCache',
-	[ { 'cache_root' => 'cache', 'cache_depth' => 2 } ]);
+my $cache = new_ok('GitwebCache::SimpleFileCache', [ {
+	'cache_root' => 'cache',
+	'cache_depth' => 2,
+	'max_lifetime' => 0, # turn it off
+} ]);
 
 # Test that default values are defined
 #
@@ -269,6 +272,62 @@ if ((my $use_perlio_util = $test_perlio_util)) {
 	is($test_data, $cached_output,             'action output is printed (from cache)');
 }
 
+# Test that cache returns stale data in existing but expired cache situation
+# (probably should be run only if GIT_TEST_LONG)
+#
+my $stale_value = 'Stale Value';
+my $child_data = '';
+$cache->set($key, $stale_value);
+$call_count = 0;
+sub run_compute_forked {
+	my $pid = shift;
+
+	my $data = $cache->compute($key, \&get_value_slow);
+
+	if ($pid) {
+		$child_data = <$kid_fh>;
+		chomp $child_data;
+
+		waitpid $pid, 0;
+		close $kid_fh;
+	} else {
+		print "$data\n";
+		exit 0;
+	}
+
+	return $data;
+}
+$cache->set_expires_in(0);    # expire now
+$cache->set_max_lifetime(-1); # forever
+$pid = open $kid_fh, '-|';
+SKIP: {
+	skip "cannot fork: $!", 2
+		unless defined $pid;
+
+	my $data = run_compute_forked($pid);
+
+	# returning stale data works
+	ok($data eq $stale_value || $child_data eq $stale_value,
+	   'stale data in at least one process when expired');
+
+	$cache->set_expires_in(-1); # never expire for next ->get
+	is($cache->get($key), $value, 'value got set correctly, even if stale data returned');
+}
+$cache->set_expires_in(0);   # expire now
+$cache->set_max_lifetime(0); # don't serve stale data
+$pid = open $kid_fh, '-|';
+SKIP: {
+	skip "cannot fork: $!", 1
+		unless defined $pid;
+
+	my $data = run_compute_forked($pid);
+
+	# no returning stale data
+	ok($data ne $stale_value && $child_data ne $stale_value,
+	   'configured to never return stale data');
+}
+$cache->set_expires_in(-1);
+
 done_testing();
 
 print Dumper($cache);
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (7 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 08/10] gitweb/cache.pm - Serve stale data when waiting for filling cache Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  2010-02-09 22:23   ` Jakub Narebski
  2010-02-09 10:30 ` [RFC PATCHv2 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache Jakub Narebski
  9 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

This commit removes the asymmetry in serving stale data (if it exists)
when regenerating the cache in GitwebCache::SimpleFileCache.  The process
that acquired the exclusive (writers) lock, and is therefore selected to
be the one that (re)generates data to fill the cache, can now generate
the data in the background, while serving stale data.

This feature can be enabled or disabled on demand via the 'background_cache'
cache parameter.  It is turned on by default.
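
To disable it, one can set in the gitweb config file (sketch):

    $cache_options{'background_cache'} = 0;  # always regenerate in the foreground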

To be implemented (from original patch by J.H.):
* server-side progress indicator when waiting for filling cache,
  which in turn requires separating situations (like snapshots and
  other non-HTML responses) where we should not show 'please wait'
  message

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
This is second part of what was single commit in previous version of
this series.

Differences from v1:
* Proper commit message
* You can now configure whether to use background cache (re)generation with
  'background_cache' parameter (field), which is turned on by default.
* Simplified code for forking writer.
* Close lockfile, releasing lock, before waitpid for child.
* Child (serving stale data) doesn't close lockfile explicitly, although
  that doesn't mean much as lockfile is closed automatically on leaving
  dynamic scope for indirect filehandle $lock_fh (on its destruction).
  This change perhaps should be reverted.

Differences from relevant parts of J.H. patch:
* Fork (run generating process in background) only if there is stale data to
  serve (and if background cache is turned on).
* In J.H. patch forking was done unconditionally; only generation or exit
  depended on $backgroundCache.
* Lock before forking (which I am not sure is a good idea, but on the
  other hand the tests pass, so...)
* Tests (currently only of API)
* In my opinion cleaner control flow.

 gitweb/cache.pm                 |   36 +++++++++++++++++++++++++++++-------
 gitweb/gitweb.perl              |    9 +++++++++
 t/t9503/test_cache_interface.pl |   14 ++++++++------
 3 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index e3bbe57..c3a6c26 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -66,7 +66,7 @@ sub new {
 
 	my ($root, $depth, $ns);
 	my ($expires_min, $expires_max, $increase_factor, $check_load);
-	my ($max_lifetime);
+	my ($max_lifetime, $background_cache);
 	if (defined $p_options_hash_ref) {
 		$root  =
 			$p_options_hash_ref->{'cache_root'} ||
@@ -86,6 +86,7 @@ sub new {
 		$max_lifetime =
 			$p_options_hash_ref->{'max_lifetime'} ||
 			$p_options_hash_ref->{'max_cache_lifetime'};
+		$background_cache = $p_options_hash_ref->{'background_cache'};
 	}
 	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
 	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
@@ -96,6 +97,7 @@ sub new {
 	$expires_max = 1200 unless (defined($expires_max));
 	$increase_factor = 60 unless defined($increase_factor);
 	$max_lifetime = -1 unless defined($max_lifetime);
+	$background_cache = 1 unless defined($background_cache);
 
 	$self->set_root($root);
 	$self->set_depth($depth);
@@ -105,6 +107,7 @@ sub new {
 	$self->set_max_lifetime($max_lifetime);
 	$self->set_increase_factor($increase_factor);
 	$self->set_check_load($check_load);
+	$self->set_background_cache($background_cache);
 
 	return $self;
 }
@@ -114,7 +117,9 @@ sub new {
 
 # http://perldesignpatterns.com/perldesignpatterns.html#AccessorPattern
 
-foreach my $i (qw(depth root namespace expires_min expires_max increase_factor check_load max_lifetime)) {
+foreach my $i (qw(depth root namespace
+                  expires_min expires_max increase_factor check_load
+                  max_lifetime background_cache)) {
 	my $field = $i;
 	no strict 'refs';
 	*{"get_$field"} = sub {
@@ -353,14 +358,31 @@ sub compute {
 
 	open my $lock_fh, '+>', $lockfile;
 	#	or die "Can't open lockfile '$lockfile': $!";
+
+	# try to retrieve stale data
+	$data = $self->fetch($p_key)
+		if $self->is_valid($p_key, $self->get_max_lifetime());
+
 	if (my $lock_state = flock($lock_fh, LOCK_EX | LOCK_NB)) {
 		# acquired writers lock
-		$data = $p_coderef->($self, $p_key);
-		$self->set($p_key, $data);
+		my $pid = fork() if ($data && $self->get_background_cache());
+		if (!defined $pid || $pid) {
+			# parent, or didn't fork
+			$data = $p_coderef->($self, $p_key);
+			$self->set($p_key, $data);
+
+			if ($pid) {
+				# releases lock before waiting and exiting
+				close $lock_fh;
+				# wait for child (which would print) and exit
+				waitpid $pid, 0;
+				exit 0;
+			}
+		} else {
+			# child is to serve stale data
+			return $data;
+		}
 	} else {
-		# try to retrieve stale data
-		$data = $self->fetch($p_key)
-			if $self->is_valid($p_key, $self->get_max_lifetime());
 		if (!defined $data) {
 			# get readers lock if there is no stale data to serve
 			flock($lock_fh, LOCK_SH);
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index ea0ad25..8a97e50 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -293,6 +293,15 @@ our %cache_options = (
 	# Set it to -1 to always serve existing data if it exists,
 	# set it to 0 to turn off serving stale data - always wait.
 	'max_lifetime' => 5*60*60, # 5 hours
+
+	# This enables/disables background caching.  If it is set to a true value,
+	# the caching engine will return stale data (if it is not older than
+	# 'max_lifetime' seconds) if it exists, and launch the process regenerating
+	# (refreshing) the cache in the background.  If it is set to a false value,
+	# the process that fills the cache must always wait for data to be generated.
+	# In theory this will make gitweb seem more responsive at the price of
+	# serving possibly stale data.
+	'background_cache' => 1,
 );
 
 
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 922c7cd..41c7bc3 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -299,32 +299,34 @@ sub run_compute_forked {
 }
 $cache->set_expires_in(0);    # expire now
 $cache->set_max_lifetime(-1); # forever
+ok($cache->get_background_cache(), 'generate cache in background by default');
 $pid = open $kid_fh, '-|';
 SKIP: {
-	skip "cannot fork: $!", 2
+	skip "cannot fork: $!", 3
 		unless defined $pid;
 
 	my $data = run_compute_forked($pid);
 
 	# returning stale data works
-	ok($data eq $stale_value || $child_data eq $stale_value,
-	   'stale data in at least one process when expired');
+	is($data,       $stale_value, 'stale data in parent when expired');
+	is($child_data, $stale_value, 'stale data in child  when expired');
 
 	$cache->set_expires_in(-1); # never expire for next ->get
+	sleep 1; # wait for background process to have chance to set data
 	is($cache->get($key), $value, 'value got set correctly, even if stale data returned');
 }
 $cache->set_expires_in(0);   # expire now
 $cache->set_max_lifetime(0); # don't serve stale data
 $pid = open $kid_fh, '-|';
 SKIP: {
-	skip "cannot fork: $!", 1
+	skip "cannot fork: $!", 2
 		unless defined $pid;
 
 	my $data = run_compute_forked($pid);
 
 	# no returning stale data
-	ok($data ne $stale_value && $child_data ne $stale_value,
-	   'configured to never return stale data');
+	isnt($data,       $stale_value, 'no stale data in parent if configured');
+	isnt($child_data, $stale_value, 'no stale data in child  if configured');
 }
 $cache->set_expires_in(-1);
 
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache
  2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
                   ` (8 preceding siblings ...)
  2010-02-09 10:30 ` [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background Jakub Narebski
@ 2010-02-09 10:30 ` Jakub Narebski
  9 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 10:30 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis, Jakub Narebski

When there exists a stale/expired (but not too stale) version of the
(re)generated page in the cache, gitweb returns the stale version (and
updates the cache in the background, assuming 'background_cache' is set
to a true value).  When there is no stale version suitable to serve the
client, currently we have to wait for the data to be generated in full
before showing it.  Add to GitwebCache::SimpleFileCache, via the
'generating_info' callback, the ability to show the user some activity
indicator / progress bar, to show that we are working on generating the data.

Gitweb itself uses a "Generating..." page as the activity indicator, which
redirects (via <meta http-equiv="Refresh" ...>) to the refreshed version
of the page after the cache is filled (via the trick of not closing the page
and therefore not closing the connection till data is available in the cache,
checked by getting a shared/readers lock on the lockfile for the cache entry).
The git_generating_data_html() subroutine, which is used by gitweb
to implement this feature, is highly configurable: you can choose the
initial delay, the frequency of writing some data so that the connection
won't get closed, and the maximum time to wait for data in the "Generating..."
page (see the %generating_options hash).

Currently git_generating_data_html() contains a hardcoded "whitelist" of
actions for which such an HTML "Generating..." page makes sense.
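
The idea can be sketched as follows (a rough outline only; the subroutine
name, markup and variables are made up for illustration and are not the
code from the patch):

    use Fcntl qw(:flock);

    sub generating_info_sketch {
        my ($cache, $key, $lock_fh) = @_;

        # reload the same URL as soon as this page finishes loading
        print qq(<meta http-equiv="Refresh" content="0" />\n);
        print "Generating...\n";

        # keep the connection open until the writer has filled the cache:
        # the shared (readers) lock is granted only after the exclusive lock
        # is released; the page then ends, and the browser's refresh
        # re-requests the URL and is served from the freshly filled cache
        flock($lock_fh, LOCK_SH);
    }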


This implements final feature from the original gitweb output caching
patch by J.H.

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Differences from v1:
* Proper commit message.
* get/set_generating_info accessors are now generated.
* Fixed bug in ->generating_info() method, caught by new tests.
* You can now turn off the "Generating..." feature by unsetting (setting to
  undef or deleting) the 'generating_info' key in %cache_options
* git_generating_data_html() uses whitelist of actions where "Generating..."
  page is allowed, instead of blacklist of actions where it should not be
  used.
* Action check is done using extended regexp rather than comparing with
  strings with 'eq'.
* git_generating_data_html() must now stop capturing the response; with
  'print $out <sth>' it was easy to ensure printing the activity indicator
  to the client (to standard output).

Differences from v1, which are also differences from J.H. patch:
* "Generating..." info behavior can be configured (at least the timings) via
  %generating_options hash, instead of having those options hardcoded.
* There is initial 'startup_delay' (by default 1 second); if the data is
  generated by that time, the "Generating..." page is not shown.
* Waiting is done using blocking flock + alarm, rather than "busy wait"
  loop with non-blocking flock + sleep.
* Basic test for the "Generating..." feature (which, however, didn't
  catch the error mentioned below), run only when PerlIO::Util is
  available.

Differences from relevant parts of J.H. patch:
* The subroutine that is responsible for doing "Generating..." progress
  info / activity indicator (cacheWaitForUpdate() subroutine in J.H. patch,
  git_generating_data_html() in this patch) is in gitweb.perl, and not in
  cache.pm.  It is passed as callback (as code reference) to $cache
  constructor: this method should be used also for other subroutines from
  gitweb which are needed in cache, namely key generation and data
  generation subroutines.
* gitweb prints the generating info in a more restricted set of
  situations; the set of actions for which gitweb does not generate an
  activity indicator is larger.  We could probably provide an activity
  indicator also for (possibly) non-HTML output, like 'blob_plain' or
  'patches', provided that 'User-Agent' denotes a web browser.
* Removed (by accident) the copyright assignment from the "Generating..."
  page.  Needs to be updated and brought back.
* Fixed typo in DTD URL.

Required improvements:
* I have just realized that the tests of this feature do not test the
  case when ->generating_info does redirect.  We need to fork (if
  enabled) if there is stale data to serve OR (what is currently
  missing) if there is ->get_generating_info (and if it is marked as
  redirecting).
* Remove accidental "#'" needed to fix font-locking (syntax highlighting)
  that got confused in esc_param.

Possible improvements:
* Ajax-y JavaScript-based activity indicator, without need for http-equiv
  refresh trick.  Proof of concept patch of mentioned method was present in
  previous series.

 gitweb/cache.pm                 |   23 +++++-
 gitweb/gitweb.perl              |  154 ++++++++++++++++++++++++++++++++++++++-
 t/t9503/test_cache_interface.pl |   45 +++++++++++
 3 files changed, 216 insertions(+), 6 deletions(-)

diff --git a/gitweb/cache.pm b/gitweb/cache.pm
index c3a6c26..df74ea1 100644
--- a/gitweb/cache.pm
+++ b/gitweb/cache.pm
@@ -66,7 +66,7 @@ sub new {
 
 	my ($root, $depth, $ns);
 	my ($expires_min, $expires_max, $increase_factor, $check_load);
-	my ($max_lifetime, $background_cache);
+	my ($max_lifetime, $background_cache, $generating_info);
 	if (defined $p_options_hash_ref) {
 		$root  =
 			$p_options_hash_ref->{'cache_root'} ||
@@ -87,6 +87,7 @@ sub new {
 			$p_options_hash_ref->{'max_lifetime'} ||
 			$p_options_hash_ref->{'max_cache_lifetime'};
 		$background_cache = $p_options_hash_ref->{'background_cache'};
+		$generating_info  = $p_options_hash_ref->{'generating_info'};
 	}
 	$root  = $DEFAULT_CACHE_ROOT  unless defined($root);
 	$depth = $DEFAULT_CACHE_DEPTH unless defined($depth);
@@ -108,6 +109,7 @@ sub new {
 	$self->set_increase_factor($increase_factor);
 	$self->set_check_load($check_load);
 	$self->set_background_cache($background_cache);
+	$self->set_generating_info($generating_info);
 
 	return $self;
 }
@@ -119,7 +121,7 @@ sub new {
 
 foreach my $i (qw(depth root namespace
                   expires_min expires_max increase_factor check_load
-                  max_lifetime background_cache)) {
+                  max_lifetime background_cache generating_info)) {
 	my $field = $i;
 	no strict 'refs';
 	*{"get_$field"} = sub {
@@ -164,6 +166,14 @@ sub set_expires_in {
 	$self->set_expires_max($duration);
 }
 
+sub generating_info {
+	my $self = shift;
+
+	if (defined $self->get_generating_info()) {
+		$self->get_generating_info()->($self, @_);
+	}
+}
+
 # ----------------------------------------------------------------------
 # utility functions and methods
 
@@ -367,6 +377,9 @@ sub compute {
 		# acquired writers lock
 		my $pid = fork() if ($data && $self->get_background_cache());
 		if (!defined $pid || $pid) {
+			# provide "generating page..." info if there is no stale data to serve
+			$self->generating_info($p_key, $lock_fh)
+				unless ($data);
 			# parent, or didn't fork
 			$data = $p_coderef->($self, $p_key);
 			$self->set($p_key, $data);
@@ -383,8 +396,12 @@ sub compute {
 			return $data;
 		}
 	} else {
+		# some other process is (re)generating the cache
 		if (!defined $data) {
-			# get readers lock if there is no stale data to serve
+			# there is no stale data to serve
+			# provide "generating page..." info
+			$self->generating_info($p_key, $lock_fh);
+			# get readers lock
 			flock($lock_fh, LOCK_SH);
 			$data = $self->fetch($p_key);
 		}
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 8a97e50..27c1adf 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -13,7 +13,7 @@ use CGI qw(:standard :escapeHTML -nosticky);
 use CGI::Util qw(unescape);
 use CGI::Carp qw(fatalsToBrowser);
 use Encode;
-use Fcntl ':mode';
+use Fcntl qw(:mode :flock);
 use File::Find qw();
 use File::Basename qw(basename);
 binmode STDOUT, ':utf8';
@@ -302,8 +302,33 @@ our %cache_options = (
 	# In theory this will make gitweb seem more responsive at the price of
 	# serving possibly stale data.
 	'background_cache' => 1,
-);
 
+	# Subroutine which would be called when gitweb has to wait for data to
+	# be generated (it can't serve stale data because there isn't any,
+	# or if it exists it is older than 'max_lifetime').  The default
+	# is to use git_generating_data_html(), which creates "Generating..."
+	# page, which would then redirect or redraw/rewrite the page when
+	# data is ready.
+	# Set it to `undef' to disable this feature.
+	#
+	# Such subroutine (if invoked from GitwebCache::SimpleFileCache)
+	# is passed the following parameters: $cache instance, human-readable
+	# $key to current page, and filehandle $lock_fh to lockfile.
+	'generating_info' => \&git_generating_data_html,
+);
+# You define site-wide options for "Generating..." page (if enabled) here;
+# override them with $GITWEB_CONFIG as necessary.
+our %generating_options = (
+	# The delay before displaying "Generating..." page, in seconds. It is
+	# intended for "Generating..." page to be shown only when really needed.
+	'startup_delay' => 1,
+	# The time before generating new piece of output, to prevent from
+	# redirection before data is ready, in seconds.
+	'print_interval' => 2,
+	# Maximum time "Generating..." page would be present, waiting for data,
+	# before unconditional redirect, in seconds.
+	'timeout' => $cache_options{'expires_min'},
+);
 
 # You define site-wide feature defaults here; override them with
 # $GITWEB_CONFIG as necessary.
@@ -1265,7 +1290,7 @@ sub to_utf8 {
 # correct, but quoted slashes look too horrible in bookmarks
 sub esc_param {
 	my $str = shift;
-	$str =~ s/([^A-Za-z0-9\-_.~()\/:@ ]+)/CGI::escape($1)/eg;
+	$str =~ s/([^A-Za-z0-9\-_.~()\/:@ ]+)/CGI::escape($1)/eg;#'
 	$str =~ s/ /\+/g;
 	return $str;
 }
@@ -3252,6 +3277,129 @@ sub blob_contenttype {
 ## ======================================================================
 ## functions printing HTML: header, footer, error page
 
+# creates "Generating..." page when caching enabled and not in cache
+sub git_generating_data_html {
+	my ($cache, $key, $lock_fh) = @_;
+
+	# whitelist of actions that should get "Generating..." page
+	unless ($action =~ /(?:blame(?:|_incremental) | blobdiff | blob |
+	                     commitdiff | commit | forks | heads | tags |
+	                     log | shortlog | history | search |
+	                     tag | tree | summary | project_list)/x) {
+		return;
+	}
+	# blacklist of actions that should not have "Generating..." page
+	#if ($action =~ /(?:atom | rss | opml |
+	#                 blob_plain | blobdiff_plain | commitdiff_plain |
+	#                 patch | patches |
+	#                 blame_data | search_help | object | project_index |
+	#                 snapshot)/x) { # binary
+	#	return;
+	#}
+
+	# Stop capturing response
+	#
+	# NOTE: this would not be needed if gitweb would do 'print $out',
+	# and one could rely on printing to STDOUT to be not captured
+	cache_stop(); # or gitweb could use 'print $STDOUT' in place of 'print STDOUT'
+
+	# Initial delay
+	if ($generating_options{'startup_delay'} > 0) {
+		eval {
+			local $SIG{ALRM} = sub { die "alarm clock restart\n" }; # NB: \n required
+			alarm $generating_options{'startup_delay'};
+			flock($lock_fh, LOCK_SH); # blocking readers lock
+			alarm 0;
+		};
+		if ($@) {
+			# propagate unexpected errors
+			die unless $@ =~ /alarm clock restart/;
+		} else {
+			# we got response within 'startup_delay' timeout
+			return;
+		}
+	}
+
+	my $title = "[Generating...] $site_name";
+	# TODO: the following fragment of code duplicates the one
+	# in git_header_html, and it should be refactored.
+	if (defined $project) {
+		$title .= " - " . to_utf8($project);
+		if (defined $action) {
+			$title .= "/$action";
+			if (defined $file_name) {
+				$title .= " - " . esc_path($file_name);
+				if ($action eq "tree" && $file_name !~ m|/$|) {
+					$title .= "/";
+				}
+			}
+		}
+	}
+	my $mod_perl_version = $ENV{'MOD_PERL'} ? " $ENV{'MOD_PERL'}" : '';
+
+	# Use the trick that 'refresh' HTTP header equivalent (set via http-equiv)
+	# with timeout of 0 seconds would redirect as soon as page is finished.
+	# This "Generating..." redirect page should not be cached (externally).
+	print STDOUT $cgi->header(-type => 'text/html', -charset => 'utf-8',
+	                          -status=> '200 OK', -expires => 'now');
+	print STDOUT <<"EOF";
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
+<!-- git web interface version $version -->
+<!-- git core binaries version $git_version -->
+<head>
+<meta http-equiv="content-type" content="text/html; charset=utf-8" />
+<meta http-equiv="refresh" content="0" />
+<meta name="generator" content="gitweb/$version git/$git_version$mod_perl_version" />
+<meta name="robots" content="noindex, nofollow" />
+<title>$title</title>
+</head>
+<body>
+EOF
+	print STDOUT 'Generating...';
+
+	local $| = 1; # autoflush
+	my $total_time = 0;
+	my $interval = $generating_options{'print_interval'} || 1;
+	my $timeout  = $generating_options{'timeout'};
+	my $alarm_handler = sub {
+		print STDOUT '.';
+		$total_time += $interval;
+		if ($total_time > $timeout) {
+			die "timeout\n";
+		}
+	};
+	eval {
+		# check if we can use functions from Time::HiRes
+		if (defined $t0) {
+			local $SIG{ALRM} = $alarm_handler;
+			Time::HiRes::alarm($interval, $interval);
+		} else {
+			local $SIG{ALRM} = sub {
+				$alarm_handler->();
+				alarm($interval);
+			};
+			alarm($interval);
+		}
+		flock($lock_fh, LOCK_SH); # blocking readers lock
+		alarm 0;
+	};
+	# It doesn't really matter if we got lock, or timed-out
+	# but we should re-throw unknown (unexpected) errors
+	die if ($@ and $@ !~ /timeout/);
+
+	print STDOUT <<"EOF";
+
+</body>
+</html>
+EOF
+
+	#exit 0;
+	return;
+}
+
 sub git_header_html {
 	my $status = shift || "200 OK";
 	my $expires = shift;
diff --git a/t/t9503/test_cache_interface.pl b/t/t9503/test_cache_interface.pl
index 41c7bc3..ac760a2 100755
--- a/t/t9503/test_cache_interface.pl
+++ b/t/t9503/test_cache_interface.pl
@@ -330,6 +330,51 @@ SKIP: {
 }
 $cache->set_expires_in(-1);
 
+# Test 'generating_info' feature
+#
+$cache->remove($key);
+my $progress_info = "Generating...";
+sub test_generating_info {
+	local $| = 1;
+	print "$progress_info";
+}
+$cache->set_generating_info(\&test_generating_info);
+# NOTE: I couldn't be bothered to rewrite these tests to work without PerlIO::Util
+$pid = open $kid_fh, '-|' if $perlio_util;
+my ($data, $output, $child_output);
+SKIP: {
+	skip "No PerlIO::Util", 2
+		unless $perlio_util;
+	skip "cannot fork: $!", 2
+		unless defined $pid;
+
+	*STDOUT->push_layer(scalar => \$output);
+	$data = $cache->compute($key, \&get_value_slow);
+	*STDOUT->pop_layer();
+
+	if ($pid) {
+		local $/ = "\0";
+		chomp($child_output = <$kid_fh>);
+		chomp($child_data   = <$kid_fh>);
+
+		waitpid $pid, 0;
+		close $kid_fh;
+	} else {
+		local $| = 1;
+		print "$output\0$data\0";
+		exit 0;
+	}
+
+	is("$output$data",             "$progress_info$value",
+	   'progress info and data from parent');
+	is("$child_output$child_data", "$progress_info$value",
+	   'progress info and data from child');
+}
+diag("parent output is $output");
+diag("parent data   is $data");
+diag("child  output is $child_output");
+diag("child  data   is $child_data");
+
 done_testing();
 
 print Dumper($cache);
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background
  2010-02-09 10:30 ` [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background Jakub Narebski
@ 2010-02-09 22:23   ` Jakub Narebski
  0 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-09 22:23 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis

On Tue, 9 Feb 2010, Jakub Narebski wrote:

[...]
> +               my $pid = fork() if ($data && $self->get_background_cache());
> +               if (!defined $pid || $pid) {
> +                       # parent, or didn't fork
> +                       $data = $p_coderef->($self, $p_key);
> +                       $self->set($p_key, $data);
> +
> +                       if ($pid) {
> +                               # releases lock before waiting and exiting
> +                               close $lock_fh;
> +                               # wait for child (which would print) and exit
> +                               waitpid $pid, 0;
> +                               exit 0;
> +                       }
> +               } else {
> +                       # child is to serve stale data
> +                       return $data;
> +               }
[...]

I have decided to rely on $SIG{CHLD} = 'IGNORE' to avoid creating
zombies, and to consistently exit in the child (the forked process) and
return in the parent.  This change will be in the next version of this
series.
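
Roughly, the reworked flow would then look something like this (a sketch
only, not the actual v3 code; it reuses the variables from the quoted
hunk above):

  $SIG{CHLD} = 'IGNORE'; # set early, so the kernel reaps children (no zombies)

  my $pid;
  $pid = fork() if ($data && $self->get_background_cache());
  if (defined $pid && $pid == 0) {
      # child: regenerate the cache entry in the background, then exit
      $self->set($p_key, $p_coderef->($self, $p_key));
      close $lock_fh;
      exit 0;
  }
  if (!$pid) {
      # didn't fork (disabled, failed, or no stale data): regenerate here
      $data = $p_coderef->($self, $p_key);
      $self->set($p_key, $data);
  }
  close $lock_fh;
  return $data; # parent (and the non-forking path) always returns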

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-09 10:30 ` [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching Jakub Narebski
@ 2010-02-10  1:12   ` Jakub Narebski
  2010-02-10  1:23     ` Petr Baudis
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-02-10  1:12 UTC (permalink / raw)
  To: git
  Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley,
	Petr Baudis

On Tue, 9 Feb 2010 at 11:30 +0100, Jakub Narebski wrote:

> The cache_fetch subroutine captures output (from STDOUT only, as
> STDERR is usually logged) using either ->push_layer()/->pop_layer()
> from PerlIO::Util submodule (if it is available), or by setting and
> restoring *STDOUT.  Note that only the former could be tested reliably
> to be reliable in t9503 test!

Scratch that, I have just checked that (at least for Apache + mod_cgi,
but I don't think that it matters) the latter solution, with setting
and restoring *STDOUT doesn't work: I would get data in cache (so it
can be restored later), but instead of output I would get Internal Server
Error ("The server encountered an internal error or misconfiguration and
was unable to complete your request.") without even a hint what the
problem was.  Sprinkling "die ...: $!" didn't help to catch this error:
I suspect that the problem is with capturing.

So we either would have to live with non-core PerlIO::Util or (pure Perl)
Capture::Tiny, or do the 'print -> print $out' patch...

[....]
> Capturing gitweb output
> =======================
> When output (response) caching is enabled, the caching mechanism has to
> capture gitweb output (perhaps while printing it to standard output) to
> save it to cache, unless the data is available in cache and not expired.
> 
> Because die_error uses 'exit', and because it uses git_header_html and
> other printing subroutines (which output has to be captured in other
> situations), it needs to disable caching, unless we are "tee"-ing.
> Otherwise we would get no output from die_error (because it is captured),
> but also we would not get data to be saved in cache and printed, because
> 'exit' in die_error would exit capture, too.  This restricts methods and
> modules that can be used to capture output.
> 
> Below there are presented various considered and used ways of capturing the
> output, or "tee"-ing it (capturing while printing), with their advantages
> and disadvantages.
> 
> 
> Capturing output (capture)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
[...]

> 5. Without 'print <sth>' to 'print $out <sth>' change to gitweb, one can try
>    manipulating *STDOUT directly, first saving *STDOUT or \*STDOUT to some
>    variable, then setting *STDOUT = $data_fh, where $data_fh is filehandle
>    opened to in-memory file.
> 
>    This seems to work, does not require large patch to gitweb, and does not
>    require extra (non-core) Perl module.  Nevertheless it seems fragile with
>    respect to restoring STDOUT, and even though basic tests (not included)
>    of this mechanism producted expected result, I wasn't able to write
>    working tests when using this method.
> 
>    We could probably examine how Capture::Tiny does it, and reuse (copy)
>    relevant parts of code, perhaps duplicating STDOUT, closing it and then
>    reopening as in-memory filehandle.
> 
>    YMMV (Your Mileage May Vary).

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-10  1:12   ` Jakub Narebski
@ 2010-02-10  1:23     ` Petr Baudis
  2010-02-10 11:28       ` Jakub Narebski
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Baudis @ 2010-02-10  1:23 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, John 'Warthog9' Hawley, John 'Warthog9' Hawley

On Wed, Feb 10, 2010 at 02:12:24AM +0100, Jakub Narebski wrote:
> On Tue, 9 Feb 2010 at 11:30 +0100, Jakub Narebski wrote:
> 
> > The cache_fetch subroutine captures output (from STDOUT only, as
> > STDERR is usually logged) using either ->push_layer()/->pop_layer()
> > from PerlIO::Util submodule (if it is available), or by setting and
> > restoring *STDOUT.  Note that only the former could be tested reliably
> > to be reliable in t9503 test!
> 
> Scratch that, I have just checked that (at least for Apache + mod_cgi,
> but I don't think that it matters) the latter solution, with setting
> and restoring *STDOUT doesn't work: I would get data in cache (so it
> can be restored later), but instead of output I would get Internal Server
> Error ("The server encountered an internal error or misconfiguration and
> was unable to complete your request.") without even a hint what the
> problem was.  Sprinkling "die ...: $!" didn't help to catch this error:
> I suspect that the problem is with capturing.
> 
> So we either would have to live with non-core PerlIO::Util or (pure Perl)
> Capture::Tiny, or do the 'print -> print $out' patch...

All the magic methods seem to be troublesome, but in that case I'd
really prefer a level of indirection instead of filehandle - as is,
'print (...) -> output (...)' ins. of 'print (...) -> print $out (...)'
(or whatever). That should be really flexible and completely
futureproof, and I don't think the level of indirection would incur any
measurable overhead, would it?

-- 
				Petr "Pasky" Baudis
If you can't see the value in jet powered ants you should turn in
your nerd card. -- Dunbal (464142)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-10  1:23     ` Petr Baudis
@ 2010-02-10 11:28       ` Jakub Narebski
  2010-02-10 12:02         ` Petr Baudis
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-02-10 11:28 UTC (permalink / raw)
  To: Petr Baudis
  Cc: git, John 'Warthog9' Hawley, John 'Warthog9' Hawley

[-- Attachment #1: Type: text/plain, Size: 4118 bytes --]

On Wed, 10 Feb 2010, Petr Baudis wrote:
> On Wed, Feb 10, 2010 at 02:12:24AM +0100, Jakub Narebski wrote:
> > On Tue, 9 Feb 2010 at 11:30 +0100, Jakub Narebski wrote:
> > 
> > > The cache_fetch subroutine captures output (from STDOUT only, as
> > > STDERR is usually logged) using either ->push_layer()/->pop_layer()
> > > from PerlIO::Util submodule (if it is available), or by setting and
> > > restoring *STDOUT.  Note that only the former could be tested reliably
> > > to be reliable in t9503 test!
> > 
> > Scratch that, I have just checked that (at least for Apache + mod_cgi,
> > but I don't think that it matters) the latter solution, with setting
> > and restoring *STDOUT doesn't work: I would get data in cache (so it
> > can be restored later), but instead of output I would get Internal Server
> > Error ("The server encountered an internal error or misconfiguration and
> > was unable to complete your request.") without even a hint what the
> > problem was.  Sprinkling "die ...: $!" didn't help to catch this error:
> > I suspect that the problem is with capturing.
> > 
> > So we either would have to live with non-core PerlIO::Util or (pure Perl)
> > Capture::Tiny, or do the 'print -> print $out' patch...
> 
> All the magic methods seem to be troublesome, but in that case I'd
> really prefer a level of indirection instead of filehandle - as is,
> 'print (...) -> output (...)' ins. of 'print (...) -> print $out (...)'
> (or whatever). That should be really flexible and completely
> futureproof, and I don't think the level of indirection would incur any
> measurable overhead, would it?

First, it is not only 'print (...) -> print $out (...)'; you need to
do all those:

   print  <sth>            ->  print  $out <sth>
   printf <sth>            ->  printf $out <sth>
   binmode STDOUT, <mode>  ->  binmode $out, <mode>

Second, using "tie" on filehandle (on *STDOUT) can be used also for 
just capturing output, not only for "tee"-ing; what's more to print
while capturing one has to do extra work.  It is quite similar to
replacing 'print (...)' with 'output (...)' etc., but using
tie/untie doesn't require large patch to gitweb.
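
Just to illustrate that extra work: a hypothetical tee-ing PRINT would
have to keep a duplicate of the real STDOUT around and write to both,
something like (sketch only):

  package TeeSTDOUT; # name made up for this example

  sub TIEHANDLE {
      my ($class, $dataref, $real_fh) = @_;
      # $real_fh must be a dup of STDOUT taken *before* tie-ing, e.g.
      #   open my $real_fh, '>&', \*STDOUT;
      #   tie *STDOUT, 'TeeSTDOUT', \$data, $real_fh;
      return bless { 'scalar' => $dataref, 'fh' => $real_fh }, $class;
  }

  sub PRINT {
      my $self = shift;
      ${$self->{'scalar'}} .= join('', @_); # capture for the cache
      print { $self->{'fh'} } @_;           # and still send it to the client
  }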

Third, as you can see below tie-ing is about 1% slower than using
'output (...)', which in turn is less than 10% slower than explicit
filehandle solution i.e. 'print $out (...)'... and is almost twice
slower than solution using PerlIO::Util

Benchmark: timing 50000 iterations of output, perlio, print \$out, tie *STDOUT...
     output: 1.81462 wallclock secs ( 1.77 usr +  0.00 sys =  1.77 CPU) @ 28248.59/s (n=50000)
     perlio: 1.05585 wallclock secs ( 1.03 usr +  0.00 sys =  1.03 CPU) @ 48543.69/s (n=50000)
print \$out: 1.70027 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 30120.48/s (n=50000)
tie *STDOUT: 1.82248 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @ 27932.96/s (n=50000)
               Rate tie *STDOUT      output print \$out      perlio
tie *STDOUT 27933/s          --         -1%         -7%        -42%
output      28249/s          1%          --         -6%        -42%
print \$out 30120/s          8%          7%          --        -38%
perlio      48544/s         74%         72%         61%          --

Benchmark: running output, perlio, print \$out, tie *STDOUT for at least 10 CPU seconds...
     output: 10.7199 wallclock secs (10.53 usr +  0.00 sys = 10.53 CPU) @ 28029.63/s (n=295152)
     perlio: 11.2884 wallclock secs (10.46 usr +  0.00 sys = 10.46 CPU) @ 49967.11/s (n=522656)
print \$out: 10.5978 wallclock secs (10.43 usr +  0.00 sys = 10.43 CPU) @ 30318.79/s (n=316225)
tie *STDOUT: 11.3525 wallclock secs (10.68 usr +  0.00 sys = 10.68 CPU) @ 27635.96/s (n=295152)
               Rate tie *STDOUT      output print \$out      perlio
tie *STDOUT 27636/s          --         -1%         -9%        -45%
output      28030/s          1%          --         -8%        -44%
print \$out 30319/s         10%          8%          --        -39%
perlio      49967/s         81%         78%         65%          --

Attached is the script that was used to produce those results.
-- 
Jakub Narebski
Poland

[-- Attachment #2: Benchmark of different ways of capturing output in Perl code --]
[-- Type: text/plain, Size: 3491 bytes --]

#!/usr/bin/perl

use strict;
use warnings;

use Test::More;
use Time::HiRes;
use Benchmark qw(:all :hireswallclock);
use Data::Dumper;

use PerlIO::Util;

my $chunk = "test\n";

my $lorem = <<'EOF';
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.

Sed ut perspiciatis unde omnis iste natus error sit voluptatem
accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae
ab illo inventore veritatis et quasi architecto beatae vitae dicta
sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit
aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos
qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui
dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed
quia non numquam eius modi tempora incidunt ut labore et dolore magnam
aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum
exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex
ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in
ea voluptate velit esse quam nihil molestiae consequatur, vel illum
qui dolorem eum fugiat quo voluptas nulla pariatur?
EOF

sub capture_perlio {
	my $data = '';

	*STDOUT->push_layer(scalar => \$data);

	print $chunk;

	*STDOUT->pop_layer();

	return $data;
}

my $out = \*STDOUT;

sub capture_filehandle {
	my $data = '';

	open my $data_fh, '>', \$data;
	$out = $data_fh;

	print $out $chunk;

	$out = \*STDOUT;
	close $data_fh;

	return $data;
}

sub output {
	print $out @_;
}

sub capture_indirection {
	my $data = '';

	open my $data_fh, '>', \$data;
	$out = $data_fh;

	output $chunk;

	$out = \*STDOUT;
	close $data_fh;

	return $data;
}

sub capture_tied {
	my $data = '';

	tie *STDOUT, 'CatchSTDOUT', \$data;

	print $chunk;

	untie *STDOUT;

	return $data;
}

{
	package CatchSTDOUT;

	use strict;
	use warnings;
	use Data::Dumper;

	sub TIEHANDLE {
		my ($proto, $dataref) = @_;
		my $class = ref($proto) || $proto;

		my $self = {};
		$self = bless($self, $class);
		$self->{'scalar'} = $dataref;
		return $self;
	}

	sub PRINT {
		my $self = shift;
		${$self->{'scalar'}} .= join('',@_);
	}
}

# ----------------------------------------------------------------------

print 'capture_perlio      = '.capture_perlio();
print 'capture_filehandle  = '.capture_filehandle();
print 'capture_indirection = '.capture_indirection();
print 'capture_tied        = '.capture_tied();

$chunk = "another test\n";
print 'capture_perlio      = '.capture_perlio();
print 'capture_filehandle  = '.capture_filehandle();
print 'capture_indirection = '.capture_indirection();
print 'capture_tied        = '.capture_tied();

print "\n";


$chunk = $lorem;

my $result = timethese(50_000, {
	'perlio'      => \&capture_perlio,
	'print \$out' => \&capture_filehandle,
	'output'      => \&capture_indirection,
	'tie *STDOUT' => \&capture_tied,
});
cmpthese($result);
print "\n";

$result = timethese(-10, {
	'perlio'      => \&capture_perlio,
	'print \$out' => \&capture_filehandle,
	'output'      => \&capture_indirection,
	'tie *STDOUT' => \&capture_tied,
});
cmpthese($result);
print "\n";

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-10 11:28       ` Jakub Narebski
@ 2010-02-10 12:02         ` Petr Baudis
  2010-02-10 18:22           ` Jakub Narebski
  0 siblings, 1 reply; 18+ messages in thread
From: Petr Baudis @ 2010-02-10 12:02 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, John 'Warthog9' Hawley, John 'Warthog9' Hawley

On Wed, Feb 10, 2010 at 12:28:14PM +0100, Jakub Narebski wrote:
> On Wed, 10 Feb 2010, Petr Baudis wrote:
> > On Wed, Feb 10, 2010 at 02:12:24AM +0100, Jakub Narebski wrote:
> > > On Tue, 9 Feb 2010 at 11:30 +0100, Jakub Narebski wrote:
> > > 
> > > > The cache_fetch subroutine captures output (from STDOUT only, as
> > > > STDERR is usually logged) using either ->push_layer()/->pop_layer()
> > > > from PerlIO::Util submodule (if it is available), or by setting and
> > > > restoring *STDOUT.  Note that only the former could be tested reliably
> > > > to be reliable in t9503 test!
> > > 
> > > Scratch that, I have just checked that (at least for Apache + mod_cgi,
> > > but I don't think that it matters) the latter solution, with setting
> > > and restoring *STDOUT doesn't work: I would get data in cache (so it
> > > can be restored later), but instead of output I would get Internal Server
> > > Error ("The server encountered an internal error or misconfiguration and
> > > was unable to complete your request.") without even a hint what the
> > > problem was.  Sprinkling "die ...: $!" didn't help to catch this error:
> > > I suspect that the problem is with capturing.
> > > 
> > > So we either would have to live with non-core PerlIO::Util or (pure Perl)
> > > Capture::Tiny, or do the 'print -> print $out' patch...
> > 
> > All the magic methods seem to be troublesome, but in that case I'd
> > really prefer a level of indirection instead of filehandle - as is,
> > 'print (...) -> output (...)' ins. of 'print (...) -> print $out (...)'
> > (or whatever). That should be really flexible and completely
> > futureproof, and I don't think the level of indirection would incur any
> > measurable overhead, would it?
> 
> First, it is not only 'print (...) -> print $out (...)'; you need to
> do all those:
> 
>    print  <sth>            ->  print  $out <sth>
>    printf <sth>            ->  printf $out <sth>
>    binmode STDOUT, <mode>  ->  binmode $out, <mode>
> 
> Second, using "tie" on filehandle (on *STDOUT) can be used also for 
> just capturing output, not only for "tee"-ing; what's more to print
> while capturing one has to do extra work.  It is quite similar to
> replacing 'print (...)' with 'output (...)' etc., but using
> tie/untie doesn't require large patch to gitweb.
> 
> Third, as you can see below tie-ing is about 1% slower than using
> 'output (...)', which in turn is less than 10% slower than explicit
> filehandle solution i.e. 'print $out (...)'... and is almost twice
> slower than solution using PerlIO::Util
> 
> Benchmark: timing 50000 iterations of output, perlio, print \$out, tie *STDOUT...
>      output: 1.81462 wallclock secs ( 1.77 usr +  0.00 sys =  1.77 CPU) @ 28248.59/s (n=50000)
>      perlio: 1.05585 wallclock secs ( 1.03 usr +  0.00 sys =  1.03 CPU) @ 48543.69/s (n=50000)
> print \$out: 1.70027 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 30120.48/s (n=50000)
> tie *STDOUT: 1.82248 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @ 27932.96/s (n=50000)
>                Rate tie *STDOUT      output print \$out      perlio
> tie *STDOUT 27933/s          --         -1%         -7%        -42%
> output      28249/s          1%          --         -6%        -42%
> print \$out 30120/s          8%          7%          --        -38%
> perlio      48544/s         74%         72%         61%          --
> 
> Benchmark: running output, perlio, print \$out, tie *STDOUT for at least 10 CPU seconds...
>      output: 10.7199 wallclock secs (10.53 usr +  0.00 sys = 10.53 CPU) @ 28029.63/s (n=295152)
>      perlio: 11.2884 wallclock secs (10.46 usr +  0.00 sys = 10.46 CPU) @ 49967.11/s (n=522656)
> print \$out: 10.5978 wallclock secs (10.43 usr +  0.00 sys = 10.43 CPU) @ 30318.79/s (n=316225)
> tie *STDOUT: 11.3525 wallclock secs (10.68 usr +  0.00 sys = 10.68 CPU) @ 27635.96/s (n=295152)
>                Rate tie *STDOUT      output print \$out      perlio
> tie *STDOUT 27636/s          --         -1%         -9%        -45%
> output      28030/s          1%          --         -8%        -44%
> print \$out 30319/s         10%          8%          --        -39%
> perlio      49967/s         81%         78%         65%          --
> 
> Attached is the script that was used to produce those results.

Ok, on my machine it's similar:

                Rate      output tie *STDOUT print \$out
output      150962/s          --         -1%         -7%
tie *STDOUT 152769/s          1%          --         -6%
print \$out 162604/s          8%          6%          --

so a roughly consistent picture comes out of it.

I guess the time spent here is generally negligible in gitweb anyway...
I suggested using output() because I think hacking it would be _very_
_slightly_ easier than tied filehandle, but you are right that doing
that is also really easy; having the possibility to use PerlIO::Util
when available would be a nice, non-essential extra, but requiring it
for stock gitweb is not reasonable, especially seeing that it's not
packaged even for Debian. ;-)

-- 
				Petr "Pasky" Baudis
If you can't see the value in jet powered ants you should turn in
your nerd card. -- Dunbal (464142)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-10 12:02         ` Petr Baudis
@ 2010-02-10 18:22           ` Jakub Narebski
  2010-02-10 20:32             ` Jakub Narebski
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-02-10 18:22 UTC (permalink / raw)
  To: Petr Baudis
  Cc: git, John 'Warthog9' Hawley, John 'Warthog9' Hawley

Dnia środa 10. lutego 2010 13:02, Petr Baudis napisał:
> On Wed, Feb 10, 2010 at 12:28:14PM +0100, Jakub Narebski wrote:
>> On Wed, 10 Feb 2010, Petr Baudis wrote:
>>> On Wed, Feb 10, 2010 at 02:12:24AM +0100, Jakub Narebski wrote:

[...]
>>>> So we either would have to live with non-core PerlIO::Util or (pure Perl)
>>>> Capture::Tiny, or do the 'print -> print $out' patch...
>>> 
>>> All the magic methods seem to be troublesome, but in that case I'd
>>> really prefer a level of indirection instead of filehandle - as is,
>>> 'print (...) -> output (...)' ins. of 'print (...) -> print $out (...)'
>>> (or whatever). That should be really flexible and completely
>>> futureproof, and I don't think the level of indirection would incur any
>>> measurable overhead, would it?
>> 
>> First, it is not only 'print (...) -> print $out (...)'; [...]
>> 
>> Second, using "tie" on filehandle (on *STDOUT) can be used also for 
>> just capturing output, not only for "tee"-ing; [...]
>> 
>> Third, as you can see below tie-ing is about 1% slower than using
>> 'output (...)', which in turn is less than 10% slower than explicit
>> filehandle solution i.e. 'print $out (...)'... and is almost twice
>> slower than solution using PerlIO::Util
[...]
>>                Rate tie *STDOUT      output print \$out      perlio
>> tie *STDOUT 27636/s          --         -1%         -9%        -45%
>> output      28030/s          1%          --         -8%        -44%
>> print \$out 30319/s         10%          8%          --        -39%
>> perlio      49967/s         81%         78%         65%          --
 
> Ok, on my machine it's similar:
> 
>                 Rate      output tie *STDOUT print \$out
> output      150962/s          --         -1%         -7%
> tie *STDOUT 152769/s          1%          --         -6%
> print \$out 162604/s          8%          6%          --

I wonder why in my case the 'output (...)' solution was faster than tie,
and you have tie faster than 'output'... but I guess 1% is the noise
level.
 
> is roughly consistent image coming out of it.
> 
> I guess the time spent here is generally negligible in gitweb anyway...
> I suggested using output() because I think hacking it would be _very_
> _slightly_ easier than tied filehandle, but you are right that doing
> that is also really easy; having the possibility to use PerlIO::Util if
> available would be non-essentially nice, but requiring it by stock
> gitweb is not reasonable, especially seeing that it's not packaged even
> for Debian. ;-)

Well, the idea was to use PerlIO::Util if possible, checking for it via

  our $use_perlio_layers = eval { require PerlIO::Util; 1 };

and falling back to a generic mechanism if it is not present.  Only the
generic mechanism would have to be changed from manipulating *STDOUT
(*STDOUT = $data_fh etc.) to a tied filehandle.
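
The capture entry points would then just dispatch on that flag, along
these lines (a sketch; the subroutine and package names are made up,
not actual gitweb code):

  our $use_perlio_layers = eval { require PerlIO::Util; 1 };

  sub capture_start {
      my $dataref = shift;
      if ($use_perlio_layers) {
          # capture by pushing a :scalar layer on top of STDOUT
          *STDOUT->push_layer(scalar => $dataref);
      } else {
          # generic fallback: tie STDOUT to a class appending to $$dataref
          tie *STDOUT, 'GitwebCache::CaptureSTDOUT', $dataref;
      }
  }

  sub capture_stop {
      if ($use_perlio_layers) {
          *STDOUT->pop_layer();
      } else {
          untie *STDOUT;
      }
  }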

What we need to be careful about is the ':utf8' vs ':raw' mode (IO layer).
In the PerlIO layers solution, and in the 'print <sth> -> print $out <sth>'
solution where $out = $data_fh and $data_fh was opened to an in-memory
file, the data saved in the variable is already converted, has already
passed through the 'utf8' layer, and is saved as bytes.  And if we use
binary mode, it is passed unchanged, and is also saved as bytes.
Therefore we can write to the cache file in ':raw' mode, and read from
the cache file in ':raw' mode, and that is why we don't need separate
files for text and for binary output.

The PRINT method of the class tied to the filehandle gets the
_untransformed_ argument, so we have to use utf8::encode($str) when in
':utf8' mode, and either use PerlIO::get_layers on *STDOUT, or provide
a BINMODE method in the tied class to watch for mode changes.  (Note
that e.g. for snapshots we print the HTTP headers in ':utf8' mode, and
the snapshot itself in ':raw', i.e. binary, mode.)
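
A sketch of such a tied class (hypothetical name, just to show the
PRINT/BINMODE interplay described above):

  package GitwebCache::CaptureSTDOUT; # name made up for this example

  sub TIEHANDLE {
      my ($class, $dataref) = @_;
      # gitweb sets STDOUT to ':utf8' early on, so start in utf8 mode
      return bless { 'scalar' => $dataref, 'utf8' => 1 }, $class;
  }

  sub BINMODE {
      my ($self, $layer) = @_;
      # watch for mode changes: ':raw' (or no layer) means byte pass-through
      $self->{'utf8'} = (defined $layer && $layer =~ /utf8/) ? 1 : 0;
      return 1;
  }

  sub PRINT {
      my $self = shift;
      my $str = join('', @_);
      # PRINT gets the untransformed characters, so encode them by hand
      utf8::encode($str) if $self->{'utf8'};
      ${$self->{'scalar'}} .= $str;
      return 1;
  }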

But all that is doable, and not that much work.  Well, perhaps more
than in the case of 'print -> print $out' etc. and opening an in-memory
file via

  open $data_fh, '>', \$data;

but not that much more, and we don't need the extra global variable
$out.  And there is no large gitweb patch, and no worry about somebody
accidentally using 'print <sth>;' or 'printf <fmt>, <sth>;' instead of,
respectively, 'print $out <sth>;' and 'printf $out <fmt>, <sth>;'.


As to how I installed PerlIO::Util for myself (this might be interesting
to other people): in short, I used local::lib bootstrapping and the
'cpan' client.  I could from the start install some Perl modules from
CPAN locally using the 'cpan' client (included in the perl RPM).  I
asked on the #perl channel on FreeNode what to do, and they recommended
local::lib.  After following the bootstrapping technique described in
the local::lib manpage (see e.g. http://p3rl.org/local::lib), installing
PerlIO::Util in ~/perl5 is as simple as 'cpan -i PerlIO::Util' (or using
the 'cpan' client interactively).

You can always put
  use lib '/path/to/perl5/lib';
in your $GITWEB_CONFIG file.

Perhaps adding something like "use lib __DIR__.'/lib';" somewhere near
the beginning of gitweb (where __DIR__ is an appropriate expression that
expands to the directory gitweb.cgi/gitweb.perl is in) would be a good
idea?  Then you could make __DIR__/lib a symlink to local Perl modules,
or put extra modules by hand under __DIR__/lib.
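
For what it's worth, a core-modules-only way to get the effect of such
__DIR__ would be something like (untested sketch; FindBin has its own
caveats, e.g. under mod_perl):

  use FindBin;                  # $FindBin::Bin is the script's directory
  use lib "$FindBin::Bin/lib";
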
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching
  2010-02-10 18:22           ` Jakub Narebski
@ 2010-02-10 20:32             ` Jakub Narebski
  0 siblings, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-02-10 20:32 UTC (permalink / raw)
  To: Petr Baudis
  Cc: git, John 'Warthog9' Hawley, John 'Warthog9' Hawley

Dnia środa 10. lutego 2010 19:22, Jakub Narebski napisał:

> Well, the idea was to use PerlIO::Util if possible, checking it via
> 
>   our $use_perlio_layers = eval { require PerlIO::Util; 1 };
> 
> and fallback to generic mechanism if it is not present.  Only the
> generic mechanism would have to be changed from manipulating *STDOUT
> (*STDOUT = $data_fh etc.) to tied filehandle.

Well, damn, this is not needed[1].  Instead of manipulating *STDOUT,
it is enough to use 'select FILEHANDLE'.  It means that the capture
would look like this:

  open my $data_fh, '>', \$data;
  my $oldfh = select($data_fh);

  # ... code that uses 'print <sth>'

  select($oldfh);
  close $data_fh;

One thing that needs to be done is changing 'binmode STDOUT ...' into
'binmode select() ...'

[1] Thanks to mst on #perl channel for pointing this to me.
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-02-10 20:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-09 10:30 [RFC PATCHv2 00/10] gitweb: Simple file based output caching Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 01/10] gitweb: href(..., -path_info => 0|1) Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 02/10] gitweb/cache.pm - Very simple file based caching Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 03/10] gitweb/cache.pm - Stat-based cache expiration Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 04/10] gitweb: Use Cache::Cache compatibile (get, set) output caching Jakub Narebski
2010-02-10  1:12   ` Jakub Narebski
2010-02-10  1:23     ` Petr Baudis
2010-02-10 11:28       ` Jakub Narebski
2010-02-10 12:02         ` Petr Baudis
2010-02-10 18:22           ` Jakub Narebski
2010-02-10 20:32             ` Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 05/10] gitweb/cache.pm - Adaptive cache expiration time Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 06/10] gitweb: Use CHI compatibile (compute method) caching Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 07/10] gitweb/cache.pm - Use locking to avoid 'cache miss stampede' problem Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 08/10] gitweb/cache.pm - Serve stale data when waiting for filling cache Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 09/10] gitweb/cache.pm - Regenerate (refresh) cache in background Jakub Narebski
2010-02-09 22:23   ` Jakub Narebski
2010-02-09 10:30 ` [RFC PATCHv2 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache Jakub Narebski
