git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jakub Narebski" <jnareb@gmail.com>,
	"Ted Ts'o" <tytso@mit.edu>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Clemens Buchacher" <drizzd@aon.at>,
	"Shawn O. Pearce" <spearce@spearce.org>
Subject: Re: [RFC/PATCHv2 6/6] limit "contains" traversals based on commit generation
Date: Wed, 13 Jul 2011 16:58:44 -0400	[thread overview]
Message-ID: <20110713205844.GA15435@sigill.intra.peff.net> (raw)
In-Reply-To: <7vaach7wfh.fsf@alter.siamese.dyndns.org>

On Wed, Jul 13, 2011 at 01:33:22PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It takes barely any time to get the generation of the new commit, but we
> > spend .25 seconds writing the whole new cache file out. This could be
> > improved with a more clever disk format that contained a journal of
> > unsorted newly written entries. You'd still write the full cache out
> > once in a while, but the cost would be amortized.
> 
> This series consists of three somewhat related ideas:
> 
>  - A generic API to persistently annotate 20-byte keys (typically object
>    names);
> 
>  - Using that API to implement commit generation numbers;
> 
>  - Using commit generation numbers in "tag --contains" traversal.

Yup, I think that is accurate.

> I think the first one is independently a good change, but I have been
> wondering if the entire history needs to be annotated with the generation
> number for the goal of the third item. There may be stretches of history
> where timestamps are screwed up, but if the commits we should dig through
> while traversing (because they, their parents or their children record
> skewed timestamps) are minority in the history, the same generic API could
> be used to mark only these commits as such, by using far smaller number of
> disk I/Os, no?

I'm not sure it's workable. To use generations as a cutoff, even
for a subset the subset of commits with broken timestamps, you have to
know the generations of other commits, so you know where the cutoff is.
E.g., in "git tag --contains HEAD~1000", I want to search no farther
back than the generation of HEAD~1000. Which means I need to know what
its generation is, which involves going to the roots at least once. We
don't want to go to the roots on-demand and cache only that one value,
since doing so is expensive. So we may as well cache all generations
as we figure them out, not knowing which ones will be needed for future
traversals.

Or are you suggesting dropping generations entirely, and just using
marked-up commit timestamps (or even a flag saying "this timestamp is
bogus, don't use it for cutoffs")?  I sent such a patch with timings
earlier in this discussion (I can dig it up if you want). Even based on
a notes-cache[1], it's fast (because there aren't very many entries).

But there's a big question of deciding which timestamps are bogus. You
can only compare commits against their ancestors.  A commit skewed to
the past is easy to find; its timestamp is less than one of its
ancestors. But for a commit skewed to the future, its descendants will
all look skewed into the past.

I think we can write our algorithms such that future-skewed timestamps
don't give _wrong_ answers, but are just suboptimal (i.e., they may mark
many legitimate commits as "don't trust this timestamp for cutoff", even
though it is their future-skewed ancestor that is actually the problem).
But I think I still like generation numbers because:

  1. They're simple, complete, and unambiguous. It makes them easy to
     understand and use. And I suspect they can be applied in more
     places than just cutoff. For example, I seem to recall somebody
     mentioning that we could do topo-sorting much more efficiently with
     generation numbers. I'm not sure the same "future-skewed commits
     are correct but slow" property would hold there.

  2. The cache can be generated and maintained on the fly. A cache that
     is simply "if you are in this list, your timestamp is bogus"
     suffers from the problem I mentioned elsewhere. If a commit is not
     in the list, is the timestamp good, or has it simply not been
     checked yet?

If the performance numbers were way worse, I would be more inclined to
stay with a timestamp solution. But they're not really worse. The
performance for initial cache build is about the same (you have to go to
the roots in both cases), and the performance for using the cache is
about the same. The only slowness for the generation slowness is the
extra I/O on writing out the cache. But it's not very much, and it's
actually not that hard a problem to solve; I'm mainly leaving it because
I'm lazy. But it's not as if file system implementors and key/value
database designers haven't been solving the problem for the past 30
years.

-Peff

[1] If we did want to go the route of "is this commit in the set of
commits with bogus timestamps", you could probably make things even
simpler by using a fixed-size bloom filter. Sized appropriately, it will
occasionally give a false positive "this commit has a bogus timestamp".
But as discussed above, that is not going to cause a traversal cutoff to
do the wrong thing, but rather only to consider one extra commit it
might not have otherwise.

  reply	other threads:[~2011-07-13 20:59 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-13  6:47 [RFC/PATCHv2 0/6] generation numbers for faster traversals Jeff King
2011-07-13  6:57 ` [RFC/PATCHv2 1/6] decorate: allow storing values instead of pointers Jeff King
2011-07-13 17:52   ` Jonathan Nieder
2011-07-13 20:08     ` Jeff King
2011-07-14 17:34       ` Jeff King
2011-07-14 17:51         ` [PATCH 1/3] implement generic key/value map Jeff King
2011-07-14 18:52           ` Bert Wesarg
2011-07-14 18:54             ` Bert Wesarg
2011-07-14 18:55               ` Jeff King
2011-07-14 19:07                 ` Bert Wesarg
2011-07-14 19:14                   ` Jeff King
2011-07-14 19:18                     ` Bert Wesarg
2011-07-14 17:52         ` [PATCH 2/3] fast-export: use object to uint32 map instead of "decorate" Jeff King
2011-07-15  9:40           ` Sverre Rabbelier
2011-07-15 20:00             ` Jeff King
2011-07-14 17:53         ` [PATCH 3/3] decorate: use "map" for the underlying implementation Jeff King
2011-07-14 21:06         ` [RFC/PATCHv2 1/6] decorate: allow storing values instead of pointers Junio C Hamano
2011-08-04 22:43           ` [RFC/PATCH 0/5] macro-based key/value maps Jeff King
2011-08-04 22:45             ` [PATCH 1/5] implement generic key/value map Jeff King
2011-08-04 22:46             ` [PATCH 2/5] fast-export: use object to uint32 map instead of "decorate" Jeff King
2011-08-04 22:46             ` [PATCH 3/5] decorate: use "map" for the underlying implementation Jeff King
2011-08-04 22:46             ` [PATCH 4/5] map: implement persistent maps Jeff King
2011-08-04 22:46             ` [PATCH 5/5] implement metadata cache subsystem Jeff King
2011-08-04 22:48             ` [RFC/PATCH 0/2] patch-id caching Jeff King
2011-08-04 22:49               ` [PATCH 1/2] cherry: read default config Jeff King
2011-08-04 22:49               ` [PATCH 2/2] cache patch ids on disk Jeff King
2011-08-04 22:52                 ` Jeff King
2011-08-05 11:03             ` [RFC/PATCH 0/5] macro-based key/value maps Jeff King
2011-08-05 15:31               ` René Scharfe
2011-08-06  6:30                 ` Jeff King
2011-07-13  7:04 ` [RFC/PATCHv2 2/6] add metadata-cache infrastructure Jeff King
2011-07-13  8:18   ` Bert Wesarg
2011-07-13  8:31     ` Jeff King
2011-07-13  8:45       ` Bert Wesarg
2011-07-13 19:18         ` Jeff King
2011-07-13 19:40       ` Junio C Hamano
2011-07-13 19:33   ` Junio C Hamano
2011-07-13 20:25     ` Jeff King
2011-07-13  7:05 ` [RFC/PATCHv2 3/6] commit: add commit_generation function Jeff King
2011-07-13 14:26   ` Eric Sunshine
2011-07-13  7:05 ` [RFC/PATCHv2 4/6] pretty: support %G to show the generation number of a commit Jeff King
2011-07-13  7:06 ` [RFC/PATCHv2 5/6] check commit generation cache validity against grafts Jeff King
2011-07-13 14:26   ` Eric Sunshine
2011-07-13 19:35     ` Jeff King
2011-07-13  7:06 ` [RFC/PATCHv2 6/6] limit "contains" traversals based on commit generation Jeff King
2011-07-13  7:23   ` Jeff King
2011-07-13 20:33     ` Junio C Hamano
2011-07-13 20:58       ` Jeff King [this message]
2011-07-13 21:12         ` Junio C Hamano
2011-07-13 21:18           ` Jeff King
2011-07-15 18:22   ` Junio C Hamano
2011-07-15 20:40     ` Jeff King
2011-07-15 21:04       ` Junio C Hamano
2011-07-15 21:14         ` Jeff King
2011-07-15 21:01 ` Generation numbers and replacement objects Jakub Narebski
2011-07-15 21:10   ` Jeff King
2011-07-16 21:10     ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110713205844.GA15435@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=drizzd@aon.at \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=spearce@spearce.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).