All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, stolee@gmail.com
Subject: Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
Date: Fri, 7 Dec 2018 03:53:34 -0500	[thread overview]
Message-ID: <20181207085334.GA5167@sigill.intra.peff.net> (raw)
In-Reply-To: <20181206235446.147173-1-jonathantanmy@google.com>

On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote:

> This makes sense - I thought I shouldn't mention the commit graph in the
> code since it seems like a layering violation, but I felt the need to
> mention commit graph in a comment, so maybe the need to mention commit
> graph in the code is there too. Subsequently, maybe the lookup-for-type
> could be replaced by a lookup-in-commit-graph (maybe by using
> parse_commit_in_graph() directly), which should be at least slightly
> faster.

That makes more sense to me. If we don't have a commit graph at all,
it's a quick noop. If we do, we might binary search in the list of
commits for a non-commit. But that's strictly faster than finding the
object's type (which involves a binary search of a larger list, followed
by actually accessing the type info).

> > In general, it would be nice if we had a more incremental API
> > for accessing objects: open, get metadata, then read the data. That
> > would make these kinds of optimizations "free".
> 
> Would this be assuming that to read the data, you would (1) first need to
> read the metadata, and (2) there would be no redundancy in reading the
> two? It seems to me that for loose objects, you would want to perform
> all your reads at once, since any read requires opening the file, and
> for commit graphs, you just want to read what you want, since the
> metadata and the data are in separate places.

By metadata here, I don't mean the commit-graph data, but just the
object type and size. So I'm imagining an interface more like:

  - object_open() locates the object, and stores either the pack
    file/offset or a descriptor to a loose path in an opaque handle
    struct

  - object_size() and object_type() on that handle would do what you
    expect. For loose objects, these would parse the header (the
    equivalent of unpack_sha1_header()). For packed ones, they'd use the
    object header in the pack (and chase down the delta bits as needed).

  - object_contents() would return the full content

  - object_read() could sequentially read a subset of the file (this
    could replace the streaming interface we currently have)

We have most of the low-level bits for this already, if you poke into
what object_info_extended() is doing. We just don't have them packaged
in an interface which can persist across multiple calls.

With an interface like that, parse_object()'s large-blob check could be
something like the patch below.

But your case here is a bit more interesting. If we have a commit graph,
then we can avoid opening (or even finding!) the on-disk object at all.
So I actually think it makes sense to just check the commit-graph first,
as discussed above.

---
diff --git a/object.c b/object.c
index e54160550c..afce58c0bc 100644
--- a/object.c
+++ b/object.c
@@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	const struct object_id *repl = lookup_replace_object(r, oid);
 	void *buffer;
 	struct object *obj;
+	struct object_handle oh;
 
 	obj = lookup_object(r, oid->hash);
 	if (obj && obj->parsed)
 		return obj;
 
-	if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
-	    (!obj && has_object_file(oid) &&
-	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (check_object_signature(repl, NULL, 0, NULL) < 0) {
+	if (object_open(&oh, oid) < 0)
+		return NULL; /* missing object */
+
+	if (object_type(&oh) == OBJ_BLOB) {
+		/* this will call object_read() on 4k chunks */
+		if (check_object_signature_stream(&oh, oid)) {
 			error(_("sha1 mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
+		object_close(&oh); /* we don't care about contents */
 		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
 		return lookup_object(r, oid->hash);
 	}
 
-	buffer = read_object_file(oid, &type, &size);
+	type = object_type(&oh);
+	size = object_size(&oh);
+	buffer = object_contents(&oh);
+	object_close(&oh);
+
 	if (buffer) {
 		if (check_object_signature(repl, buffer, size, type_name(type)) < 0) {
 			free(buffer);

  reply	other threads:[~2018-12-07  8:53 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
2018-12-04 23:12 ` Stefan Beller
2018-12-06 23:36   ` Jonathan Tan
2018-12-07 13:49     ` Derrick Stolee
2018-12-05  4:54 ` Jeff King
2018-12-06 23:54   ` Jonathan Tan
2018-12-07  8:53     ` Jeff King [this message]
2018-12-05 23:15 ` Junio C Hamano
2018-12-07 21:50 ` [PATCH on master v2] " Jonathan Tan
2018-12-09  0:51   ` Junio C Hamano
2018-12-09  1:49     ` Junio C Hamano
2018-12-11 10:54     ` Jeff King
2018-12-12 19:58       ` Jonathan Tan
2018-12-13  1:27         ` Jeff King
2018-12-13 16:20           ` Derrick Stolee
2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
2018-12-14  3:20   ` Junio C Hamano
2018-12-14  8:45   ` Jeff King
2019-01-25 15:33 ` Regression in: [PATCH on sb/more-repo-in-api] " SZEDER Gábor
2019-01-25 19:56   ` Stefan Beller
2019-01-25 22:01     ` Jonathan Tan
2019-01-25 22:14     ` SZEDER Gábor
2019-01-25 22:21       ` SZEDER Gábor
2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
2019-01-27 13:28           ` SZEDER Gábor
2019-01-27 18:40             ` Derrick Stolee
2019-01-28 16:15           ` Jeff King
2019-01-28 16:57           ` Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181207085334.GA5167@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.