Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, stolee@gmail.com
Subject: Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
Date: Fri, 7 Dec 2018 03:53:34 -0500	[thread overview]
Message-ID: <20181207085334.GA5167@sigill.intra.peff.net> (raw)
In-Reply-To: <20181206235446.147173-1-jonathantanmy@google.com>

On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote:

> This makes sense - I thought I shouldn't mention the commit graph in the
> code since it seems like a layering violation, but I felt the need to
> mention commit graph in a comment, so maybe the need to mention commit
> graph in the code is there too. Subsequently, maybe the lookup-for-type
> could be replaced by a lookup-in-commit-graph (maybe by using
> parse_commit_in_graph() directly), which should be at least slightly
> faster.

That makes more sense to me. If we don't have a commit graph at all,
it's a quick noop. If we do, we might binary search in the list of
commits for a non-commit. But that's strictly faster than finding the
object's type (which involves a binary search of a larger list, followed
by actually accessing the type info).

> > In general, it would be nice if we had a more incremental API
> > for accessing objects: open, get metadata, then read the data. That
> > would make these kinds of optimizations "free".
> 
> Would this be assuming that to read the data, you would (1) first need to
> read the metadata, and (2) there would be no redundancy in reading the
> two? It seems to me that for loose objects, you would want to perform
> all your reads at once, since any read requires opening the file, and
> for commit graphs, you just want to read what you want, since the
> metadata and the data are in separate places.

By metadata here, I don't mean the commit-graph data, but just the
object type and size. So I'm imagining an interface more like:

  - object_open() locates the object, and stores either the pack
    file/offset or a descriptor to a loose path in an opaque handle
    struct

  - object_size() and object_type() on that handle would do what you
    expect. For loose objects, these would parse the header (the
    equivalent of unpack_sha1_header()). For packed ones, they'd use the
    object header in the pack (and chase down the delta bits as needed).

  - object_contents() would return the full content

  - object_read() could sequentially read a subset of the file (this
    could replace the streaming interface we currently have)

We have most of the low-level bits for this already, if you poke into
what object_info_extended() is doing. We just don't have them packaged
in an interface which can persist across multiple calls.

With an interface like that, parse_object()'s large-blob check could be
something like the patch below.

But your case here is a bit more interesting. If we have a commit graph,
then we can avoid opening (or even finding!) the on-disk object at all.
So I actually think it makes sense to just check the commit-graph first,
as discussed above.

---

diff --git a/object.c b/object.c
index e54160550c..afce58c0bc 100644
--- a/object.c
+++ b/object.c
@@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	const struct object_id *repl = lookup_replace_object(r, oid);
 	void *buffer;
 	struct object *obj;
+	struct object_handle oh;
 
 	obj = lookup_object(r, oid->hash);
 	if (obj && obj->parsed)
 		return obj;
 
-	if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
-	    (!obj && has_object_file(oid) &&
-	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (check_object_signature(repl, NULL, 0, NULL) < 0) {
+	if (object_open(&oh, oid) < 0)
+		return NULL; /* missing object */
+
+	if (object_type(&oh) == OBJ_BLOB) {
+		/* this will call object_read() on 4k chunks */
+		if (check_object_signature_stream(&oh, oid)) {
 			error(_("sha1 mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
+		object_close(&oh); /* we don't care about contents */
 		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
 		return lookup_object(r, oid->hash);
 	}
 
-	buffer = read_object_file(oid, &type, &size);
+	type = object_type(&oh);
+	size = object_size(&oh);
+	buffer = object_contents(&oh);
+	object_close(&oh);
+
 	if (buffer) {
 		if (check_object_signature(repl, buffer, size, type_name(type)) < 0) {
 			free(buffer);