All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: git@vger.kernel.org, gitster@pobox.com, peff@peff.net,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list
Date: Thu, 2 Nov 2017 12:32:45 -0700	[thread overview]
Message-ID: <20171102123245.0f768968703ec4e35d3d1f81@google.com> (raw)
In-Reply-To: <20171102175013.3371-5-git@jeffhostetler.com>

On Thu,  2 Nov 2017 17:50:11 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> +int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
> +			      const char *arg)

Returning void is fine, I think. It seems that all your code paths
either return 0 or die.

> +{
> +	struct object_context oc;
> +	struct object_id sparse_oid;
> +	const char *v0;
> +	const char *v1;
> +
> +	if (filter_options->choice)
> +		die(_("multiple object filter types cannot be combined"));
> +
> +	/*
> +	 * TODO consider rejecting 'arg' if it contains any
> +	 * TODO injection characters (since we might send this
> +	 * TODO to a sub-command or to the server and we don't
> +	 * TODO want to deal with legacy quoting/escaping for
> +	 * TODO a new feature).
> +	 */
> +
> +	filter_options->raw_value = strdup(arg);
> +
> +	if (skip_prefix(arg, "blob:", &v0) || skip_prefix(arg, "blobs:", &v0)) {
> +		if (!strcmp(v0, "none")) {
> +			filter_options->choice = LOFC_BLOB_NONE;
> +			return 0;
> +		}
> +
> +		if (skip_prefix(v0, "limit=", &v1) &&
> +		    git_parse_ulong(v1, &filter_options->blob_limit_value)) {
> +			filter_options->choice = LOFC_BLOB_LIMIT;
> +			return 0;
> +		}
> +	}
> +	else if (skip_prefix(arg, "sparse:", &v0)) {

Style: join the 2 lines above.

> +		if (skip_prefix(v0, "oid=", &v1)) {
> +			filter_options->choice = LOFC_SPARSE_OID;
> +			if (!get_oid_with_context(v1, GET_OID_BLOB,
> +						  &sparse_oid, &oc)) {
> +				/*
> +				 * We successfully converted the <oid-expr>
> +				 * into an actual OID.  Rewrite the raw_value
> +				 * in canonoical form with just the OID.
> +				 * (If we send this request to the server, we
> +				 * want an absolute expression rather than a
> +				 * local-ref-relative expression.)
> +				 */

I think this would lead to confusing behavior - for example, a fetch
with "--filter=oid=mybranch:sparseconfig" would have different results
depending on whether "mybranch" refers to a valid object locally.

The way I see it, this should either (i) only accept full 40-character
OIDs or (ii) retain the raw string to be interpreted only when the
filtering is done. (i) is simpler and safer, but is not so useful. In
both cases, if the user really wants client-side interpretation, they
can still use "$(git rev-parse foo)" to make it explicit.

> +				free((char *)filter_options->raw_value);
> +				filter_options->raw_value =
> +					xstrfmt("sparse:oid=%s",
> +						oid_to_hex(&sparse_oid));
> +				filter_options->sparse_oid_value =
> +					oiddup(&sparse_oid);
> +			} else {
> +				/*
> +				 * We could not turn the <oid-expr> into an
> +				 * OID.  Leave the raw_value as is in case
> +				 * the server can parse it.  (It may refer to
> +				 * a branch, commit, or blob we don't have.)
> +				 */
> +			}
> +			return 0;
> +		}
> +
> +		if (skip_prefix(v0, "path=", &v1)) {
> +			filter_options->choice = LOFC_SPARSE_PATH;
> +			filter_options->sparse_path_value = strdup(v1);
> +			return 0;
> +		}
> +	}
> +
> +	die(_("invalid filter expression '%s'"), arg);
> +	return 0;
> +}

[snip]

> +void arg_format_list_objects_filter(
> +	struct argv_array *argv_array,
> +	const struct list_objects_filter_options *filter_options)

Is this function used anywhere (in this patch or subsequent patches)?

> diff --git a/list-objects-filter.c b/list-objects-filter.c
> +/* See object.h and revision.h */
> +#define FILTER_REVISIT (1<<25)

Looking later in the code, this flag indicates that a tree has been
SHOWN, so it might be better to just call this FILTER_SHOWN.

Also document this flag in object.h in the table above FLAG_BITS.

> +static enum list_objects_filter_result filter_blobs_limit(
> +	enum list_objects_filter_type filter_type,
> +	struct object *obj,
> +	const char *pathname,
> +	const char *filename,
> +	void *filter_data_)
> +{
> +	struct filter_blobs_limit_data *filter_data = filter_data_;
> +	unsigned long object_length;
> +	enum object_type t;
> +	int is_special_filename;
> +
> +	switch (filter_type) {
> +	default:
> +		die("unkown filter_type");

Spelling of "unknown" (also elsewhere in this file).

> +		return LOFR_ZERO;
> +
> +	case LOFT_BEGIN_TREE:
> +		assert(obj->type == OBJ_TREE);
> +		/* always include all tree objects */
> +		return LOFR_MARK_SEEN | LOFR_SHOW;
> +
> +	case LOFT_END_TREE:
> +		assert(obj->type == OBJ_TREE);
> +		return LOFR_ZERO;
> +
> +	case LOFT_BLOB:
> +		assert(obj->type == OBJ_BLOB);
> +		assert((obj->flags & SEEN) == 0);
> +
> +		is_special_filename = ((strncmp(filename, ".git", 4) == 0) &&
> +				       filename[4]);

You can just use starts_with(). Also, the filename[4] check is probably
unnecessary.

> +		if (is_special_filename) {
> +			/*
> +			 * Alwayse include ".git*" special files (regardless
> +			 * of size).
> +			 *
> +			 * (This may cause us to include blobs that we do
> +			 * not have locally because we are only looking at
> +			 * the filename and don't actually have to read
> +			 * them.)
> +			 */
> +			goto include_it;
> +		}
> +
> +		t = sha1_object_info(obj->oid.hash, &object_length);
> +		if (t != OBJ_BLOB) { /* probably OBJ_NONE */
> +			/*
> +			 * We DO NOT have the blob locally, so we cannot
> +			 * apply the size filter criteria.  Be conservative
> +			 * and force show it (and let the caller deal with
> +			 * the ambiguity).  (This matches the behavior above
> +			 * when the special filename matches.)
> +			 */
> +			goto include_it;
> +		}
> +
> +		if (object_length < filter_data->max_bytes)
> +			goto include_it;
> +
> +		/*
> +		 * Provisionally omit it.  We've already established
> +		 * that this blob is too big and doesn't have a special
> +		 * filename, so we *WANT* to omit it.  However, there
> +		 * may be a special file elsewhere in the tree that
> +		 * references this same blob, so we cannot reject it
> +		 * just yet.  Leave the LOFR_ bits unset so that *IF*
> +		 * the blob appears again in the traversal, we will
> +		 * be asked again.
> +		 *
> +		 * If we are keeping a list of the ommitted objects,
> +		 * provisionally add it to the list.
> +		 */
> +
> +		if (filter_data->omits)
> +			oidset_insert(filter_data->omits, &obj->oid);
> +		return LOFR_ZERO;
> +	}
> +
> +include_it:
> +	if (filter_data->omits)
> +		oidset_remove(filter_data->omits, &obj->oid);
> +	return LOFR_MARK_SEEN | LOFR_SHOW;
> +}

[snip]

> +struct frame {
> +	int defval;

Document this variable?

> +	int child_prov_omit : 1;

I think it's clearer if we use "unsigned" here. Also, document this
(e.g. "1 if any descendant of this tree object was provisionally
omitted").

> +/*
> + * During list-object traversal we allow certain objects to be
> + * filtered (omitted) from the result.  The active filter uses
> + * these result values to guide list-objects.
> + *
> + * _ZERO      : Do nothing with the object at this time.  It may
> + *              be revisited if it appears in another place in
> + *              the tree or in another commit during the overall
> + *              traversal.
> + *
> + * _MARK_SEEN : Mark this object as "SEEN" in the object flags.
> + *              This will prevent it from being revisited during
> + *              the remainder of the traversal.  This DOES NOT
> + *              imply that it will be included in the results.
> + *
> + * _SHOW      : Show this object in the results (call show() on it).
> + *              In general, objects should only be shown once, but
> + *              this result DOES NOT imply that we mark it SEEN.
> + *
> + * Most of the time, you want the combination (_MARK_SEEN | _SHOW)
> + * but they can be used independently, such as when sparse-checkout
> + * pattern matching is being applied.
> + *
> + * A _MARK_SEEN without _SHOW can be called a hard-omit -- the
> + * object is not shown and will never be reconsidered (unless a
> + * previous iteration has already shown it).
> + *
> + * A _ZERO is can be called a provisional-omit -- the object is
> + * not shown, but *may* be revisited (if the object appears again
> + * in the traversal).  Therefore, it will be omitted from the
> + * results *unless* a later iteration causes it to be shown.
> + */
> +enum list_objects_filter_result {
> +	LOFR_ZERO      = 0,
> +	LOFR_MARK_SEEN = 1<<0,
> +	LOFR_SHOW      = 1<<1,
> +};

Thanks - this looks like a good explanation to me.

> +enum list_objects_filter_type {
> +	LOFT_BEGIN_TREE,
> +	LOFT_END_TREE,
> +	LOFT_BLOB
> +};

Optional: probably a better name would be list_objects_filter_situation.

> +void traverse_commit_list_filtered(
> +	struct list_objects_filter_options *filter_options,
> +	struct rev_info *revs,
> +	show_commit_fn show_commit,
> +	show_object_fn show_object,
> +	void *show_data,
> +	struct oidset *omitted)
> +{
> +	filter_object_fn filter_fn = NULL;
> +	filter_free_fn filter_free_fn = NULL;
> +	void *filter_data = NULL;
> +
> +	filter_data = list_objects_filter__init(omitted, filter_options,
> +						&filter_fn, &filter_free_fn);
> +	do_traverse(revs, show_commit, show_object, show_data,
> +		    filter_fn, filter_data);
> +	if (filter_data && filter_free_fn)
> +		filter_free_fn(filter_data);
> +}

This function traverse_commit_list_filtered() is in list-objects.c but
in list-objects-filter.h, if I'm reading the diff correctly?

Overall, this looks like a good change. Object traversal was upgraded
with the behaviors of MARK_SEEN and SHOW independently controllable and
with the ability to do things post-tree (in addition to pre-tree and
blob), and this was used to support a few types of filtering, which
subsequent patches will allow the user to invoke through "--filter=".

  reply	other threads:[~2017-11-02 19:32 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-02 17:50 [PATCH v2 0/6] Partial clone part 1: object filtering Jeff Hostetler
2017-11-02 17:50 ` [PATCH v2 1/6] dir: allow exclusions from blob in addition to file Jeff Hostetler
2017-11-02 17:50 ` [PATCH v2 2/6] oidmap: add oidmap iterator methods Jeff Hostetler
2017-11-02 17:50 ` [PATCH v2 3/6] oidset: add iterator methods to oidset Jeff Hostetler
2017-11-02 17:50 ` [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list Jeff Hostetler
2017-11-02 19:32   ` Jonathan Tan [this message]
2017-11-03 11:54     ` Johannes Schindelin
2017-11-03 13:37       ` Jeff Hostetler
2017-11-07 18:54     ` Jeff Hostetler
2017-11-06 17:51   ` Jeff Hostetler
2017-11-06 18:08     ` Jonathan Tan
2017-11-02 17:50 ` [PATCH v2 5/6] rev-list: add list-objects filtering support Jeff Hostetler
2017-11-02 17:50 ` [PATCH v2 6/6] pack-objects: add list-objects filtering Jeff Hostetler
2017-11-02 19:44 ` [PATCH v2 0/6] Partial clone part 1: object filtering Jonathan Tan
2017-11-03 13:43   ` Jeff Hostetler
2017-11-03 15:05     ` Junio C Hamano
2017-11-03 18:34       ` Jeff Hostetler
2017-11-08  0:41         ` Jonathan Tan
2017-11-08  0:54           ` Junio C Hamano
2017-11-08 14:39             ` Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171102123245.0f768968703ec4e35d3d1f81@google.com \
    --to=jonathantanmy@google.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.