All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, gitster@pobox.com, tboegi@web.de,
	e@80x24.org, ttaylorr@github.com, peartben@gmail.com
Subject: Re: [PATCH v5 5/5] convert: add "status=delayed" to filter process protocol
Date: Sat, 24 Jun 2017 19:22:40 +0200	[thread overview]
Message-ID: <A8FAEFDC-1ADA-4355-80DC-069A739F829C@gmail.com> (raw)
In-Reply-To: <20170624141941.usy2pyhid3jrf3ku@sigill.intra.peff.net>


> On 24 Jun 2017, at 16:19, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Jun 01, 2017 at 10:22:03AM +0200, Lars Schneider wrote:
> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob (e.g. the Git LFS smudge filter might
>> perform network requests). During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>> 
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation. After the checkout operation Git
>> calls "finish_delayed_checkout" which queries the filter for remaining
>> blobs. If the filter is still working on the completion, then the filter
>> is expected to block. If the filter has completed all remaining blobs
>> then an empty response is expected.
>> 
>> Git has a multiple code paths that checkout a blob. Support delayed
>> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
> 
> It might be worth giving a reason in this last paragraph. I think the
> reason is "because it's more complicated for the caller, as they have to
> be OK with out-of-order processing and remembering to go back and handle
> the delayed cases".

Correct! However, my real reason was that these code paths process all
files of the tree. Therefore the "out-of-order" processing can be
effective. 

How about this:

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations
for now. The optimization is most effective in these code paths as all 
files of the tree are processed.


> Speaking of which, _are_ we OK with out-of-order processing in things
> like checkout? Certainly we care about deleting items before checking
> out new ones (so getting rid of "foo/bar" to make room for "foo" and
> vice versa).  I guess it's OK as long as the delayed items are always
> filters that check out new items.

Junio noticed that too but thinks we should be OK:
"[...] We checkout removals first to make room so
that creation of a path X can succeed if an existing path X/Y
that used to want to see X as a directory can succeed [...]"

http://public-inbox.org/git/xmqqvavotych.fsf@gitster.mtv.corp.google.com/


>> ...
> 
>> +After Git received the pathnames, it will request the corresponding
>> +blobs again. These requests contain a pathname and an empty content
>> +section. The filter is expected to respond with the smudged content
>> +in the usual way as explained above.
>> +------------------------
>> +packet:          git> command=smudge
>> +packet:          git> pathname=path/testfile.dat
>> +packet:          git> 0000
>> +packet:          git> 0000  # empty content!
>> +packet:          git< status=success
>> +packet:          git< 0000
>> +packet:          git< SMUDGED_CONTENT
>> +packet:          git< 0000
>> +packet:          git< 0000  # empty list, keep "status=success" unchanged!
>> +------------------------
> 
> Out of curiosity, what should happen if we re-ask for a pathname that
> wasn't mentioned in list_available_blobs? I guess it would depend on the
> filter implementation, but probably most would just block (since we
> wouldn't have asked for can-delay). The other option is returning an
> error, I suppose, but blocking and filling the request sounds
> friendlier.

I agree that is up to the filter. I would expect the blocking strategy.
The filter cannot delay it anymore as we have not send the "can-delay"
flag.


> ...
>> diff --git a/cache.h b/cache.h
>> index ae4c45d379..523ead1d95 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1543,16 +1543,20 @@ extern int ident_cmp(const struct ident_split *, const struct ident_split *);
>> struct checkout {
>> 	struct index_state *istate;
>> 	const char *base_dir;
>> +	struct delayed_checkout *delayed_checkout;
>> 	int base_dir_len;
> 
> Should base_dir_len and base_dir be kept together?
> 
> I suspect you ordered it this way because...
> 
>> 	unsigned force:1,
>> 		 quiet:1,
>> 		 not_new:1,
>> 		 refresh_cache:1;
>> };
>> -#define CHECKOUT_INIT { NULL, "" }
>> +#define CHECKOUT_INIT { NULL, "", NULL }
>> +
> 
> ...you wanted to add the NULL to the initializer here. But it's not
> necessary. Once one element of a struct is initialized, all the
> remaining members are initialized according to the usual static rules
> (so NULL for pointers). We're already using that to zero out
> base_dir_len and the flags (the "" is necessary because we want a real
> empty string, not NULL).
> 
> Since you want NULL for your new member, you can just leave the
> initializer as-is and it will do the right thing.

That was indeed my reasoning. I am no C expert and I wasn't sure about
the initialization. The "Once one element of a struct is initialized"
thing is news to me :-) Thanks!

I'll keep the base_dir* together then!


>> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct subprocess_entry *subprocess)
>> 	if (err)
>> 		goto done;
>> 
>> -	err = packet_writel(process->in, "capability=clean", "capability=smudge", NULL);
>> +	err = packet_writel(process->in,
>> +		"capability=clean", "capability=smudge", "capability=delay", NULL);
> 
> Why do we need to tell the filter we know about delay? Shouldn't it just
> need to tell us that it knows about delay, and then we choose whether to
> ask for can-delay for particular entries?

Because in the protocol I defined that the filter needs to answer with
a strict subset of this list [1]. I thought that this would make the protocol
more future proof/backward compatible. Because the filter is not allowed to
answer with something that Git does not understand.

[1] https://github.com/git/git/blob/5402b1352f5181247405fbff1887008a0cb3b04a/Documentation/gitattributes.txt#L408-L411


> 
>> @@ -647,6 +653,14 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
>> 	if (err)
>> 		goto done;
>> 
>> +	if (CAP_DELAY & entry->supported_capabilities &&
>> +	    dco && dco->state == CE_CAN_DELAY) {
> 
> In a complicated conditional with bit operations, we usually put the bit
> operation in its own parentheses so it's more obvious that it wasn't
> supposed to be "&&". Like:
> 
>  if ((CAP_DELAY & entry->supported_capabilities) &&
>      dco && dco->state == CE_CAN_DELAY))

Agreed!


> The operator precedence is such that it works without them, so this is
> just a style question (I'd also usually put the flags field before the
> flag itself, but that's really getting into aesthetics).

You mean (entry & CAP_DELAY) instead of (CAP_DELAY & entry)?


> The dco check here is basically asking if can-delay is enabled for the
> whole checkout. Would we ever need to mark a specific entry as
> not-delayable? Probably not from the start, but I wondered if something
> might get marked as a failure and need to be retried without delay or
> similar. It's probably not worth going down that road until some caller
> needs it.

Agreed.

> 
>> @@ -662,14 +676,78 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
>> 	if (err)
>> 		goto done;
>> 
>> -	err = strcmp(filter_status.buf, "success");
>> +	if (can_delay && !strcmp(filter_status.buf, "delayed")) {
>> +		dco->state = CE_DELAYED;
>> +		string_list_insert(&dco->filters, cmd);
>> +		string_list_insert(&dco->paths, path);
> 
> These string_list_insert()s have bad worst-case performance, because
> they keep the list sorted. So the worst case is reverse-sorted input,
> where each entry goes at the front of the existing array and we have to
> memmove() the whole array on each insert, and we do O(n^2) byte copies.
> 
> The best case is when the input is already sorted. And I think in
> general that's how we'll iterate over paths to checkout. The cmds are
> more likely to be random, but we're also not likely to have more than a
> handful of them. So this is probably an OK data structure to use, at
> least for now.

Agreed! Up until now I tested the delayed technique only with my small test 
case. It will become really interesting when GitLFS implements this feature.
Then I can test it with our repos that have +20k filtered files. At this point
I will be able to identify the spots that require more optimization.


>> +	for (;;) {
>> +		const char* pre = "pathname=";
>> +		const int pre_len = strlen(pre);
>> +		line = packet_read_line(process->out, NULL);
>> +		if (!line)
>> +			break;
>> +		err = strlen(line) <= pre_len || strncmp(line, pre, pre_len);
>> +		if (err)
>> +			goto done;
>> +		string_list_insert(delayed_paths, xstrdup(line+pre_len));
>> +	}
> 
> I think the prefix matching can be done more simply (and slightly more
> efficiently) with:
> 
>  const char *path;
>  ...
>  err = !skip_prefix(line, "pathname=", &path);
>  if (err)
> 	goto done;
>  string_list_insert(delayed_paths, xstrdup(path));
> 
> I notice we loop here, though. Would we want to be resilient to seeing
> other keys besides pathname? Something like:
> 
>  while ((line = packet_read_line(process->out, NULL)) {
> 	const char *path;
> 	if (skip_prefix(line, "pathname=", &path))
> 		string_list_insert(delayed_paths, xstrdup(path);
> 	else
> 		; /* ignore unknown keys */
>  }
> 
> I can imagine some future where the filter passes back additional
> information about the delayed item. It would be nice if older versions
> of Git would just quietly ignore that if they don't know how to handle
> it.

I agree with all the above. skip_prefix is neat! Although it's so
dominant in the source code I never noticed it.


>> +int finish_delayed_checkout(struct checkout *state)
>> +{
>> +	int errs = 0;
>> +	struct string_list_item *filter, *path;
>> +	struct delayed_checkout *dco = state->delayed_checkout;
>> +
>> +	if (!state->delayed_checkout) {
>> +		return errs;
>> +	}
> 
> Style: we'd usually omit the braces here.

Sure! I'll remove it.

(This is the only rule of the Git style guide that
don't agree with :-) 

> 
>> +			if (available_paths.nr <= 0) {
>> +				/* Filter responded with no entries. That means
>> +				   the filter is done and we can remove the
>> +				   filter from the list (see
>> +				   "string_list_remove_empty_items" call below).
>> +				*/
> 
> This is true because we know the filter would block until it had at
> least one item to return.
> 
> It does seem to put a lot of faith in the filter. What happens if we hit
> this conditional and we know that we still have entries in dco->paths to
> be handled?

We check this later as you noticed below.


>> +			/* In dco->paths we store a list of all delayed paths.
>> +			   The filter just send us a list of available paths.
>> +			   Remove them from the list.
>> +			*/
>> +			filter_string_list(&dco->paths, 0,
>> +				&remove_available_paths, &available_paths);
> 
> This is O(m log n), which is fine. Since both lists are sorted, we could
> technically do it in an O(m+n) merge, but I don't think we have a
> function to do that easily.
> 
> Sorry to harp on run-times, but it's one place I think it would be easy
> to subtly screw this up, so I'm trying to pay close attention.

No worries at all! Thanks for thinking it through on that level.
As I wrote above - I plan to really profile it under "real life conditions"
as soon as we have full GitLFS support.


>> ...
> 
>> +	/* At this point we should not have any delayed paths anymore. */
>> +	errs |= dco->paths.nr;
> 
> And I guess this answers my question above about a filter with no
> entries. When we bail from the loop, we'll hit this point and complain.
> 
> We probably should call string_list_clear(&dco->paths) here to avoid
> leaking the paths on error (possibly we should also print them? Or maybe
> they'd be warned about elsewhere).

Agreed, I will clear the list. I'll also print the items because as far
as I can see they are not printed elsewhere.

Git just says:
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

How about this?

	errs |= dco->paths.nr;
	for_each_string_list_item(path, &dco->paths) {
		warning("%s was not processed properly.", path->string);
	}
	string_list_clear(&dco->paths, 0);

The output would be:

warning: test-delay10.a was not processed properly.
warning: test-delay10.b was not processed properly.
warning: test-delay11.a was not processed properly.
warning: test-delay20.a was not processed properly.
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

I contemplated about the warning text.
"$FILE was not filtered properly." is technical more
correct but maybe it would confuse the user?

Plus, at this point I don't really know what filter should
have given me the files. That's why I can't print it in 
the warning message.

> 
>> @@ -179,11 +254,36 @@ static int write_entry(struct cache_entry *ce,
>> 		/*
>> 		 * Convert from git internal format to working tree format
>> 		 */
>> -		if (ce_mode_s_ifmt == S_IFREG &&
>> -		    convert_to_working_tree(ce->name, new, size, &buf)) {
>> -			free(new);
>> -			new = strbuf_detach(&buf, &newsize);
>> -			size = newsize;
>> +		if (ce_mode_s_ifmt == S_IFREG) {
>> +			struct delayed_checkout *dco = state->delayed_checkout;
>> +			if (dco && dco->state != CE_NO_DELAY) {
>> +				/* Do not send the blob in case of a retry. */
>> +				if (dco->state == CE_RETRY) {
>> +					new = NULL;
>> +					size = 0;
>> +				}
>> +				ret = async_convert_to_working_tree(
>> +					ce->name, new, size, &buf, dco);
>> +				if (ret && dco->state == CE_DELAYED) {
>> +					free(new);
>> +					/* Reset the state of the next blob */
>> +					dco->state = CE_CAN_DELAY;
>> +					goto finish;
>> +				}
> 
> Hmm. This "reset the state" bit at the end surprised me. I guess it's
> not wrong, but it goes against the mental model I had formed above. ;)
> 
> We really are using dco->state as a per-entry state flag. It just
> happens to be in a persistent shared struct. I don't think it's wrong,
> it was mostly just surprising. I don't know if it's worth trying to
> simplify, but I think you could do it by:
> 
>  1. Passing back the "was delayed" state from async_convert... in the
>     return value or via a separate out-parameter.

In the beginning I had it implemented that way. But that meant that I
had to pass two variables through the entire convert stack:

async_convert_to_working_tree
-> convert_to_working_tree_internal
--> apply_filter
---> apply_multi_file_filter

> 
>  2. Setting dco->state to CE_RETRY at the top of finish_delayed... so
>     that it's clear that it's about what phase of the conversation
>     we're in.

I could do that. However, I thought it is safer to set the state *before*
every checkout operation in case convert.c messes with this field (it
should not in this phase).


> But I'm OK with it as-is, too.

I'll try 2.


>> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
>> index 5e43faeec1..f0dc0aff4a 100644
>> --- a/t/t0021/rot13-filter.pl
>> +++ b/t/t0021/rot13-filter.pl
> 
> I'll admit to skimming the perl changes here, but I didn't see anything
> obviously wrong.


Thanks a lot for the review,
Lars


  reply	other threads:[~2017-06-24 17:22 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-01  8:21 [PATCH v5 0/5] convert: add "status=delayed" to filter process protocol Lars Schneider
2017-06-01  8:21 ` [PATCH v5 1/5] t0021: keep filter log files on comparison Lars Schneider
2017-06-01  8:22 ` [PATCH v5 2/5] t0021: make debug log file name configurable Lars Schneider
2017-06-01  8:22 ` [PATCH v5 3/5] t0021: write "OUT" only on success Lars Schneider
2017-06-01  8:22 ` [PATCH v5 4/5] convert: move multiple file filter error handling to separate function Lars Schneider
2017-06-18  7:20   ` Torsten Bögershausen
2017-06-18 11:47     ` Lars Schneider
2017-06-19 17:18       ` Torsten Bögershausen
2017-06-19 17:47         ` Lars Schneider
2017-06-01  8:22 ` [PATCH v5 5/5] convert: add "status=delayed" to filter process protocol Lars Schneider
2017-06-02  2:21   ` Junio C Hamano
2017-06-05 11:36     ` Lars Schneider
2017-06-24 14:19   ` Jeff King
2017-06-24 17:22     ` Lars Schneider [this message]
2017-06-24 18:51       ` Junio C Hamano
2017-06-24 20:36         ` Jeff King
2017-06-24 20:32       ` Jeff King
2017-06-01  9:44 ` [PATCH v5 0/5] " Junio C Hamano
2017-06-02  2:06 ` Junio C Hamano
2017-06-24 14:23 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=A8FAEFDC-1ADA-4355-80DC-069A739F829C@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peartben@gmail.com \
    --cc=peff@peff.net \
    --cc=tboegi@web.de \
    --cc=ttaylorr@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.