From: ZheNing Hu <adlternative@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git List <git@vger.kernel.org>
Subject: Re: [PATCH 1/2] ref-filter: hacky "streaming" mode
Date: Sun, 5 Sep 2021 16:20:02 +0800 [thread overview]
Message-ID: <CAOLTT8Tka0nxHb3G9yb-fs8ue7RaPCUVSKi5PM+GY+rMjFRnog@mail.gmail.com> (raw)
In-Reply-To: <YTNpeH+jO0zQgAVT@coredump.intra.peff.net>
Jeff King <peff@peff.net> 于2021年9月4日周六 下午8:41写道:
>
> The ref-filter code is very keen to collect all of the refs in an array
> before writing any output. This makes things slower than necessary when
> using the default sort order (which is already sorted by refname when we
> call for_each_ref()), and when no filtering options require it.
>
> This commit adds a mildly-ugly interface to detect this case and stream
> directly from filter_refs(). The caller just needs to call the
> "maybe_stream" function. Either way, they can call the usual sort/print
> functions; they'll just be noops if we did stream instead of collecting.
>
> Here are some timings on a fully-packed 500k-ref repo:
>
> Benchmark #1: ./git.orig for-each-ref --format='%(objectname) %(refname)'
> Time (mean ± σ): 340.2 ms ± 5.3 ms [User: 300.5 ms, System: 39.6 ms]
> Range (min … max): 332.9 ms … 347.0 ms 10 runs
>
> Benchmark #2: ./git.stream for-each-ref --format='%(objectname) %(refname)'
> Time (mean ± σ): 224.0 ms ± 3.4 ms [User: 222.7 ms, System: 1.3 ms]
> Range (min … max): 218.1 ms … 228.5 ms 13 runs
>
> Summary
> './git.stream for-each-ref --format='%(objectname) %(refname)'' ran
> 1.52 ± 0.03 times faster than './git.orig for-each-ref --format='%(objectname) %(refname)''
>
> So we definitely shave off some time, though we're still _much_ slower
> than a simple `wc -l <packed-refs` (which is around 10ms, though of
> course it isn't actually doing as much work).
>
> The point here is to reduce overall effort, though of course the time to
> first output is much improved in the streaming version, too (about 250ms
> versus 4ms).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> builtin/for-each-ref.c | 7 ++++++
> ref-filter.c | 50 ++++++++++++++++++++++++++++++++++--------
> ref-filter.h | 8 +++++++
> 3 files changed, 56 insertions(+), 9 deletions(-)
>
> diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
> index 89cb6307d4..fe0b92443f 100644
> --- a/builtin/for-each-ref.c
> +++ b/builtin/for-each-ref.c
> @@ -70,6 +70,13 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
> if (verify_ref_format(&format))
> usage_with_options(for_each_ref_usage, opts);
>
> + /*
> + * try streaming, but only without maxcount; in theory the ref-filter
> + * code could learn about maxcount, but for now it is our own thing
> + */
> + if (!maxcount)
> + ref_filter_maybe_stream(&filter, sorting, icase, &format);
> +
Yes, I think this maxcount is easy to support.
> if (!sorting)
> sorting = ref_default_sorting();
> ref_sorting_set_sort_flags_all(sorting, REF_SORTING_ICASE, icase);
> diff --git a/ref-filter.c b/ref-filter.c
> index 93ce2a6ef2..17b78b1d30 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2283,15 +2283,19 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid,
> return 0;
> }
>
> - /*
> - * We do not open the object yet; sort may only need refname
> - * to do its job and the resulting list may yet to be pruned
> - * by maxcount logic.
> - */
> - ref = ref_array_push(ref_cbdata->array, refname, oid);
> - ref->commit = commit;
> - ref->flag = flag;
> - ref->kind = kind;
> + if (ref_cbdata->filter->streaming_format) {
> + pretty_print_ref(refname, oid, ref_cbdata->filter->streaming_format);
So we directly use pretty_print_ref() in streaming mode, OK.
> + } else {
> + /*
> + * We do not open the object yet; sort may only need refname
> + * to do its job and the resulting list may yet to be pruned
> + * by maxcount logic.
> + */
> + ref = ref_array_push(ref_cbdata->array, refname, oid);
> + ref->commit = commit;
> + ref->flag = flag;
> + ref->kind = kind;
> + }
>
> return 0;
> }
Therefore, in streaming mode, there is no need to push ref to
ref_array, which can
reduce the overhead of malloc(), free(), which makes sense.
But here is a terrible fact: we did not use ref_array_sort() for sorting here.
So in fact, for_each_fullref_in() does the sorting work () for us by
default sort (%(refname)),
This may be due to the file system or some implementation of ref_iterator.
But this limit the application of this optimization when we use other
atoms to sort.
> @@ -2563,6 +2567,34 @@ void ref_array_sort(struct ref_sorting *sorting, struct ref_array *array)
> QSORT_S(array->items, array->nr, compare_refs, sorting);
> }
>
> +void ref_filter_maybe_stream(struct ref_filter *filter,
> + const struct ref_sorting *sort, int icase,
> + struct ref_format *format)
> +{
> + /* streaming only works with default for_each_ref sort order */
> + if (sort || icase)
> + return;
> +
Yes, this really can only be optimized on the default sort.
> + /* these filters want to see all candidates up front */
> + if (filter->reachable_from || filter->unreachable_from)
> + return;
> +
Make Sence.
> + /*
> + * the %(symref) placeholder is broken with pretty_print_ref(),
> + * which our streaming code uses. I suspect this is a sign of breakage
> + * in other callers like verify_tag(), which should be fixed. But for
> + * now just disable streaming.
> + *
> + * Note that this implies we've parsed the format already with
> + * verify_ref_format().
> + */
> + if (need_symref)
> + return;
> +
I haven’t taken a closer look at why pretty_print_ref() does not
support %(symref),
we can skip it first.
> + /* OK to stream */
> + filter->streaming_format = format;
> +}
> +
> static void append_literal(const char *cp, const char *ep, struct ref_formatting_state *state)
> {
> struct strbuf *s = &state->stack->output;
> diff --git a/ref-filter.h b/ref-filter.h
> index c15dee8d6b..ecea1837a2 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -69,6 +69,9 @@ struct ref_filter {
> lines;
> int abbrev,
> verbose;
> +
> + /* if non-NULL, streaming output during filter_refs() is enabled */
> + struct ref_format *streaming_format;
> };
>
> struct ref_format {
> @@ -135,6 +138,11 @@ char *get_head_description(void);
> /* Set up translated strings in the output. */
> void setup_ref_filter_porcelain_msg(void);
>
> +/* enable streaming during filter_refs() if options allow it */
> +void ref_filter_maybe_stream(struct ref_filter *filter,
> + const struct ref_sorting *sort, int icase,
> + struct ref_format *format);
> +
> /*
> * Print a single ref, outside of any ref-filter. Note that the
> * name must be a fully qualified refname.
> --
> 2.33.0.618.g5b11852304
>
Unfortunately, this optimization may not be helpful for git cat-file --batch,
see [1], batch_object_write() directly constructs a ref_array_item and call
format_ref_array_item() to grab data, It does not use ref_array. So it also
does not have this malloc() | free() overhead.
[1]: https://lore.kernel.org/git/9c5fddf6885875ccd3ce3f047bb938c77d9bbca2.1628842990.git.gitgitgadget@gmail.com/
--
ZheNing Hu
next prev parent reply other threads:[~2021-09-05 8:20 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-04 12:40 [hacky PATCH 0/2] speeding up trivial for-each-ref invocations Jeff King
2021-09-04 12:41 ` [PATCH 1/2] ref-filter: hacky "streaming" mode Jeff King
2021-09-05 8:20 ` ZheNing Hu [this message]
2021-09-05 13:04 ` Jeff King
2021-09-07 5:28 ` ZheNing Hu
2021-09-07 18:01 ` Jeff King
2021-09-09 14:45 ` ZheNing Hu
2021-09-10 14:26 ` Jeff King
2021-09-15 12:27 ` ZheNing Hu
2021-09-15 14:23 ` ZheNing Hu
2021-09-16 21:45 ` Jeff King
2021-09-20 7:42 ` ZheNing Hu
2021-09-16 21:31 ` Jeff King
2021-09-05 13:15 ` Jeff King
2021-09-07 5:42 ` ZheNing Hu
2021-09-04 12:42 ` [PATCH 2/2] ref-filter: implement "quick" formats Jeff King
2021-09-05 8:20 ` ZheNing Hu
2021-09-05 13:07 ` Jeff King
2021-09-06 13:34 ` ZheNing Hu
2021-09-07 20:06 ` Junio C Hamano
2021-09-05 8:19 ` [hacky PATCH 0/2] speeding up trivial for-each-ref invocations ZheNing Hu
2021-09-05 12:49 ` Jeff King
2021-09-06 13:30 ` ZheNing Hu
2021-09-07 17:28 ` Jeff King
2021-09-09 13:20 ` ZheNing Hu
2021-09-06 6:54 ` Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAOLTT8Tka0nxHb3G9yb-fs8ue7RaPCUVSKi5PM+GY+rMjFRnog@mail.gmail.com \
--to=adlternative@gmail.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).