From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jeff King" <peff@peff.net>,
"Andrzej Hunt" <ajrhunt@google.com>,
"Martin Ågren" <martin.agren@gmail.com>
Subject: Re: Whether to keep using UNLEAK() in built-ins
Date: Fri, 18 Feb 2022 20:31:23 +0100 [thread overview]
Message-ID: <220218.86fsoggdou.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <xmqqwnhs11he.fsf@gitster.g>
On Fri, Feb 18 2022, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>>> Specifically, the cited commit turned an existing strbuf_release()
>>>> on &err into UNLEAK(). If that and the other strbuf (sb) were so
>>>> easily releasable, why didn't we do so back then already?
>>>
>>> I suspect that the answer to the above question is because these
>>> allocations are in the top-level cmd_commit() function, which is
>>> never called recursively or repeatedly as a subroutine. The only
>>> significant thing that happens after we return from it is to exit.
>>>
>>> In such a code path, marking a variable as UNLEAK() is a better
>>> thing to do than calling strbuf_release(). Both will work as a way
>>> to squelch sanitizers from reporting a leak that does not matter,
>>> but calling strbuf_release() means we'd spend extra cycles to return
>>> pieces of memory to the pool, even though we know that the pool
>>> itself will be cleaned immediately later at exit.
>>>
>>> We already have UNLEAK to tell sanitizers not to distract us from
>>> spotting and plugging real leaks by reporting these apparent leaks
>>> that do not matter. It is of somewhat dubious value to do a "we
>>> care too much about pleasing sanitizer and spend extra cycles at
>>> runtime while real users are doing real work" change.
>
>> Per https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/ the
>> real goal I have in mind here is to use the built-ins as a stand-in for
>> testing whether the underlying APIs are leak-free.
>>
>> Because of that having to reason about UNLEAK() doing the right thing or
>> not is just unneeded distraction. Yes for a "struct strbuf" it won't
>> matter, but most of what we UNLEAK() is more complex stuff like "struct
>> rev_info". We won't really make headway making revision.c not leak
>> memory without using "git log" et al as the test subjects for whether
>> that API leaks.
>
> I have to say that you have a wrong goal and wrong priority. The
> number of UNLEAK in cmd_foo() functions is not even a poor
> approximation of our progress.
To clarify I'm not saying it's an approximation of how close we are to
getting rid of memory leaks. I think I'll know better than anyone what
the current state of shoveling shit uphill that is.
I was suggesting that these's a trivial number of these, and mostly we
strbuf_release() already, and that on modern libc's free() is pretty
much free anyway, so worrying about using UNLEAK() v.s. just freeing
before exit is some combination of a premature optimization and
needlessly retaining a special-case for the built-ins.
> Imagine that a subsystem that are repeatedly set-up and used during
> a single program invocation was leaky. Let's say a program calls
> diff_setup() to prepare the diff_options structure, feeds two things
> to compare, and calls diff_flush() to show the comparison result,
> but let's assume this sequence leaks some resources.
>
> Now cmd_diff() may be such a program that does a setup, feeds two
> trees, calls diff_flush() and exits. If we didn't do anything to
> it, diff_options may "leak". Marking it with UNLEAK may be a good
> measure, if we want to keep the leak checker from reporting a leak
> that does not matter in practice so that we can concentrate on
> plugging real leaks that matter.
>
> But consider cmd_log(), running something like "git log -p". It
> iterates over commits, does the <setup, compare two trees, flush>
> repeatedly for each commit it encounters during the walk on the same
> diff_options structure. Now, the leak in the code path does matter.
> If diff_flush() is left leaky, it will show up in the leak checker's
> output, and that is reporting real leaks that matter. The thing is,
> the <setup, compare, flush> sequence whose leak matters is not in
> cmd_log(); it is much deeper in the revision walking machinery. We
> do not want to paper over such leaks with UNLEAK.
>
> See the difference? The number of UNLEAK OUTside built-ins does
> matter. We do not want to have UNLEAK there to hide leaks in
> possible callees that may be called arbitrary number of times in
> arbitrary order. Compared to that, UNLEAK in cmd_foo() to mark an
> on-stack variable that was used only once without even a recursion
> is purely to squelch the leak checker from reporting leaks that does
> not matter. For simple things like strbuf on stack of the top-level
> cmd_foo() functions, we could even leave strbuf_release() not called
> and leave the resource reclamation to _exit(2). That would cause
> the leak checkers to report them and distract us, and UNLEAK is a
> better way to squelch the distraction without wasting extra cycles
> at runtime to actually free them.
>
> So, don't look at "built-ins as a stand-in". It is a wrong
> direction to go in to let the "leak checker" tail wag the dog.
Yes, I see the difference and I fully accept the point you're making.
I just don't find it to be a useful distinction in practice, because for
e.g. "git log" we do have certain uses of the API which are only
performed by the top-level part of a "git log --whatever".
So if we take the view that e.g. the "struct rev_info rev" should be
UNLEAK()'d because "we're about to exit anyway" then as soon as we *do*
use that part of the API we'll run into the previously unfixed and
hidden leak.
Which is why as noted in the linked-to-above
https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/ I think
it's useful to fix leaks even in built-ins, because they're useful
canaries for those APIs.
Now, in this case does it really matter? No, it doesn't currently. But I
think the direction we should be heading in is turning more of these
cmd_whatever() into properly functioning libraries, as e.g. John Cai's
just-sent series to do that for "reflog delete" does:
https://lore.kernel.org/git/pull.1218.git.git.1645209647.gitgitgadget@gmail.com/
Once we do that the UNLEAK() here will absolutely need to be changed to
a strbuf_release() or equivalent, since if you make two commits with
that new library we'll have leaked memory.
So I think it makes sense to just fix leaks everywhere if it's easy
without worrying about the distinction, particularly since I haven't
seen that "wasting cycles" concern matter in practice.
If it did I'd think adding a "GIT_DESTRUCT_LEVEL" as I suggested in the
linked E-Mail would make more sense, since e.g. you could also use that
if you knew you were walking 100 commits & exiting, which can be useful
to further reduce free() churn, and is how a similar global in the Perl
API (PERL_DESTRUCT_LEVEL) works.
next prev parent reply other threads:[~2022-02-18 19:47 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-16 8:21 [PATCH 0/2] commit: trivial leak fix, add 2 tests to linux-leaks CI Ævar Arnfjörð Bjarmason
2022-02-16 8:21 ` [PATCH 1/2] commit: fix "author_ident" leak Ævar Arnfjörð Bjarmason
2022-02-16 17:59 ` Junio C Hamano
2022-02-16 8:21 ` [PATCH 2/2] commit: use strbuf_release() instead of UNLEAK() Ævar Arnfjörð Bjarmason
2022-02-16 18:03 ` Junio C Hamano
2022-02-16 18:30 ` Junio C Hamano
2022-02-18 12:35 ` Whether to keep using UNLEAK() in built-ins (was: [PATCH 2/2] commit: use strbuf_release() instead of UNLEAK()) Ævar Arnfjörð Bjarmason
2022-02-18 18:19 ` Whether to keep using UNLEAK() in built-ins Junio C Hamano
2022-02-18 19:31 ` Ævar Arnfjörð Bjarmason [this message]
2022-05-12 22:51 ` [PATCH] commit: fix "author_ident" leak Junio C Hamano
2022-05-17 13:48 ` Ævar Arnfjörð Bjarmason
2022-05-18 16:30 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=220218.86fsoggdou.gmgdl@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=ajrhunt@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=martin.agren@gmail.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).