All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jeff King" <peff@peff.net>,
	"Andrzej Hunt" <ajrhunt@google.com>,
	"Martin Ågren" <martin.agren@gmail.com>
Subject: Re: Whether to keep using UNLEAK() in built-ins
Date: Fri, 18 Feb 2022 20:31:23 +0100	[thread overview]
Message-ID: <220218.86fsoggdou.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <xmqqwnhs11he.fsf@gitster.g>


On Fri, Feb 18 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>>> Specifically, the cited commit turned an existing strbuf_release()
>>>> on &err into UNLEAK().  If that and the other strbuf (sb) were so
>>>> easily releasable, why didn't we do so back then already?
>>>
>>> I suspect that the answer to the above question is because these
>>> allocations are in the top-level cmd_commit() function, which is
>>> never called recursively or repeatedly as a subroutine.  The only
>>> significant thing that happens after we return from it is to exit.
>>>
>>> In such a code path, marking a variable as UNLEAK() is a better
>>> thing to do than calling strbuf_release().  Both will work as a way
>>> to squelch sanitizers from reporting a leak that does not matter,
>>> but calling strbuf_release() means we'd spend extra cycles to return
>>> pieces of memory to the pool, even though we know that the pool
>>> itself will be cleaned immediately later at exit.
>>>
>>> We already have UNLEAK to tell sanitizers not to distract us from
>>> spotting and plugging real leaks by reporting these apparent leaks
>>> that do not matter.  It is of somewhat dubious value to do a "we
>>> care too much about pleasing sanitizer and spend extra cycles at
>>> runtime while real users are doing real work" change.
>
>> Per https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/ the
>> real goal I have in mind here is to use the built-ins as a stand-in for
>> testing whether the underlying APIs are leak-free.
>>
>> Because of that having to reason about UNLEAK() doing the right thing or
>> not is just unneeded distraction. Yes for a "struct strbuf" it won't
>> matter, but most of what we UNLEAK() is more complex stuff like "struct
>> rev_info". We won't really make headway making revision.c not leak
>> memory without using "git log" et al as the test subjects for whether
>> that API leaks.
>
> I have to say that you have a wrong goal and wrong priority.  The
> number of UNLEAK in cmd_foo() functions is not even a poor
> approximation of our progress.

To clarify I'm not saying it's an approximation of how close we are to
getting rid of memory leaks. I think I'll know better than anyone what
the current state of shoveling shit uphill that is.

I was suggesting that these's a trivial number of these, and mostly we
strbuf_release() already, and that on modern libc's free() is pretty
much free anyway, so worrying about using UNLEAK() v.s. just freeing
before exit is some combination of a premature optimization and
needlessly retaining a special-case for the built-ins.

> Imagine that a subsystem that are repeatedly set-up and used during
> a single program invocation was leaky.  Let's say a program calls
> diff_setup() to prepare the diff_options structure, feeds two things
> to compare, and calls diff_flush() to show the comparison result,
> but let's assume this sequence leaks some resources.
>
> Now cmd_diff() may be such a program that does a setup, feeds two
> trees, calls diff_flush() and exits.  If we didn't do anything to
> it, diff_options may "leak".  Marking it with UNLEAK may be a good
> measure, if we want to keep the leak checker from reporting a leak
> that does not matter in practice so that we can concentrate on
> plugging real leaks that matter.
>
> But consider cmd_log(), running something like "git log -p".  It
> iterates over commits, does the <setup, compare two trees, flush>
> repeatedly for each commit it encounters during the walk on the same
> diff_options structure.  Now, the leak in the code path does matter.
> If diff_flush() is left leaky, it will show up in the leak checker's
> output, and that is reporting real leaks that matter.  The thing is,
> the <setup, compare, flush> sequence whose leak matters is not in
> cmd_log(); it is much deeper in the revision walking machinery.  We
> do not want to paper over such leaks with UNLEAK.
>
> See the difference?  The number of UNLEAK OUTside built-ins does
> matter.  We do not want to have UNLEAK there to hide leaks in
> possible callees that may be called arbitrary number of times in
> arbitrary order.  Compared to that, UNLEAK in cmd_foo() to mark an
> on-stack variable that was used only once without even a recursion
> is purely to squelch the leak checker from reporting leaks that does
> not matter.  For simple things like strbuf on stack of the top-level
> cmd_foo() functions, we could even leave strbuf_release() not called
> and leave the resource reclamation to _exit(2).  That would cause
> the leak checkers to report them and distract us, and UNLEAK is a
> better way to squelch the distraction without wasting extra cycles
> at runtime to actually free them.
>
> So, don't look at "built-ins as a stand-in".  It is a wrong
> direction to go in to let the "leak checker" tail wag the dog.

Yes, I see the difference and I fully accept the point you're making.

I just don't find it to be a useful distinction in practice, because for
e.g. "git log" we do have certain uses of the API which are only
performed by the top-level part of a "git log --whatever".

So if we take the view that e.g. the "struct rev_info rev" should be
UNLEAK()'d because "we're about to exit anyway" then as soon as we *do*
use that part of the API we'll run into the previously unfixed and
hidden leak.

Which is why as noted in the linked-to-above
https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/ I think
it's useful to fix leaks even in built-ins, because they're useful
canaries for those APIs.

Now, in this case does it really matter? No, it doesn't currently. But I
think the direction we should be heading in is turning more of these
cmd_whatever() into properly functioning libraries, as e.g. John Cai's
just-sent series to do that for "reflog delete" does:
https://lore.kernel.org/git/pull.1218.git.git.1645209647.gitgitgadget@gmail.com/

Once we do that the UNLEAK() here will absolutely need to be changed to
a strbuf_release() or equivalent, since if you make two commits with
that new library we'll have leaked memory.

So I think it makes sense to just fix leaks everywhere if it's easy
without worrying about the distinction, particularly since I haven't
seen that "wasting cycles" concern matter in practice.

If it did I'd think adding a "GIT_DESTRUCT_LEVEL" as I suggested in the
linked E-Mail would make more sense, since e.g. you could also use that
if you knew you were walking 100 commits & exiting, which can be useful
to further reduce free() churn, and is how a similar global in the Perl
API (PERL_DESTRUCT_LEVEL) works.

  reply	other threads:[~2022-02-18 19:47 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-16  8:21 [PATCH 0/2] commit: trivial leak fix, add 2 tests to linux-leaks CI Ævar Arnfjörð Bjarmason
2022-02-16  8:21 ` [PATCH 1/2] commit: fix "author_ident" leak Ævar Arnfjörð Bjarmason
2022-02-16 17:59   ` Junio C Hamano
2022-02-16  8:21 ` [PATCH 2/2] commit: use strbuf_release() instead of UNLEAK() Ævar Arnfjörð Bjarmason
2022-02-16 18:03   ` Junio C Hamano
2022-02-16 18:30     ` Junio C Hamano
2022-02-18 12:35       ` Whether to keep using UNLEAK() in built-ins (was: [PATCH 2/2] commit: use strbuf_release() instead of UNLEAK()) Ævar Arnfjörð Bjarmason
2022-02-18 18:19         ` Whether to keep using UNLEAK() in built-ins Junio C Hamano
2022-02-18 19:31           ` Ævar Arnfjörð Bjarmason [this message]
2022-05-12 22:51 ` [PATCH] commit: fix "author_ident" leak Junio C Hamano
2022-05-17 13:48   ` Ævar Arnfjörð Bjarmason
2022-05-18 16:30     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220218.86fsoggdou.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=ajrhunt@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=martin.agren@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.