All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH] pack-objects: lazily set up "struct rev_info", don't leak
Date: Fri, 25 Mar 2022 12:41:35 -0400	[thread overview]
Message-ID: <8d368240-dae5-7a66-6c0c-9e0a960ca34c@github.com> (raw)
In-Reply-To: <220325.86r16qkodl.gmgdl@evledraar.gmail.com>

On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 25 2022, Derrick Stolee wrote:
> 
>> On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
>>> In the preceding [1] (pack-objects: move revs out of
>>> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
>>> to cmd_pack_objects() so that it unconditionally took place for all
>>> invocations of "git pack-objects".
>>>
>>> We'd thus start leaking memory, which is easily reproduced in
>>> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
>>> information manager from hell, 2005-04-07) to "git pack-objects";
>>>
>>>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>>>     [...]
>>> 	==19130==ERROR: LeakSanitizer: detected memory leaks
>>>
>>> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
>>> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
>>> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
>>> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
>>> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
>>> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
>>> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
>>> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
>>> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
>>> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
>>> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
>>> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
>>> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
>>> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
>>> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
>>>
>>> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
>>> 	Aborted
>>>
>>> Narrowly fixing that commit would have been easy, just add call
>>> repo_init_revisions() right before get_object_list(), which is
>>> effectively what was done before that commit.
>>>
>>> But an unstated constraint when setting it up early is that it was
>>> needed for the subsequent [2] (pack-objects: parse --filter directly
>>> into revs.filter, 2022-03-22), i.e. we might have a --filter
>>> command-line option, and need to either have the "struct rev_info"
>>> setup when we encounter that option, or later.
>>>
>>> Let's just change the control flow so that we'll instead set up the
>>> "struct rev_info" only when we need it. Doing so leads to a bit more
>>> verbosity, but it's a lot clearer what we're doing and why.
>>
>> This makes sense.
>>
>>> We could furthermore combine the two get_object_list() invocations
>>> here by having repo_init_revisions() invoked on &pfd.revs, but I think
>>> clearly separating the two makes the flow clearer. Likewise
>>> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
>>> "have_revs" early in cmd_pack_objects().
>>
>> I disagree, especially when you later want to make sure we free
>> the data from revs using your release_revisions().
> 
> Because we'll need two release_revisions() instead of one?

And it would be easy to miss one of the two if you are only testing
certain paths with leak-check on.

>>> This does add the future constraint to opt_parse_list_objects_filter()
>>> that we'll need to adjust this wrapper code if it looks at any other
>>> value of the "struct option" than the "value" member.
>>
>> So we are coupling ourselves to the implementation of this method.
> 
> Sure, but aren't all OPT_* where the macro is providing some custom
> callback coupling themselves to how it's implemented?

No, I mean that the callback function you are making here is coupling
itself to the existing callback function in a novel way.

I tried to find another example of a nested callback, but I didn't
even find another instance of creating a new 'struct option'.

>>> But that regression should be relatively easy to spot. I'm
>>> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
>>> that various memory sanity checkers would spot that, we just
>>> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
>>> on e.g. this test if we were to use another member of "opt" in
>>> opt_parse_list_objects_filter()>
>>
>> So you are using uninitialized memory as a way to discover any
>> necessary changes to that coupling. I'm not sure this is super-safe
>> because we don't necessarily run memory checkers during CI builds.
>>
>> I'd rather have a consistently initialized chunk of data that would
>> behave predictably (and hopefully we'd discover it is behaving
>> incorrectly with that predictable behavior).
> 
> I'd like to keep this as-is, i.e. if you zero it out it might also
> behave unpredictably or even in undefined ways (e.g. a NULL ptr
> dereference). Likewise in my version, but it has the benefit of being
> caught by some tooling we have.

I guess by "predictable" I mean "well-defined" or "deterministic".

I don't want the build options to change how this works (for instance,
debug builds sometimes initialize data to zero).

>> +struct rev_info_maybe_empty {
>> +	int has_revs;
>> +	struct rev_info revs;
>> +};

Thinking about this a second time, perhaps it would be best to add
an "unsigned initialized:1;" to struct rev_info so we can look at
such a struct and know whether or not repo_init_revisions() has
been run or not. Avoids the custom struct and unifies a few things.

In particular, release_revisions() could choose to do nothing if
revs->initialized is false.

Further, a second repo_init_revisions() could do nothing if
revs->initialized is true. This allows us to safely "re-init"
without our own "if (has_revs)" checks...
>> 	else {
>> 		if (!pfd.have_revs) {
>> 			repo_init_revisions(the_repository, &pfd.revs, NULL);
>> 			pfd.have_revs = 1;
>> 		}
>> 		get_object_list(&pfd.revs, rp.nr, rp.v);
>> 	}

...making this just

	else {
		repo_init_revisions(the_repository, &revs, NULL);
		get_object_list(&revs. rp.nr, rp.v);
	}

> Conceptually I think the saving of that one line isn't worth it to the
> reader.
> 
> Then you'd need to read up and see exactly how pfd.revs gets mutated,
> and its callback etc., only to see we're just doing this to save
> ourselves a variable deceleration and a call to release_revisions().
> 
>> and then later you can add
>>
>> 	if (pfd.have_revs)
>> 		release_revisions(&pfd.revs);

And this would just be

	release_revisions(&revs);

>> to clear the memory in exactly one place.
> 
> Yeah, it would work. I'd just prefer control flow that's trivial to
> reason about over saving a couple of lines here.

I think having multiple revision things that can live at different
levels (one embedded in a custom struct, one not) is not trivial to
reason about. If we change the "is this initialized" indicator to be
within 'struct rev_info', then this gets simpler.

It seems to me that it is easier to track a single struct and release
the memory in one place based on its lifespan.

Thanks,
-Stolee

  reply	other threads:[~2022-03-25 16:41 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-22 17:28 [PATCH 0/5] Partial bundle follow ups Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 1/5] list-objects-filter: remove CL_ARG__FILTER Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 2/5] pack-objects: move revs out of get_object_list() Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 3/5] pack-objects: parse --filter directly into revs.filter Derrick Stolee via GitGitGadget
2022-03-22 19:37   ` [-SPAM-] " Ramsay Jones
2022-03-23 13:48     ` Derrick Stolee
2022-03-22 21:15   ` Ævar Arnfjörð Bjarmason
2022-03-22 17:28 ` [PATCH 4/5] bundle: move capabilities to end of 'verify' Derrick Stolee via GitGitGadget
2022-03-23  7:08   ` Bagas Sanjaya
2022-03-23 13:39     ` Derrick Stolee
2022-03-22 17:28 ` [PATCH 5/5] bundle: output hash information in 'verify' Derrick Stolee via GitGitGadget
2022-03-23 21:27 ` [PATCH 0/5] Partial bundle follow ups Junio C Hamano
2022-03-25 14:25 ` [PATCH] pack-objects: lazily set up "struct rev_info", don't leak Ævar Arnfjörð Bjarmason
2022-03-25 14:57   ` Derrick Stolee
2022-03-25 16:00     ` Ævar Arnfjörð Bjarmason
2022-03-25 16:41       ` Derrick Stolee [this message]
2022-03-25 17:34         ` Ævar Arnfjörð Bjarmason
2022-03-25 19:08           ` Derrick Stolee
2022-03-26  0:52             ` Ævar Arnfjörð Bjarmason
2022-03-28 14:04               ` Derrick Stolee
2022-03-25 18:53   ` Junio C Hamano
2022-03-26  1:09     ` Ævar Arnfjörð Bjarmason
2022-03-28 15:43   ` [PATCH v2] " Ævar Arnfjörð Bjarmason
2022-03-28 15:58     ` Derrick Stolee
2022-03-28 17:10     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8d368240-dae5-7a66-6c0c-9e0a960ca34c@github.com \
    --to=derrickstolee@github.com \
    --cc=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.