From: Jeff King <peff@peff.net>
To: John Cai <johncai86@gmail.com>
Cc: git <git@vger.kernel.org>, Christian Couder <christian.couder@gmail.com>
Subject: [PATCH 1/3] fsck: free tree buffers after walking unreachable objects
Date: Thu, 22 Sep 2022 06:11:43 -0400 [thread overview]
Message-ID: <Yyw031PqCyYlEqCX@coredump.intra.peff.net> (raw)
In-Reply-To: <Yyw0PSVe3YTQGgRS@coredump.intra.peff.net>
After calling fsck_walk(), a tree object struct may be left in the
parsed state, with the full tree contents available via tree->buffer.
It's the responsibility of the caller to free these when it's done with
the object to avoid having many trees allocated at once.
In a regular "git fsck", we hit fsck_walk() only from fsck_obj(), which
does call free_tree_buffer(). Likewise for "--connectivity-only", we see
most objects via traverse_one_object(), which makes a similar call.
The exception is in mark_unreachable_referents(). When using both
"--connectivity-only" and "--dangling" (the latter of which is the
default), we walk all of the unreachable objects, and there we forget to
free. Most cases would not notice this, because they don't have a lot of
unreachable objects, but you can make a pathological case like this:
git clone --bare /path/to/linux.git repo.git
cd repo.git
rm packed-refs ;# now everything is unreachable!
git fsck --connectivity-only
That ends up with peak heap usage ~18GB, which is (not coincidentally)
close to the size of all uncompressed trees in the repository. After
this patch, the peak heap is only ~2GB.
A few things to note:
- it might seem like fsck_walk(), if it is parsing the trees, should
be responsible for freeing them. But the situation is quite tricky.
In the non-connectivity mode, after we call fsck_walk() we then
proceed with fsck_object() which actually does the type-specific
sanity checks on the object contents. We do pass our own separate
buffer to fsck_object(), but there's a catch: our earlier call to
parse_object_buffer() may have attached that buffer to the object
struct! So by freeing it, we leave the rest of the code with a
dangling pointer.
Likewise, the call to fsck_walk() in index-pack is subtle. It
attaches a buffer to the tree object that must not be freed! And
so rather than calling free_tree_buffer(), it actually detaches it
by setting tree->buffer to NULL.
These cases would _probably_ be fixable by having fsck_walk() free
the tree buffer only when it was the one who allocated it via
parse_tree(). But that would still leave the callers responsible for
freeing other cases, so they wouldn't be simplified. While the
current semantics for fsck_walk() make it easy to accidentally leak
in new callers, at least they are simple to explain, and it's not a
function that's likely to get a lot of new call-sites.
And in any case, it's probably sensible to fix the leak first with
this simple patch, and try any more complicated refactoring
separately.
- a careful reader may notice that fsck_obj() also frees commit
buffers, but neither the call in traverse_one_object() nor the one
touched in this patch does so. And indeed, this is another problem
for --connectivity-only (and accounts for most of the 2GB heap after
this patch), but it's one we'll fix in a separate commit.
Signed-off-by: Jeff King <peff@peff.net>
---
builtin/fsck.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/builtin/fsck.c b/builtin/fsck.c
index f7916f06ed..34e575a170 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -228,6 +228,8 @@ static void mark_unreachable_referents(const struct object_id *oid)
options.walk = mark_used;
fsck_walk(obj, NULL, &options);
+ if (obj->type == OBJ_TREE)
+ free_tree_buffer((struct tree *)obj);
}
static int mark_loose_unreachable_referents(const struct object_id *oid,
--
2.38.0.rc1.583.ga560cd8328
next prev parent reply other threads:[~2022-09-22 10:11 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-20 19:27 [INVESTIGATION] why is fsck --connectivity-only so much more expensive than rev-list --objects --all? John Cai
2022-09-20 20:41 ` Jeff King
2022-09-22 10:09 ` [PATCH 0/3] reducing fsck memory usage Jeff King
2022-09-22 10:11 ` Jeff King [this message]
2022-09-22 18:40 ` [PATCH 1/3] fsck: free tree buffers after walking unreachable objects Junio C Hamano
2022-09-22 18:58 ` Jeff King
2022-09-22 19:27 ` Junio C Hamano
2022-09-22 22:16 ` Jeff King
2022-09-22 10:13 ` [PATCH 2/3] fsck: turn off save_commit_buffer Jeff King
2022-09-22 10:15 ` [PATCH 3/3] parse_object_buffer(): respect save_commit_buffer Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yyw031PqCyYlEqCX@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=johncai86@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).