From: Andreas Gruenbacher <agruenba@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [gfs2 PATCH] gfs2: allocate pages for clone bitmaps
Date: Mon, 12 Apr 2021 13:32:03 +0200
Message-ID: <CAHc6FU5=0p6=V3va3UNPB0ci2At3TuZ+TxgD2yQPBNjGzb4WqQ@mail.gmail.com>
In-Reply-To: <344305871.6577253.1618062541261.JavaMail.zimbra@redhat.com>

On Sat, Apr 10, 2021 at 3:49 PM Bob Peterson <rpeterso@redhat.com> wrote:
> Resource group (rgrp) bitmaps have in-core-only "clone" bitmaps that
> ensure that filesystem space freed by deletes is not reused until the
> transaction is complete. Before this patch, these clone bitmaps were
> allocated with kmalloc, but with the default 4K block size, kmalloc is
> wasteful because of how slab tracks objects of that size. In fact, the
> kernel docs recommend slab only for allocations "less than page
> size." See:
> https://www.kernel.org/doc/html/v5.0/core-api/mm-api.html#mm-api-gfp-flags
> With kernel slab debugging options enabled, slab warns that gfs2
> should not do this.
>
> This patch switches the clone bitmap allocations to alloc_page, which
> has much less overhead and uses less memory. The downside: for block
> sizes smaller than page size, allocating a whole page wastes memory.
> But in general, we've always recommended using block size = page size
> for efficiency and performance.
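
For reference, the change described above boils down to something like
this (an illustrative sketch only; bi_clone and bi_bh match the field
names in fs/gfs2/rgrp.c, but the exact patch may differ):

  /* Before: one kmalloc per bitmap, sized to the fs block.  With a 4K
   * block size this lands in a slab cache with tracking overhead. */
  bi->bi_clone = kmalloc(bi->bi_bh->b_size, GFP_NOFS | __GFP_NOFAIL);

  /* After (as described): one full page per bitmap. */
  struct page *page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
  bi->bi_clone = page_address(page);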

If we really want to switch to page-granularity allocations, vmalloc
would be more appropriate. Note that vmalloc doesn't support
__GFP_NOFAIL, so we would first need to get rid of that flag by moving
the allocation into a context where we can sleep.

Looking at rgblk_free and gfs2_free_clones, another cheap improvement
would be to make a single allocation for all the clone bitmaps of a
resource group instead of one allocation per bitmap, roughly as
sketched below.
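
A sketch combining both ideas (rd_bits, rd_length, and sd_sb.sb_bsize
are the actual gfs2 structure fields; the helper itself is hypothetical
and would have to run in a context that may sleep, since vmalloc cannot
be combined with __GFP_NOFAIL):

  /* One vmalloc'd buffer covering every clone bitmap of the resource
   * group, instead of one allocation per bitmap. */
  static int gfs2_alloc_clone_bitmaps(struct gfs2_rgrpd *rgd)
  {
          u32 bsize = rgd->rd_sbd->sd_sb.sb_bsize;
          u8 *clones = vmalloc(array_size(rgd->rd_length, bsize));
          unsigned int x;

          if (!clones)
                  return -ENOMEM;
          for (x = 0; x < rgd->rd_length; x++)
                  rgd->rd_bits[x].bi_clone = clones + x * bsize;
          return 0;
  }

  /* Freeing then becomes a single vfree() of rd_bits[0].bi_clone in
   * gfs2_free_clones() instead of a kfree() per bitmap. */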

But first, I'd like to understand what's actually going on here.

> In a recent test I did with 24 simultaneous recursive file deletes on
> a large dataset (each working to delete a separate directory), this
> patch reduced total elapsed (wall clock) time by 16 percent, from
> 41310 seconds (11.5 hours) down to 34742 seconds (9.65 hours). (This
> was lock_nolock on a single node.)

I find that really hard to believe. Did you look at the frequency of
clone bitmap allocations? If that is the problem, are we simply freeing
the clone bitmaps too aggressively?

Thanks,
Andreas


