From: Andreas Gruenbacher <agruenba@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
Date: Mon, 23 Jul 2018 18:52:23 +0200
Message-ID: <CAHc6FU7ziKB4BPETazyEXEDm3o-pqDY9+hN0syVoeioJWobU1Q@mail.gmail.com> (raw)
In-Reply-To: <972293249.53472433.1532359760435.JavaMail.zimbra@redhat.com>

On 23 July 2018 at 17:29, Bob Peterson <rpeterso@redhat.com> wrote:
> ----- Original Message -----
>> > Before this patch, function gfs2_rgrp_brelse would release its
>> > buffer_heads for the rgrp bitmaps, but it did not release its
>> > reservations. The problem is: When we need to call brelse, we're
>> > basically letting go of the bitmaps, which means our reservations
>> > are no longer valid: someone on another node may have reserved those
>> > blocks and even used them. This patch makes the function return all
>> > the block reservations held for the rgrp whose buffer_heads are
>> > being released.
>> What advantage does this give us? The reservations are intended to be
>> hints, so this should not be a problem.
>>
>> Steve.
>
> Hi Steve,
>
> I've been working toward a block reservation system that allows multiple
> writers to allocate blocks while sharing resource groups (rgrps) that
> are locked for EX. The goal, of course, is to improve write concurrency
> and eliminate intra-node write imbalances.
>
> My patches all work reasonably well until rgrps fill up and get down
> to their last few blocks, in which case the spans of free blocks aren't
> long enough for a minimum reservation, due to rgrp fragmentation. With
> the current code, multiple processes doing block allocations can't
> bump heads and interfere with one another because they each lock the
> rgrp in EX for a span that covers from the block reservation all the
> way up to block allocation. But when we enable multiple processes to
> share the rgrp, we need to ensure they can't over-commit the rgrp.
>
> So the problem is over-commitment of rgrps.
>
> For example, let's say you have 10 processes that each want to write
> 5 blocks. The rgrp glock is locked in EX and they begin sharing it.
> When they check the rgrp's free space in inplace_reserve, each one of
> the 10, in turn, asks the question, "Does this rgrp have 5 free blocks
> available?" Let's say the rgrp has 20 free blocks, so of course the
> answer is "yes" for all 10 processes. But when they go to actually
> allocate those blocks, the first 4 processes use up those 20 free
> blocks. At that point, the rgrp is completely full, but the other 6
> processes are over-committed to use that rgrp based on their
> requirements. Now we have 6 processes that are unable to allocate
> a single block from an rgrp that was previously deemed to have enough.
>
> So basically, the problem is that our current block reservations system
> has a concept of (a) "This rgrp has X reservations," and (b) "there are
> Y blocks leftover that cannot be reserved for general use." That allows
> for rgrps to be over-committed.
>
> (1) My proposed solution
>
> To allow for rgrp sharing, I'm trying to tighten things up and
> eliminate the "slop". I'm trying to transition the system from a simple
> hint to an actual "promise", i.e. an accounting system so that
> over-committing is not possible. With this new system there is still a
> concept of (a) "This rgrp has X reservations," but (b) now becomes
> "This rgrp has X of my remaining blocks promised to various processes."
> So accounting is done to keep track of how many of the free blocks are
> promised for allocations outside of reservations. After all block
> allocations have been done for a transaction, any remaining blocks that
> have been promised from the rgrp to that process are then rescinded,
> which means they go back to the general pool of free blocks for
> other processes to use.

I'd call that a reservation, as opposed to what the code does
currently. Right now, processes are asking for a TARGET number of
blocks and at least MIN_TARGET blocks, but they get whatever is
available in the chosen resource group. There is no checking whether
processes overrun what they've asked for, and we don't know if
the reservations were large enough.
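To make the failure mode concrete, here is a toy model of the hint-based
check (the names Rgrp, hint_has_room, allocate are made up for
illustration, not the actual GFS2 code): ten writers each pass the
free-space check, but only the first four actually find blocks.

```python
# Toy model of the over-commitment scenario Bob describes.
# Hypothetical names; not GFS2 code.

class Rgrp:
    def __init__(self, free):
        self.free = free

    def hint_has_room(self, want):
        # Today's "hint": only looks at the current free count,
        # ignoring what other processes have already been told.
        return self.free >= want

    def allocate(self, want):
        got = min(self.free, want)
        self.free -= got
        return got

rg = Rgrp(free=20)
want = 5

# All 10 processes pass the inplace_reserve-style check...
admitted = [p for p in range(10) if rg.hint_has_room(want)]

# ...but at allocation time only the first 4 find any blocks;
# the other 6 come up empty.
results = [rg.allocate(want) for _ in admitted]
```

The check and the allocation are separated in time, so the same 20 free
blocks are counted by all ten processes.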

> There are, of course, other ways we can do this. For example, we can:
>
> (2) Alternate solution 1 - "the rest goes to process n"
>
> Automatically assign "all remaining unreserved blocks" to
> the first process needing them and force all the others to a different
> rgrp. But after years of use, when the rgrps become severely
> fragmented, that system would cause much more of a slowdown.
>
> (3) Alternate solution 2 - "hold the lock from reservation to allocation"
>
> Of course, we could also lock other processes out of the rgrp from
> reservation until allocation, but that would have almost no advantage
> over what we do today: it would pretty much negate rgrp sharing
> and we'd end up with the write imbalance problems we have today.
>
> (4) Alternate solution 3 - "Hint with looping"
>
> We could also put a system in place whereby we still use "hints".
> Processes that had called inplace_reserve for a given rgrp, but
> are now out of free blocks because of over-commitment (we had a
> hint, not a promise) must then loop back around and call function
> inplace_reserve again, searching for a different rgrp that might
> work instead, in which case it could run into the same situation
> multiple times. That would most likely be a performance disaster.
>
> (5) Alternate solution 4 - "One (or small) block reservations"
>
> We could allow for one-block (or at least small-block) reservations
> and keep a queue of them or something to fulfill a multi-block
> allocation requirement. I suspect this would be a nightmare of
> kmem_cache_alloc requests and have a lot more overhead than simply
> making promises.
>
> (6) Alternate solution 5 - "assign unique rgrps"
>
> We could go back to a system where multiple allocators are
> given unique rgrps to work on (which I've proposed in the past,
> but which was rejected), but it's pretty much what RHEL6 and prior
> releases do by using "try" locks on rgrps. (Which is why
> simultaneous allocators often perform better on RHEL6 and older.)
>
> In my opinion, there's no advantage to using a hint when we can
> do actual accounting to keep track of spans of blocks too small
> to be considered for a reservation.

AFAIK, fallocate is the only caller based on "give me all you have"
semantics. It shouldn't be hard to change that to simply take out
pretty large reservations; regular writes should have a reasonably
good idea how many blocks they may require. And with the recent iomap
restructuring, they'll know exactly how many blocks they'll require
pretty soon. So I'm all for moving to actual reservations.

> Regards,
>
> Bob Peterson
> Red Hat File Systems

Thanks,
Andreas



Thread overview: 4+ messages
     [not found] <224749285.50534453.1531513566639.JavaMail.zimbra@redhat.com>
2018-07-13 20:26 ` [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called Bob Peterson
2018-07-13 21:19   ` Steven Whitehouse
2018-07-23 15:29     ` Bob Peterson
2018-07-23 16:52       ` Andreas Gruenbacher [this message]
