From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Mon, 23 Jul 2018 11:29:20 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
In-Reply-To: <85954133-6611-1665-c0b6-d422d16f4d9c@redhat.com>
References: <963882949.50535174.1531513587686.JavaMail.zimbra@redhat.com> <85954133-6611-1665-c0b6-d422d16f4d9c@redhat.com>
Message-ID: <972293249.53472433.1532359760435.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> > Before this patch, function gfs2_rgrp_brelse would release its
> > buffer_heads for the rgrp bitmaps, but it did not release its
> > reservations. The problem is: when we need to call brelse, we're
> > basically letting go of the bitmaps, which means our reservations
> > are no longer valid: someone on another node may have reserved those
> > blocks and even used them. This patch makes the function return all
> > the block reservations held for the rgrp whose buffer_heads are
> > being released.
>
> What advantage does this give us? The reservations are intended to be
> hints, so this should not be a problem.
>
> Steve.

Hi Steve,

I've been working toward a block reservation system that allows multiple writers to allocate blocks while sharing resource groups (rgrps) that are locked for EX. The goal, of course, is to improve write concurrency and eliminate intra-node write imbalances.

My patches all work reasonably well until rgrps fill up and get down to their last few blocks, at which point the spans of free blocks aren't long enough for a minimum reservation, due to rgrp fragmentation. With the current code, multiple processes doing block allocations can't bump heads and interfere with one another, because each one locks the rgrp in EX for a span that covers everything from block reservation all the way through block allocation.
But when we enable multiple processes to share the rgrp, we need to ensure they can't over-commit it. So the problem is over-commitment of rgrps.

For example, say you have 10 processes that each want to write 5 blocks. The rgrp glock is locked in EX and they begin sharing it. When they check the rgrp's free space in inplace_reserve, each of the 10, in turn, asks the question, "Does this rgrp have 5 free blocks available?" Say the rgrp has 20 free blocks, so of course the answer is "yes" for all 10 processes. But when they go to actually allocate those blocks, the first 4 processes use up those 20 free blocks. At that point the rgrp is completely full, but the other 6 processes are still committed to use that rgrp based on their requirements. Now we have 6 processes that are unable to allocate a single block from an rgrp that had previously been deemed to have enough.

So basically, the problem is that our current block reservation system has a concept of (a) "This rgrp has X reservations," and (b) "there are Y blocks left over that cannot be reserved for general use." That allows rgrps to be over-committed.

(1) My proposed solution
To allow for rgrp sharing, I'm trying to tighten things up and eliminate the "slop": to transition the system from a simple hint to an actual "promise", i.e. an accounting system under which over-committing is not possible. With this new system there is still a concept of (a) "This rgrp has X reservations," but (b) now becomes "This rgrp has Y of its remaining blocks promised to various processes." So accounting is done to keep track of how many of the free blocks are promised for allocations outside of reservations. After all block allocations have been done for a transaction, any remaining blocks that had been promised from the rgrp to that process are rescinded, which means they go back to the general pool of free blocks for other processes to use.

There are, of course, other ways we can do this.
For example, we can:

(2) Alternate solution 1 - "the rest goes to process n"
Automatically assign "all remaining unreserved blocks" to the first process needing them and force all the others to a different rgrp. But after years of use, when the rgrps become severely fragmented, that system would cause much more of a slowdown.

(3) Alternate solution 2 - "hold the lock from reservation to allocation"
We could also block other processes from the rgrp from reservation through allocation, but that would have almost no advantage over what we do today: it would pretty much negate rgrp sharing and we'd end up with the write imbalance problems we have today.

(4) Alternate solution 3 - "Hint with looping"
We could also put a system in place whereby we still use "hints". Processes that had called inplace_reserve for a given rgrp, but are now out of free blocks because of over-commitment (we had a hint, not a promise), must then loop back around and call inplace_reserve again, searching for a different rgrp that might work instead, and they could run into the same situation multiple times. That would most likely be a performance disaster.

(5) Alternate solution 4 - "One (or small) block reservations"
We could allow for one-block (or at least small-block) reservations and keep a queue of them, or something similar, to fulfill a multi-block allocation requirement. I suspect this would be a nightmare of kmem_cache_alloc requests and have a lot more overhead than simply making promises.

(6) Alternate solution 5 - "assign unique rgrps"
We could go back to a system where multiple allocators are given unique rgrps to work on (which I've proposed in the past, but which was rejected). That is pretty much what RHEL6 and prior releases do by using "try" locks on rgrps. (Which is why simultaneous allocators often perform better on RHEL6 and older.)
In my opinion, there's no advantage to using a hint when we can do actual accounting to keep track of spans of blocks too small to be considered for a reservation.

Regards,

Bob Peterson
Red Hat File Systems