* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
From: Bob Peterson @ 2018-07-13 20:26 UTC
  To: cluster-devel.redhat.com

Hi,

Before this patch, function gfs2_rgrp_brelse would release its
buffer_heads for the rgrp bitmaps, but it did not release its
reservations. The problem is that when we need to call brelse, we're
basically letting go of the bitmaps, which means our reservations
are no longer valid: someone on another node may have reserved those
blocks and even used them. This patch makes the function return all
the block reservations held for the rgrp whose buffer_heads are
being released.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7e22918d32d6..9348a18d56b9 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1263,6 +1263,7 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
 {
 	int x, length = rgd->rd_length;
 
+	return_all_reservations(rgd);
 	for (x = 0; x < length; x++) {
 		struct gfs2_bitmap *bi = rgd->rd_bits + x;
 		if (bi->bi_bh) {
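
For reference, return_all_reservations() is the existing static helper
earlier in rgrp.c. Roughly, it just walks the rgrp's reservation tree
under the rd_rsspin lock and drops every entry:

static void return_all_reservations(struct gfs2_rgrpd *rgd)
{
	struct rb_node *n;
	struct gfs2_blkreserv *rs;

	spin_lock(&rgd->rd_rsspin);
	while ((n = rb_first(&rgd->rd_rstree))) {
		rs = rb_entry(n, struct gfs2_blkreserv, rs_node);
		__rs_deltree(rs);	/* unlink and give back this reservation */
	}
	spin_unlock(&rgd->rd_rsspin);
}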




* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
From: Steven Whitehouse @ 2018-07-13 21:19 UTC
  To: cluster-devel.redhat.com



On 13/07/18 21:26, Bob Peterson wrote:
> Hi,
>
> Before this patch, function gfs2_rgrp_brelse would release its
> buffer_heads for the rgrp bitmaps, but it did not release its
> reservations. The problem is that when we need to call brelse, we're
> basically letting go of the bitmaps, which means our reservations
> are no longer valid: someone on another node may have reserved those
> blocks and even used them. This patch makes the function return all
> the block reservations held for the rgrp whose buffer_heads are
> being released.
What advantage does this give us? The reservations are intended to be 
hints, so this should not be a problem.

Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/rgrp.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 7e22918d32d6..9348a18d56b9 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1263,6 +1263,7 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
>   {
>   	int x, length = rgd->rd_length;
>   
> +	return_all_reservations(rgd);
>   	for (x = 0; x < length; x++) {
>   		struct gfs2_bitmap *bi = rgd->rd_bits + x;
>   		if (bi->bi_bh) {
>




* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
From: Bob Peterson @ 2018-07-23 15:29 UTC
  To: cluster-devel.redhat.com

----- Original Message -----
> > Before this patch, function gfs2_rgrp_brelse would release its
> > buffer_heads for the rgrp bitmaps, but it did not release its
> > reservations. The problem is that when we need to call brelse, we're
> > basically letting go of the bitmaps, which means our reservations
> > are no longer valid: someone on another node may have reserved those
> > blocks and even used them. This patch makes the function return all
> > the block reservations held for the rgrp whose buffer_heads are
> > being released.
> What advantage does this give us? The reservations are intended to be
> hints, so this should not be a problem.
> 
> Steve.

Hi Steve,

I've been working toward a block reservation system that allows multiple
writers to allocate blocks while sharing resource groups (rgrps) that
are locked for EX. The goal, of course, is to improve write concurrency
and eliminate intra-node write imbalances.

My patches all work reasonably well until rgrps fill up and get down
to their last few blocks, in which case the spans of free blocks aren't
long enough for a minimum reservation, due to rgrp fragmentation. With
today's code, multiple processes doing block allocations can't bump
heads and interfere with one another, because each one holds the rgrp
in EX for the entire span from block reservation through block
allocation. But when we enable multiple processes to
share the rgrp, we need to ensure they can't over-commit the rgrp.

So the problem is over-commitment of rgrps.

For example, let's say you have 10 processes that each want to write
5 blocks. The rgrp glock is locked in EX and they begin sharing it.
When they check the rgrp's free space in inplace_reserve, each one of
the 10, in turn, asks the question, "Does this rgrp have 5 free blocks
available?" Let's say the rgrp has 20 free blocks, so of course the
answer is "yes" for all 10 processes. But when they go to actually
allocate those blocks, the first 4 processes use up those 20 free
blocks. At that point, the rgrp is completely full, but the other 6
processes are over-committed to use that rgrp based on their
requirements. Now we have 6 processes that are unable to allocate
a single block from an rgrp that had previously been deemed to have enough.

So basically, the problem is that our current block reservation system
has a concept of (a) "This rgrp has X reservations," and (b) "there are
Y blocks left over that cannot be reserved for general use." That allows
for rgrps to be over-committed.
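
To make the arithmetic concrete, here is a standalone sketch of that
check-then-allocate gap (plain userspace C; none of this is GFS2 code,
the names are made up):

#include <stdio.h>

#define NPROC	10	/* processes sharing the rgrp */
#define WANT	5	/* blocks each process wants */

int main(void)
{
	int rd_free = 20;	/* free blocks in the rgrp */
	int passed[NPROC];

	/* Check phase (inplace_reserve-style): nothing is deducted,
	   so every process sees 20 >= 5 and is waved through. */
	for (int p = 0; p < NPROC; p++)
		passed[p] = (rd_free >= WANT);

	/* Allocation phase: the first 4 processes drain the rgrp;
	   the other 6 were over-committed and get nothing. */
	for (int p = 0; p < NPROC; p++) {
		if (passed[p] && rd_free >= WANT) {
			rd_free -= WANT;
			printf("process %d: allocated %d blocks\n", p, WANT);
		} else {
			printf("process %d: over-committed, %d blocks left\n",
			       p, rd_free);
		}
	}
	return 0;
}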

(1) My proposed solution

To allow for rgrp sharing, I'm trying to tighten things up and
eliminate the "slop". I'm trying to transition the system from a simple
hint to an actual "promise", i.e. an accounting system so that
over-committing is not possible. With this new system there is still a
concept of (a) "This rgrp has X reservations," but (b) now becomes
"This rgrp has X of my remaining blocks promised to various processes."
So accounting is done to keep track of how many of the free blocks are
promised for allocations outside of reservations. After all block
allocations have been done for a transaction, any remaining blocks that
have been promised from the rgrp to that process are then rescinded,
which means they go back to the general pool of free blocks for
other processes to use.
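
Here is a rough sketch of the accounting I have in mind. The
rd_promised field and both helpers are illustrative names, not the
actual patch; rd_free_clone and rd_rsspin are the existing rgrp fields:

static bool rgd_promise_blocks(struct gfs2_rgrpd *rgd, u32 blocks)
{
	bool ok = false;

	spin_lock(&rgd->rd_rsspin);
	/* Only promise blocks that are free and not already promised
	   to another process; this is what prevents over-commitment. */
	if (rgd->rd_free_clone - rgd->rd_promised >= blocks) {
		rgd->rd_promised += blocks;
		ok = true;
	}
	spin_unlock(&rgd->rd_rsspin);
	return ok;
}

static void rgd_rescind_blocks(struct gfs2_rgrpd *rgd, u32 unused)
{
	/* After the transaction's allocations are done, any blocks
	   promised but not used go back to the general pool. */
	spin_lock(&rgd->rd_rsspin);
	rgd->rd_promised -= unused;
	spin_unlock(&rgd->rd_rsspin);
}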

There are, of course, other ways we can do this. For example, we can:

(2) Alternate solution 1 - "the rest goes to process n"

Automatically assign "all remaining unreserved blocks" to
the first process needing them and force all the others to a different
rgrp. But after years of use, when the rgrps become severely
fragmented, that system would cause much more of a slowdown.

(3) Alternate solution 2 - "hold the lock from reservation to allocation"

Of course, we could also lock other processes out of the rgrp from
reservation until allocation, but that would have almost no advantage
over what we do today: it would pretty much negate rgrp sharing
and we'd end up with the write imbalance problems we have today.

(4) Alternate solution 3 - "Hint with looping"

We could also put a system in place whereby we still use "hints".
Processes that had called inplace_reserve for a given rgrp, but
are now out of free blocks because of over-commitment (we had a
hint, not a promise) must then loop back around and call function
inplace_reserve again, searching for a different rgrp that might
work instead, in which case they could run into the same situation
multiple times. That would most likely be a performance disaster.

(5) Alternate solution 4 - "One (or small) block reservations"

We could allow for one-block (or at least small-block) reservations
and keep a queue of them or something to fulfill a multi-block
allocation requirement. I suspect this would be a nightmare of
kmem_cache_alloc requests and have a lot more overhead than simply
making promises.

(6) Alternate solution 5 - "assign unique rgrps"

We could go back to a system where multiple allocators are
given unique rgrps to work on (which I've proposed in the past,
but which was rejected), but it's pretty much what RHEL6 and prior
releases do by using "try" locks on rgrps. (Which is why
simultaneous allocators often perform better on RHEL6 and older.)

In my opinion, there's no advantage to using a hint when we can
do actual accounting to keep track of spans of blocks too small
to be considered for a reservation.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
From: Andreas Gruenbacher @ 2018-07-23 16:52 UTC
  To: cluster-devel.redhat.com

On 23 July 2018 at 17:29, Bob Peterson <rpeterso@redhat.com> wrote:
> ----- Original Message -----
>> > Before this patch, function gfs2_rgrp_brelse would release its
>> > buffer_heads for the rgrp bitmaps, but it did not release its
>> > reservations. The problem is that when we need to call brelse, we're
>> > basically letting go of the bitmaps, which means our reservations
>> > are no longer valid: someone on another node may have reserved those
>> > blocks and even used them. This patch makes the function return all
>> > the block reservations held for the rgrp whose buffer_heads are
>> > being released.
>> What advantage does this give us? The reservations are intended to be
>> hints, so this should not be a problem.
>>
>> Steve.
>
> Hi Steve,
>
> I've been working toward a block reservation system that allows multiple
> writers to allocate blocks while sharing resource groups (rgrps) that
> are locked for EX. The goal, of course, is to improve write concurrency
> and eliminate intra-node write imbalances.
>
> My patches all work reasonably well until rgrps fill up and get down
> to their last few blocks, in which case the spans of free blocks aren't
> long enough for a minimum reservation, due to rgrp fragmentation. With
> today's code, multiple processes doing block allocations can't bump
> heads and interfere with one another, because each one holds the rgrp
> in EX for the entire span from block reservation through block
> allocation. But when we enable multiple processes to
> share the rgrp, we need to ensure they can't over-commit the rgrp.
>
> So the problem is over-commitment of rgrps.
>
> For example, let's say you have 10 processes that each want to write
> 5 blocks. The rgrp glock is locked in EX and they begin sharing it.
> When they check the rgrp's free space in inplace_reserve, each one of
> the 10, in turn, asks the question, "Does this rgrp have 5 free blocks
> available?" Let's say the rgrp has 20 free blocks, so of course the
> answer is "yes" for all 10 processes. But when they go to actually
> allocate those blocks, the first 4 processes use up those 20 free
> blocks. At that point, the rgrp is completely full, but the other 6
> processes are over-committed to use that rgrp based on their
> requirements. Now we have 6 processes that are unable to allocate
> a single block from an rgrp that had previously been deemed to have enough.
>
> So basically, the problem is that our current block reservation system
> has a concept of (a) "This rgrp has X reservations," and (b) "there are
> Y blocks left over that cannot be reserved for general use." That allows
> for rgrps to be over-committed.
>
> (1) My proposed solution
>
> To allow for rgrp sharing, I'm trying to tighten things up and
> eliminate the "slop". I'm trying to transition the system from a simple
> hint to an actual "promise", i.e. an accounting system so that
> over-committing is not possible. With this new system there is still a
> concept of (a) "This rgrp has X reservations," but (b) now becomes
> "This rgrp has X of my remaining blocks promised to various processes."
> So accounting is done to keep track of how many of the free blocks are
> promised for allocations outside of reservations. After all block
> allocations have been done for a transaction, any remaining blocks that
> have been promised from the rgrp to that process are then rescinded,
> which means they go back to the general pool of free blocks for
> other processes to use.

I'd call that a reservation, as opposed to what the code does
currently. Right now, processes are asking for a TARGET number of
blocks and at least MIN_TARGET blocks, but they get whatever is
available in the chosen resource group. There is no checking whether
processes overrun what they've asked for, and we don't know whether
the reservations were large enough.
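
(For context, allocators express that request today through struct
gfs2_alloc_parms; the relevant fields look roughly like:

struct gfs2_alloc_parms {
	u64 target;	/* number of blocks we would like */
	u32 min_target;	/* smallest acceptable reservation */
	u32 aflags;
	/* ... */
};

so turning the hint into a real reservation mostly means holding
callers to target/min_target instead of handing them whatever the
resource group happens to have left.)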

> There are, of course, other ways we can do this. For example, we can:
>
> (2) Alternate solution 1 - "the rest goes to process n"
>
> Automatically assign "all remaining unreserved blocks" to
> the first process needing them and force all the others to a different
> rgrp. But after years of use, when the rgrps become severely
> fragmented, that system would cause much more of a slowdown.
>
> (3) Alternate solution 2 - "hold the lock from reservation to allocation"
>
> Of course, we could also lock other processes out of the rgrp from
> reservation until allocation, but that would have almost no advantage
> over what we do today: it would pretty much negate rgrp sharing
> and we'd end up with the write imbalance problems we have today.
>
> (4) Alternate solution 3 - "Hint with looping"
>
> We could also put a system in place whereby we still use "hints".
> Processes that had called inplace_reserve for a given rgrp, but
> are now out of free blocks because of over-commitment (we had a
> hint, not a promise) must then loop back around and call function
> inplace_reserve again, searching for a different rgrp that might
> work instead, in which case they could run into the same situation
> multiple times. That would most likely be a performance disaster.
>
> (5) Alternate solution 4 - "One (or small) block reservations"
>
> We could allow for one-block (or at least small-block) reservations
> and keep a queue of them or something to fulfill a multi-block
> allocation requirement. I suspect this would be a nightmare of
> kmem_cache_alloc requests and have a lot more overhead than simply
> making promises.
>
> (6) Alternate solution 5 - "assign unique rgrps"
>
> We could go back to a system where multiple allocators are
> given unique rgrps to work on (which I've proposed in the past,
> but which was rejected), but it's pretty much what RHEL6 and prior
> releases do by using "try" locks on rgrps. (Which is why
> simultaneous allocators often perform better on RHEL6 and older.)
>
> In my opinion, there's no advantage to using a hint when we can
> do actual accounting to keep track of spans of blocks too small
> to be considered for a reservation.

AFAIK, fallocate is the only caller based on "give me all you have"
semantics. It shouldn't be hard to change that to simply take out
pretty large reservations; regular writes should have a reasonably
good idea how many blocks they may require. And with the recent iomap
restructuring, they'll know exactly how many blocks they'll require
pretty soon. So I'm all for moving to actual reservations.

> Regards,
>
> Bob Peterson
> Red Hat File Systems

Thanks,
Andreas



