* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
  [not found] <224749285.50534453.1531513566639.JavaMail.zimbra@redhat.com>
@ 2018-07-13 20:26 ` Bob Peterson
  2018-07-13 21:19   ` Steven Whitehouse
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Peterson @ 2018-07-13 20:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Before this patch, function gfs2_rgrp_brelse would release its
buffer_heads for the rgrp bitmaps, but it did not release its
reservations. The problem is: when we need to call brelse, we're
basically letting go of the bitmaps, which means our reservations
are no longer valid: someone on another node may have reserved those
blocks and even used them. This patch makes the function return all
the block reservations held for the rgrp whose buffer_heads are
being released.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7e22918d32d6..9348a18d56b9 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1263,6 +1263,7 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
 {
 	int x, length = rgd->rd_length;
 
+	return_all_reservations(rgd);
 	for (x = 0; x < length; x++) {
 		struct gfs2_bitmap *bi = rgd->rd_bits + x;
 		if (bi->bi_bh) {

^ permalink raw reply related	[flat|nested] 4+ messages in thread
* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
  2018-07-13 20:26 ` [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called Bob Peterson
@ 2018-07-13 21:19   ` Steven Whitehouse
  2018-07-23 15:29     ` Bob Peterson
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Whitehouse @ 2018-07-13 21:19 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 13/07/18 21:26, Bob Peterson wrote:
> Hi,
>
> Before this patch, function gfs2_rgrp_brelse would release its
> buffer_heads for the rgrp bitmaps, but it did not release its
> reservations. The problem is: when we need to call brelse, we're
> basically letting go of the bitmaps, which means our reservations
> are no longer valid: someone on another node may have reserved those
> blocks and even used them. This patch makes the function return all
> the block reservations held for the rgrp whose buffer_heads are
> being released.

What advantage does this give us? The reservations are intended to be
hints, so this should not be a problem,

Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>  fs/gfs2/rgrp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 7e22918d32d6..9348a18d56b9 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1263,6 +1263,7 @@ void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
>  {
>  	int x, length = rgd->rd_length;
>  
> +	return_all_reservations(rgd);
>  	for (x = 0; x < length; x++) {
>  		struct gfs2_bitmap *bi = rgd->rd_bits + x;
>  		if (bi->bi_bh) {

^ permalink raw reply	[flat|nested] 4+ messages in thread
* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
  2018-07-13 21:19   ` Steven Whitehouse
@ 2018-07-23 15:29     ` Bob Peterson
  2018-07-23 16:52       ` Andreas Gruenbacher
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Peterson @ 2018-07-23 15:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> > Before this patch, function gfs2_rgrp_brelse would release its
> > buffer_heads for the rgrp bitmaps, but it did not release its
> > reservations. The problem is: when we need to call brelse, we're
> > basically letting go of the bitmaps, which means our reservations
> > are no longer valid: someone on another node may have reserved those
> > blocks and even used them. This patch makes the function return all
> > the block reservations held for the rgrp whose buffer_heads are
> > being released.
> What advantage does this give us? The reservations are intended to be
> hints, so this should not be a problem,
>
> Steve.

Hi Steve,

I've been working toward a block reservation system that allows multiple
writers to allocate blocks while sharing resource groups (rgrps) that
are locked for EX. The goal, of course, is to improve write concurrency
and eliminate intra-node write imbalances.

My patches all work reasonably well until rgrps fill up and get down
to their last few blocks, in which case the spans of free blocks aren't
long enough for a minimum reservation, due to rgrp fragmentation. With
today's current code, multiple processes doing block allocations can't
bump heads and interfere with one another because they each lock the
rgrp in EX for a span that covers from the block reservation all the
way up to block allocation. But when we enable multiple processes to
share the rgrp, we need to ensure they can't over-commit the rgrp.

So the problem is over-commitment of rgrps.

For example, let's say you have 10 processes that each want to write
5 blocks. The rgrp glock is locked in EX and they begin sharing it.
When they check the rgrp's free space in inplace_reserve, each one of
the 10, in turn, asks the question, "Does this rgrp have 5 free blocks
available?" Let's say the rgrp has 20 free blocks, so of course the
answer is "yes" for all 10 processes. But when they go to actually
allocate those blocks, the first 4 processes use up those 20 free
blocks. At that point, the rgrp is completely full, but the other 6
processes are over-committed to use that rgrp based on their
requirements. Now we have 6 processes that are unable to allocate
a single block from an rgrp that had previously been deemed to have
enough.

So basically, the problem is that our current block reservation system
has a concept of (a) "This rgrp has X reservations," and (b) "there are
Y blocks left over that cannot be reserved, available for general use."
That allows for rgrps to be over-committed.

(1) My proposed solution

To allow for rgrp sharing, I'm trying to tighten things up and
eliminate the "slop". I'm trying to transition the system from a simple
hint to an actual "promise", i.e. an accounting system so that
over-committing is not possible. With this new system there is still a
concept of (a) "This rgrp has X reservations," but (b) now becomes
"This rgrp has Y of my remaining blocks promised to various processes."
So accounting is done to keep track of how many of the free blocks are
promised for allocations outside of reservations. After all block
allocations have been done for a transaction, any remaining blocks that
have been promised from the rgrp to that process are rescinded, which
means they go back to the general pool of free blocks for other
processes to use.

There are, of course, other ways we can do this. For example, we can:

(2) Alternate solution 1 - "the rest goes to process n"

Automatically assign "all remaining unreserved blocks" to the first
process needing them and force all the others to a different rgrp.
But after years of use, when the rgrps become severely fragmented,
that system would cause much more of a slowdown.

(3) Alternate solution 2 - "hold the lock from reservation to allocation"

Of course, we could also block other processes from the rgrp from
reservation to allocation, but that would have almost no advantage
over what we do today: it would pretty much negate rgrp sharing
and we'd end up with the write imbalance problems we have today.

(4) Alternate solution 3 - "Hint with looping"

We could also put a system in place whereby we still use "hints".
Processes that had called inplace_reserve for a given rgrp, but
are now out of free blocks because of over-commitment (we had a
hint, not a promise) must then loop back around and call function
inplace_reserve again, searching for a different rgrp that might
work instead, in which case they could run into the same situation
multiple times. That would most likely be a performance disaster.

(5) Alternate solution 4 - "One (or small) block reservations"

We could allow for one-block (or at least small-block) reservations
and keep a queue of them or something to fulfill a multi-block
allocation requirement. I suspect this would be a nightmare of
kmem_cache_alloc requests and have a lot more overhead than simply
making promises.

(6) Alternate solution 5 - "assign unique rgrps"

We could go back to a system where multiple allocators are given
unique rgrps to work on (which I've proposed in the past, but which
was rejected), but it's pretty much what RHEL6 and prior releases
do by using "try" locks on rgrps. (Which is why simultaneous
allocators often perform better on RHEL6 and older.)

In my opinion, there's no advantage to using a hint when we can
do actual accounting to keep track of spans of blocks too small
to be considered for a reservation.

Regards,

Bob Peterson
Red Hat File Systems

^ permalink raw reply	[flat|nested] 4+ messages in thread
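The promise accounting in proposal (1) can be contrasted with today's hint behavior in a small userspace model. This is only a sketch of the idea under discussion: the names (rgrp_model, rd_promised, promise_blocks, rescind_blocks) are illustrative, not the actual GFS2 fields or API.

```c
#include <assert.h>

/* A minimal model of a resource group for demonstrating over-commitment. */
struct rgrp_model {
	unsigned int rd_free;      /* free blocks in the rgrp */
	unsigned int rd_promised;  /* blocks promised but not yet allocated */
};

/* Hint semantics: every caller checks the same raw free count, so all
 * of them can be told "yes" even when the rgrp cannot satisfy them all. */
static int hint_check(const struct rgrp_model *rgd, unsigned int want)
{
	return rgd->rd_free >= want;
}

/* Promise semantics: a successful check also debits the uncommitted
 * pool, so later callers see only what is genuinely still available.
 * The check maintains the invariant rd_promised <= rd_free. */
static int promise_blocks(struct rgrp_model *rgd, unsigned int want)
{
	if (rgd->rd_free - rgd->rd_promised < want)
		return 0;
	rgd->rd_promised += want;
	return 1;
}

/* After the transaction, unused promised blocks are rescinded and
 * returned to the general pool for other processes to use. */
static void rescind_blocks(struct rgrp_model *rgd, unsigned int unused)
{
	rgd->rd_promised -= unused;
}
```

With the 10-writers/5-blocks/20-free example above, hint_check answers "yes" to all ten processes while promise_blocks stops after four, which is exactly the over-commitment the accounting is meant to prevent.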
* [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
  2018-07-23 15:29     ` Bob Peterson
@ 2018-07-23 16:52       ` Andreas Gruenbacher
  0 siblings, 0 replies; 4+ messages in thread
From: Andreas Gruenbacher @ 2018-07-23 16:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 23 July 2018 at 17:29, Bob Peterson <rpeterso@redhat.com> wrote:
> ----- Original Message -----
>> > Before this patch, function gfs2_rgrp_brelse would release its
>> > buffer_heads for the rgrp bitmaps, but it did not release its
>> > reservations. The problem is: when we need to call brelse, we're
>> > basically letting go of the bitmaps, which means our reservations
>> > are no longer valid: someone on another node may have reserved those
>> > blocks and even used them. This patch makes the function return all
>> > the block reservations held for the rgrp whose buffer_heads are
>> > being released.
>> What advantage does this give us? The reservations are intended to be
>> hints, so this should not be a problem,
>>
>> Steve.
>
> Hi Steve,
>
> I've been working toward a block reservation system that allows multiple
> writers to allocate blocks while sharing resource groups (rgrps) that
> are locked for EX. The goal, of course, is to improve write concurrency
> and eliminate intra-node write imbalances.
>
> My patches all work reasonably well until rgrps fill up and get down
> to their last few blocks, in which case the spans of free blocks aren't
> long enough for a minimum reservation, due to rgrp fragmentation. With
> today's current code, multiple processes doing block allocations can't
> bump heads and interfere with one another because they each lock the
> rgrp in EX for a span that covers from the block reservation all the
> way up to block allocation. But when we enable multiple processes to
> share the rgrp, we need to ensure they can't over-commit the rgrp.
>
> So the problem is over-commitment of rgrps.
>
> For example, let's say you have 10 processes that each want to write
> 5 blocks. The rgrp glock is locked in EX and they begin sharing it.
> When they check the rgrp's free space in inplace_reserve, each one of
> the 10, in turn, asks the question, "Does this rgrp have 5 free blocks
> available?" Let's say the rgrp has 20 free blocks, so of course the
> answer is "yes" for all 10 processes. But when they go to actually
> allocate those blocks, the first 4 processes use up those 20 free
> blocks. At that point, the rgrp is completely full, but the other 6
> processes are over-committed to use that rgrp based on their
> requirements. Now we have 6 processes that are unable to allocate
> a single block from an rgrp that had previously been deemed to have
> enough.
>
> So basically, the problem is that our current block reservation system
> has a concept of (a) "This rgrp has X reservations," and (b) "there are
> Y blocks left over that cannot be reserved, available for general use."
> That allows for rgrps to be over-committed.
>
> (1) My proposed solution
>
> To allow for rgrp sharing, I'm trying to tighten things up and
> eliminate the "slop". I'm trying to transition the system from a simple
> hint to an actual "promise", i.e. an accounting system so that
> over-committing is not possible. With this new system there is still a
> concept of (a) "This rgrp has X reservations," but (b) now becomes
> "This rgrp has Y of my remaining blocks promised to various processes."
> So accounting is done to keep track of how many of the free blocks are
> promised for allocations outside of reservations. After all block
> allocations have been done for a transaction, any remaining blocks that
> have been promised from the rgrp to that process are rescinded, which
> means they go back to the general pool of free blocks for other
> processes to use.

I'd call that a reservation, as opposed to what the code does currently.
Right now, processes are asking for a TARGET number of blocks and at
least MIN_TARGET blocks, but they get whatever is available in the
chosen resource group. There is no checking whether processes overrun
what they've been asking for, and we don't know if the reservations
were large enough.

> There are, of course, other ways we can do this. For example, we can:
>
> (2) Alternate solution 1 - "the rest goes to process n"
>
> Automatically assign "all remaining unreserved blocks" to the first
> process needing them and force all the others to a different rgrp.
> But after years of use, when the rgrps become severely fragmented,
> that system would cause much more of a slowdown.
>
> (3) Alternate solution 2 - "hold the lock from reservation to allocation"
>
> Of course, we could also block other processes from the rgrp from
> reservation to allocation, but that would have almost no advantage
> over what we do today: it would pretty much negate rgrp sharing
> and we'd end up with the write imbalance problems we have today.
>
> (4) Alternate solution 3 - "Hint with looping"
>
> We could also put a system in place whereby we still use "hints".
> Processes that had called inplace_reserve for a given rgrp, but
> are now out of free blocks because of over-commitment (we had a
> hint, not a promise) must then loop back around and call function
> inplace_reserve again, searching for a different rgrp that might
> work instead, in which case they could run into the same situation
> multiple times. That would most likely be a performance disaster.
>
> (5) Alternate solution 4 - "One (or small) block reservations"
>
> We could allow for one-block (or at least small-block) reservations
> and keep a queue of them or something to fulfill a multi-block
> allocation requirement. I suspect this would be a nightmare of
> kmem_cache_alloc requests and have a lot more overhead than simply
> making promises.
>
> (6) Alternate solution 5 - "assign unique rgrps"
>
> We could go back to a system where multiple allocators are given
> unique rgrps to work on (which I've proposed in the past, but which
> was rejected), but it's pretty much what RHEL6 and prior releases
> do by using "try" locks on rgrps. (Which is why simultaneous
> allocators often perform better on RHEL6 and older.)
>
> In my opinion, there's no advantage to using a hint when we can
> do actual accounting to keep track of spans of blocks too small
> to be considered for a reservation.

AFAIK, fallocate is the only caller based on "give me all you have"
semantics. It shouldn't be hard to change that to simply take out
pretty large reservations; regular writes should have a reasonably
good idea how many blocks they may require. And with the recent iomap
restructuring, they'll know exactly how many blocks they'll require
pretty soon. So I'm all for moving to actual reservations.

> Regards,
>
> Bob Peterson
> Red Hat File Systems

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 4+ messages in thread
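The sizing Andreas describes, where a caller asks for a TARGET number of blocks but accepts at least MIN_TARGET, could pair with firm reservations as a simple clamp against an rgrp's uncommitted free space. This helper is a hypothetical illustration of that idea, not the actual GFS2 allocation code:

```c
#include <assert.h>

/* Hypothetical helper: size a firm reservation between the caller's
 * minimum and target, given how many uncommitted free blocks the chosen
 * rgrp still has. Returning 0 means not even the minimum fits, so the
 * caller must look for a different rgrp instead of over-committing
 * this one. */
static unsigned int size_reservation(unsigned int target,
				     unsigned int min_target,
				     unsigned int uncommitted)
{
	if (uncommitted < min_target)
		return 0;  /* below the minimum: try another rgrp */
	return uncommitted < target ? uncommitted : target;
}
```

With this shape, a nearly full rgrp hands out a smaller but still guaranteed reservation rather than a hint that may later prove empty.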