From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Mon, 23 Jul 2018 11:29:20 -0400 (EDT)
Subject: [Cluster-devel] [GFS2 PATCH] gfs2: Return all reservations when rgrp_brelse is called
In-Reply-To: <85954133-6611-1665-c0b6-d422d16f4d9c@redhat.com>
References: <963882949.50535174.1531513587686.JavaMail.zimbra@redhat.com> <85954133-6611-1665-c0b6-d422d16f4d9c@redhat.com>
Message-ID: <972293249.53472433.1532359760435.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> > Before this patch, function gfs2_rgrp_brelse would release its
> > buffer_heads for the rgrp bitmaps, but it did not release its
> > reservations. The problem is: when we need to call brelse, we're
> > basically letting go of the bitmaps, which means our reservations
> > are no longer valid: someone on another node may have reserved those
> > blocks and even used them. This patch makes the function return all
> > the block reservations held for the rgrp whose buffer_heads are
> > being released.
>
> What advantage does this give us? The reservations are intended to be
> hints, so this should not be a problem.
>
> Steve.

Hi Steve,

I've been working toward a block reservation system that allows multiple writers to allocate blocks while sharing resource groups (rgrps) that are locked for EX. The goal, of course, is to improve write concurrency and eliminate intra-node write imbalances.

My patches all work reasonably well until rgrps fill up and get down to their last few blocks, at which point the spans of free blocks aren't long enough for a minimum reservation, due to rgrp fragmentation. With the current code, multiple processes doing block allocations can't bump heads and interfere with one another, because each one locks the rgrp in EX for a span that covers everything from block reservation all the way through block allocation.
But when we enable multiple processes to share the rgrp, we need to ensure they can't over-commit it. So the problem is over-commitment of rgrps.

For example, say you have 10 processes that each want to write 5 blocks. The rgrp glock is locked in EX and they begin sharing it. When they check the rgrp's free space in inplace_reserve, each of the 10, in turn, asks the question, "Does this rgrp have 5 free blocks available?" Say the rgrp has 20 free blocks, so of course the answer is "yes" for all 10 processes. But when they go to actually allocate those blocks, the first 4 processes use up those 20 free blocks. At that point the rgrp is completely full, but the other 6 processes are still committed to use that rgrp based on their requirements. Now we have 6 processes that are unable to allocate a single block from an rgrp that had previously been deemed to have enough.

So basically, the problem is that our current block reservation system has a concept of (a) "This rgrp has X reservations," and (b) "there are Y blocks left over that cannot be reserved for general use." That allows rgrps to be over-committed.

(1) My proposed solution
To allow for rgrp sharing, I'm trying to tighten things up and eliminate the "slop": to transition the system from a simple hint to an actual "promise", i.e. an accounting system under which over-committing is not possible. With this new system there is still a concept of (a) "This rgrp has X reservations," but (b) now becomes "This rgrp has Y of its remaining blocks promised to various processes." So accounting is done to keep track of how many of the free blocks are promised for allocations outside of reservations. After all block allocations have been done for a transaction, any remaining blocks that had been promised from the rgrp to that process are rescinded, which means they go back to the general pool of free blocks for other processes to use.

There are, of course, other ways we can do this.
For example, we can:

(2) Alternate solution 1 - "the rest goes to process n"
Automatically assign "all remaining unreserved blocks" to the first process needing them and force all the others to a different rgrp. But after years of use, when the rgrps become severely fragmented, that system would cause much more of a slowdown.

(3) Alternate solution 2 - "hold the lock from reservation to allocation"
We could also block other processes from the rgrp from reservation through allocation, but that would have almost no advantage over what we do today: it would pretty much negate rgrp sharing and we'd end up with the write imbalance problems we have today.

(4) Alternate solution 3 - "Hint with looping"
We could also put a system in place whereby we still use "hints". Processes that had called inplace_reserve for a given rgrp, but are now out of free blocks because of over-commitment (we had a hint, not a promise), must then loop back around and call inplace_reserve again, searching for a different rgrp that might work instead, and they could run into the same situation multiple times. That would most likely be a performance disaster.

(5) Alternate solution 4 - "One (or small) block reservations"
We could allow for one-block (or at least small-block) reservations and keep a queue of them, or something similar, to fulfill a multi-block allocation requirement. I suspect this would be a nightmare of kmem_cache_alloc requests and have a lot more overhead than simply making promises.

(6) Alternate solution 5 - "assign unique rgrps"
We could go back to a system where multiple allocators are given unique rgrps to work on (which I've proposed in the past, but which was rejected). That is pretty much what RHEL6 and prior releases do by using "try" locks on rgrps. (Which is why simultaneous allocators often perform better on RHEL6 and older.)
In my opinion, there's no advantage to using a hint when we can do actual accounting to keep track of spans of blocks too small to be considered for a reservation.

Regards,

Bob Peterson
Red Hat File Systems