From: Andreas Gruenbacher
Date: Mon, 22 Feb 2021 14:03:34 +0100
Subject: [Cluster-devel] Recording extents in GFS2
References: <93452e4c-af81-225f-34f5-f297473164ae@redhat.com>
To: cluster-devel.redhat.com

On Mon, Feb 22, 2021 at 12:41 PM Andreas Gruenbacher wrote:
> On Mon, Feb 22, 2021 at 11:21 AM Steven Whitehouse wrote:
>
>> Hi,
>>
>> On 20/02/2021 09:48, Andreas Gruenbacher wrote:
>>
>> Hi all,
>>
>> once we change the journal format, in addition to recording block numbers
>> as extents, there are some further issues we should address at the same
>> time:
>>
>> I. The current transaction format of our journals is as follows:
>>
>>    - One METADATA log descriptor block for each [503 / 247 / 119 / 55]
>>    metadata blocks, followed by those metadata blocks. For each metadata
>>    block, the log descriptor records the 64-bit block number.
>>    - One JDATA log descriptor block for each [251 / 123 / 59 / 27]
>>    journaled data blocks, followed by those data blocks. For each data
>>    block, the log descriptor records the 64-bit block number and a second
>>    64-bit field indicating whether the block needed escaping.
>>    - One REVOKE log descriptor block for the initial [503 / 247 / 119 /
>>    55] revokes, followed by a metadata header (not to be confused with
>>    the log header) for each additional [509 / 253 / 125 / 61] revokes.
>>    Each revoke is recorded as a 64-bit block number in its REVOKE log
>>    descriptor or metadata header.
>>    - One log header with various necessary and useful metadata that acts
>>    as a COMMIT record. If the log header is incorrect or missing, the
>>    preceding log descriptors are ignored.
>>
>> ^^^^
>> succeeding? (I hope!)
>>
>
> No, we call lops_before_commit (which writes the various log descriptors,
> metadata, and journaled data blocks) before writing the log header in
> log_write_header -> gfs2_write_log_header. In that sense, we could call
> it a trailer.
>
>> We should change that so that a single log descriptor contains a number
>> of records. There should be records for METADATA and JDATA blocks that
>> follow, as well as for REVOKES and for COMMIT. If a transaction contains
>> metadata and/or jdata blocks, those will obviously need a precursor and a
>> commit block like today, but we shouldn't need separate blocks for
>> metadata and journaled data in many cases. Small transactions that
>> consist only of revokes and a commit should frequently fit into a single
>> block entirely, though.
>>
>> Yes, it makes sense to try and condense what we are writing. Why would we
>> not need to have separate blocks for journaled data though? That one
>> seems difficult to avoid, and since it is used so infrequently, perhaps
>> not such an important issue.
>>
> Journaled data would of course still need to be written. We could have a
> single log descriptor with METADATA and JDATA records, followed by the
> metadata and journaled data blocks, followed by a log descriptor with a
> COMMIT record.
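
To make that a little more concrete, here is a rough sketch of what a
record-based descriptor block could carry. None of the names or field
layouts below are meant to be an actual on-disk format; they only
illustrate the idea of one block holding several typed records:

    #include <stdint.h>

    /* Hypothetical record types (illustration only). */
    #define LOG_RECORD_METADATA  1  /* metadata blocks follow the descriptor */
    #define LOG_RECORD_JDATA     2  /* journaled data blocks follow, with escape flags */
    #define LOG_RECORD_REVOKE    3  /* list of revoked block numbers */
    #define LOG_RECORD_COMMIT    4  /* closes the transaction */

    struct log_record {
            uint32_t lr_type;      /* LOG_RECORD_* */
            uint32_t lr_count;     /* number of 64-bit entries that follow */
            uint64_t lr_entries[]; /* block numbers (plus escape flags for JDATA) */
    };

A transaction consisting only of a handful of revokes would then be a
single block containing one REVOKE record followed by one COMMIT record,
with no separate log header write.
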
>
>> Right now, we're writing log headers ("commits") with REQ_PREFLUSH to
>> make sure all the log descriptors of a transaction make it to disk before
>> the log header. Depending on the device, this is often costly. If we can
>> fit an entire transaction into a single block, REQ_PREFLUSH won't be
>> needed anymore.
>>
>> I'm not sure I agree. The purpose of the preflush is to ensure that the
>> data and the preceding log blocks are really on disk before we write the
>> commit record. That will still be required while we use ordered writes,
>> even if we can use (as you suggest below) a checksum to cover the whole
>> transaction, and thus check for a complete log record after the fact.
>> Also, we would still have to issue the flush in the case of an
>> fsync-derived log flush, too.
>>
>> III. We could also checksum entire transactions to avoid REQ_PREFLUSH. At
>> replay time, all the blocks that make up a transaction will either be
>> there and the checksum will match, or the transaction will be invalid.
>> With CPU support for CRC32C nowadays, this should no longer be
>> prohibitively expensive, but depending on the hardware, it may make sense
>> to turn this off.
>>
>> IV. We need recording of unwritten blocks / extents (allocations /
>> fallocate), as this will significantly speed up moving glocks from one
>> node to another:
>>
>> That would definitely be a step forward.
>>
>> At the moment, data=ordered is implemented by keeping a list of all
>> inodes that did an ordered write. When it comes time to flush the log,
>> the data of all those ordered inodes is flushed first. When all we want
>> is to flush a single glock in order to move it to a different node, we
>> currently flush all the ordered inodes as well as the journal.
>>
>> If we only flushed the ordered data of the glock being moved plus the
>> entire journal, the ordering guarantees for the other ordered inodes in
>> the journal would be violated. In that scenario, unwritten blocks could
>> (and would) show up in files after crashes.
>>
>> If we instead record unwritten blocks in the journal, we'll know which
>> blocks need to be zeroed out at recovery time. Once an unwritten block is
>> written, we record a REVOKE entry for that block.
>>
>> This comes at the cost of tracking those blocks of course, but with that
>> in place, moving a glock from one node to another will only require
>> flushing the underlying inode (assuming it's an inode glock) and the
>> journal. And most likely, we won't have to bother with implementing
>> "simple" transactions as described in
>> https://bugzilla.redhat.com/show_bug.cgi?id=1631499.
>>
>> Thanks,
>> Andreas
>>
>> That would be another way of looking at the problem, yes. It does add a
>> lot to the complexity though, and it doesn't scale very well on systems
>> with large amounts of memory (and therefore potentially lots of unwritten
>> extents to record & keep track of). If there are lots of small
>> transactions, then each one might be significantly expanded by the need
>> to write the information needed to track what has not been written yet.
>>
>> If we keep track of individual allocations/deallocations, as per Abhi's
>> suggestion, then we know where the areas are which may potentially have
>> unwritten data in them. That may allow us to avoid having to do the data
>> writeback ahead of the journal flush in the first place, moving things
>> more towards the XFS way of doing things.
>>
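
Interjecting a sketch here, since this is where the tracking question comes
up: recording unwritten extents in the journal and revoking them as they
get written could look roughly like the following. All of the names are
invented for illustration; this isn't meant as an actual format:

    #include <stdint.h>

    /* Journaled when blocks are allocated without being written (fallocate etc.). */
    struct unwritten_record {
            uint64_t ur_ino;    /* inode the extent belongs to */
            uint64_t ur_start;  /* first allocated-but-unwritten block */
            uint32_t ur_len;    /* number of blocks in the extent */
    };

    /*
     * Once one of those blocks has actually been written (and the data has
     * reached the disk), a revoke for it goes into the journal, much like
     * we revoke metadata blocks today.  At recovery time, any recorded
     * unwritten block that was never revoked gets zeroed out (or re-marked
     * as unwritten), so stale data can't show up in a file after a crash.
     */
    static void replay_unwritten(const struct unwritten_record *ur,
                                 int (*is_revoked)(uint64_t blkno),
                                 void (*zero_block)(uint64_t blkno))
    {
            uint64_t blkno;

            for (blkno = ur->ur_start; blkno < ur->ur_start + ur->ur_len; blkno++)
                    if (!is_revoked(blkno))
                            zero_block(blkno);
    }

The extra records and revokes in the journal are of course exactly the
overhead Steven is concerned about above.
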
> Well, allocations and unwritten data are essentially the same thing; I may
> not have said that very clearly. So avoiding unnecessary ordered data
> write-out is *exactly* what I'm proposing here. When moving a glock from
> one node to another, we most certainly do want to write out the ordered
> data of that specific inode, however. The problem is that tracking
> allocations is worthless if we don't record one of the following things in
> the journal: either (a) which of the unwritten blocks have been written
> already, or (b) the fact that all unwritten blocks of an inode have now
> been written. When moving a glock from one node to another, (b) may be
> relatively easy to ascertain, but in a running system, we may never reach
> that state.

To expand on this a little: fsync is a point at which (b) is achieved,
because we don't allow multiple local processes concurrent "EX" access to a
file today. This isn't really a desired property of the filesystem, though;
other filesystems allow a lot more concurrency. So before too long, we
might end up in a situation where an fsync only guarantees that all
previous writes will be synced to disk. The resource group glock sharing is
a move in that direction.

Andreas
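
PS: For reference, the per-block-size counts in I. above follow from the
on-disk header sizes. Assuming a 72-byte log descriptor header (struct
gfs2_log_descriptor) and a 24-byte metadata header (struct gfs2_meta_header),
with 8 bytes per block number and 16 bytes per jdata entry (the sizes are
from memory, so please double-check gfs2_ondisk.h):

    /* Entries per block for 4096 / 2048 / 1024 / 512-byte filesystem blocks. */
    #define LD_HEADER_SIZE  72      /* assumed sizeof(struct gfs2_log_descriptor) */
    #define MH_SIZE         24      /* assumed sizeof(struct gfs2_meta_header) */

    static unsigned int metadata_per_descriptor(unsigned int bsize)
    {
            return (bsize - LD_HEADER_SIZE) / 8;    /* 503 / 247 / 119 / 55 */
    }

    static unsigned int jdata_per_descriptor(unsigned int bsize)
    {
            return (bsize - LD_HEADER_SIZE) / 16;   /* 251 / 123 / 59 / 27 */
    }

    static unsigned int revokes_per_continuation(unsigned int bsize)
    {
            return (bsize - MH_SIZE) / 8;           /* 509 / 253 / 125 / 61 */
    }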