On Tue, Nov 12, 2019 at 08:49:33PM +0100, Goffredo Baroncelli wrote:
> On 12/11/2019 16.13, Hubert Tonneau wrote:
> > Hi,
> >
> > In order to close the RAID5 write hole, I propose to add a mount option that would change RAID5 (and RAID6) behaviour :
> >
> > . When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)
>
> You can't overwrite and convert an existing stripe, for two kinds of reasons:
> 1) you still have to protect the stripe overwriting from the write hole
> 2) depending on the layout, a raid1 stripe consumes more space than a raid5 stripe with equal "capacity"
>
> So you have to write (temporarily) the data in another place. This is not very different from what Qu proposed a few years ago:
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg66472.html [Btrfs: Add journal for raid5/6 writes]
>
> where he added a device for logging the writes.
>
> Unfortunately, this means doubling the writes, which for a COW filesystem (which already suffers this kind of issue) would be a big performance penalty....
>
> Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for the data and not for the metadata.
> Pros: the data is written only once
> Cons: the pressure on the metadata would increase; the fragmentation would increase

The write hole issue is caused by updating a RAID stripe that contains committed data, and then not being able to finish that update because of a crash or power loss. You avoid this using two strategies:

1. never modify RAID stripes while they have committed data in them, or

2. use journalling, so that a crash can never prevent a RAID stripe update from being completed.

You can even do both, e.g. use strategy #1 for datacow files and strategy #2 for nodatacow files. IMHO we don't need to help nodatacow files survive RAID5/6 failure events, because we don't help nodatacow files survive raid1, raid1c3, raid1c4, raid10, or dup failure events either; but opinions differ, so, fine, there's strategy #2 if you want it.

Other filesystems use strategy #1, but they have different layering between the CoW allocator and the RAID layer: they put parity blocks in-band in extents, so every extent is always a complete set of RAID stripes. That would be a huge on-disk format change for btrfs (as well as a rewrite of half the kernel implementation) that nobody wants to do. The end result would behave almost, but not quite, like the way btrfs currently handles compression. It's also not fixing the current btrfs raid5/6; it's deprecating them entirely and doing something better instead.

Back to fixing the existing btrfs profiles. Any time we write to a stripe that is not occupied by committed data on btrfs, we avoid the conditions for the write hole. The existing CoW mechanisms handle this, so nothing needs to be changed there. We only need to worry about writes to stripes that contain data committed in earlier transactions, and we can tell when we are doing this by looking at 'gen' fields in neighboring extents whenever we insert an extent into a RAID5/6 block group.

We can get strategy #1 on btrfs by making two small(ish) changes:

1.1. allocate blocks strictly on stripe-aligned boundaries (see the sketch below).

1.2. add a new balance filter that selects only partially filled RAID5/6 stripes for relocation.
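To make 1.1 concrete, here is a rough user-space sketch of the alignment arithmetic. It is not btrfs code; the function names and standalone layout are made up for illustration, and it only assumes the usual btrfs constants (64K stripe elements, one parity device for raid5, two for raid6):

/* Toy illustration of per-block-group stripe-aligned allocation
 * boundaries.  Not btrfs code; names are made up. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN (64 * 1024ULL)	/* btrfs stripe element size */

/* Data bytes in one full stripe of a block group with num_devices
 * devices and nparity parity devices (1 for raid5, 2 for raid6). */
static uint64_t full_stripe_bytes(int num_devices, int nparity)
{
	return (uint64_t)(num_devices - nparity) * STRIPE_LEN;
}

/* Round an allocation offset up to the next full-stripe boundary, so a
 * new extent never lands in a stripe that may already hold committed
 * data from an earlier transaction. */
static uint64_t stripe_align(uint64_t offset, uint64_t full_stripe)
{
	return (offset + full_stripe - 1) / full_stripe * full_stripe;
}

int main(void)
{
	/* 5-disk raid5: 4 data devices -> 256K of data per full stripe. */
	uint64_t fs5 = full_stripe_bytes(5, 1);
	/* 6-disk raid6: also 4 data devices -> also 256K. */
	uint64_t fs6 = full_stripe_bytes(6, 2);

	printf("raid5 x5 full stripe: %lluK\n", (unsigned long long)(fs5 / 1024));
	printf("raid6 x6 full stripe: %lluK\n", (unsigned long long)(fs6 / 1024));
	/* A 4K allocation hint at offset 300K gets pushed out to the next
	 * 256K boundary (512K); the partially filled stripe behind it is
	 * left for balance to repack later. */
	printf("aligned(300K) = %lluK\n",
	       (unsigned long long)(stripe_align(300 * 1024, fs5) / 1024));
	return 0;
}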
The 'ssd' mount option already does 1.1, but it only works for RAID5 arrays with 5 disks and RAID6 arrays with 6 disks because it uses a fixed allocation boundary (those are the two layouts where the data portion of a full stripe is 4 x 64K = 256K), and it only works for metadata because...it's coded to work only on metadata. The change would be to have btrfs select an allocation boundary for each block group based on the number of disks in the block group (no new behavior for block groups that aren't raid5/6), and do aligned allocations for both data and metadata.

This creates a problem with free space fragmentation, which we solve with change 1.2. Implementing 1.2 allows balance to repack partially filled stripes into complete stripes, which you will have to do fairly often if you are allocating data strictly on RAID-stripe-aligned boundaries. On a 5-disk array, "write 4K then fsync" uses 256K of disk space: since writes to partially filled stripes would not be allowed, we have 252K of wasted space and 4K in use. Balance could later pack 64 such 4K extents into a single RAID5 stripe, recovering all the wasted space. Defrag can perform a similar function, collecting multiple 4K extents into a single 256K or larger extent that can be written in a single transaction without wasting space.

Strategy #2 requires some disk format changes:

2.1. add a new block group type for metadata that uses simple replication (raid1c3/raid1c4, already done).

2.2. record all data blocks to be written to partially filled RAID5/6 stripes in a journal before modifying any blocks in the stripe.

The journal in 2.2 could be some extension of the log tree or a separate tree. As long as we can guarantee that the data blocks of any partial RAID5/6 RMW stripe update are all journalled before we start updating the committed stripe, we can update any blocks we want. We don't need to journal the parity blocks; we can just recompute them from the logged data blocks if the updated device goes missing (see the sketch below). After a crash, the journal must be replayed so that there are no incomplete stripe updates. Normally there would be at most one partial stripe update per transaction, unless the filesystem is really full and we are forced to start filling in old incomplete stripes. Full stripe writes don't need any intervention; the existing btrfs CoW mechanisms are fine.

Strategy #1 requires no disk format changes. It just changes the allocator and balance behavior. Userspace changes would not be immediately required, though without running balance to clean up partially filled RAID stripes, performance would degrade after some time. New kernels will be able to write raid5/6 updates without the write hole; old kernels won't.

Strategy #2 requires multiple disk format changes: raid1c3/c4 (which we now have) and raid5/6 data block journalling extensions (which we don't). A kernel that didn't know how to replay the log would not be able to fix write holes on mount.

Note that the number of writes is similar between the two strategies. Everything is written in two places, but strategy #1 allows the user to choose when the second write happens. This allows for batched updates, or maybe the user deletes or overwrites the data before we even have to bother relocating it. Strategy #2 always writes every journalled data block twice (not counting parity and mirroring), but we can keep the number of journalled blocks to a minimum.
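To illustrate why the journal in 2.2 only needs the data blocks, here is a toy user-space sketch (not btrfs code; the 4+1 layout, block size, and helper names are made up). It shows that once the journalled data blocks of a partial stripe update have been replayed, the raid5 parity can simply be recomputed with XOR and the stripe reconstructs cleanly if a device is lost:

/* Toy raid5 stripe in memory: 4 data blocks + 1 XOR parity block.
 * Demonstrates that a journal only needs the new data blocks; parity
 * can always be recomputed after replay.  Not btrfs code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA 4
#define BLK   4096

static uint8_t data[NDATA][BLK];
static uint8_t parity[BLK];

/* parity = XOR of all data blocks in the stripe */
static void recompute_parity(void)
{
	memset(parity, 0, BLK);
	for (int d = 0; d < NDATA; d++)
		for (int i = 0; i < BLK; i++)
			parity[i] ^= data[d][i];
}

/* Rebuild a missing data block from parity and the surviving blocks. */
static void reconstruct(int missing, uint8_t *out)
{
	memcpy(out, parity, BLK);
	for (int d = 0; d < NDATA; d++)
		if (d != missing)
			for (int i = 0; i < BLK; i++)
				out[i] ^= data[d][i];
}

int main(void)
{
	uint8_t journalled[BLK], rebuilt[BLK];

	/* The stripe holds data committed in an earlier transaction. */
	for (int d = 0; d < NDATA; d++)
		memset(data[d], 'A' + d, BLK);
	recompute_parity();

	/* A later transaction overwrites block 2.  The new contents went
	 * into the journal first, so a crash mid-RMW is repaired by
	 * replaying this block... */
	memset(journalled, 'Z', BLK);
	memcpy(data[2], journalled, BLK);
	/* ...and recomputing parity from the replayed data blocks. */
	recompute_parity();

	/* If a device now goes missing, the stripe is still consistent. */
	reconstruct(2, rebuilt);
	assert(memcmp(rebuilt, journalled, BLK) == 0);
	printf("replayed stripe reconstructs cleanly\n");
	return 0;
}

raid6 is the same idea with a second, differently computed parity block, so the journal still only needs the data blocks.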
> > . Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)
> >
> > Expected advantages are :
> > . the low level feature set basically remains the same
> > . the filesystem format remains the same
> > . old kernels and btrfs-progs would not be disturbed
> >
> > The end result would be a mixed filesystem where active parts are RAID1 and archive ones are RAID5.
> >
> > Regards,
> > Hubert Tonneau
> >
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5