[PATCH 0/1] zoned: moving superblock logging zones

From: Naohiro Aota @ 2021-03-15  5:53 UTC
  To: linux-btrfs, dsterba; +Cc: Naohiro Aota

The following patch changes the location of the superblock logging zones
from fixed zone numbers to fixed LBAs.

Here is some background on how the superblock works on zoned btrfs.

This document will be promoted to btrfs-dev-docs in the future.

# Superblock logging for zoned btrfs

The superblock and its copies are the only data structures in btrfs with a
fixed location on a device. Since we cannot overwrite these blocks if they
are placed in sequential write required zones, we cannot use the regular
method of updating superblocks with zoned btrfs. We also cannot limit the
position of superblocks to conventional zones as that would prevent using
zoned block devices that do not have this zone type (e.g. NVMe ZNS SSDs).

To solve this problem, we use superblock log writing. This method uses two
sequential write required zones as a circular buffer to write updated
superblocks. Once the first zone is filled up, we start writing into the
second zone. When both zones are filled up, the first zone is reset before
it is written again, and writing continues in the first zone. Once the
first zone is full again, we reset the second zone and write the latest
superblock into it. With this logging, we can always determine the position
of the latest superblock by inspecting the zones' write pointer information
provided by the device. One corner case is when both zones are full. In
this situation, we read the last superblock of each zone and compare them
to determine which copy is the latest one.
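
The rotation described above can be modelled with a small toy simulation.
This is only a sketch under stated assumptions, not the code in
fs/btrfs/zoned.c: `Zone`, `latest_zone()` and `write_super()` are
hypothetical names, each zone holds a handful of superblock slots, and a
superblock is represented only by its generation number.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy sequential zone; len(log) plays the role of the write pointer."""
    capacity: int                              # superblock slots (assumed)
    log: list = field(default_factory=list)    # generations written so far

    @property
    def empty(self):
        return len(self.log) == 0

    @property
    def full(self):
        return len(self.log) == self.capacity

    def reset(self):
        self.log.clear()


def latest_zone(zones):
    """Return the zone holding the newest superblock, or None if both are empty."""
    z0, z1 = zones
    if z0.empty and z1.empty:
        return None
    if z0.full and z1.full:
        # Corner case: write pointers cannot tell, compare the last entries.
        return z0 if z0.log[-1] > z1.log[-1] else z1
    if z0.full:
        return z0 if z1.empty else z1
    if z0.empty:
        # E.g. a crash right after resetting the first zone.
        return z1
    return z0


def write_super(zones, generation):
    """Append one superblock, resetting the other zone when switching to it."""
    z0, z1 = zones
    cur = latest_zone(zones)
    if cur is None:
        target = z0
    elif not cur.full:
        target = cur
    else:
        target = z1 if cur is z0 else z0
        if not target.empty:
            target.reset()
    target.log.append(generation)
```

With two slots per zone, seven consecutive writes (generations 1 to 7)
leave generations 5 and 6 in the first zone and generation 7 in the second
zone, and `latest_zone()` picks the second zone.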

## Placement of superblock logging zones

We use the following three pairs of zones containing fixed offset
locations, regardless of the device zone size.

  - Primary superblock: zone starting at offset 0 and the following zone
  - First copy: zone containing offset 64GB and the following zone
  - Second copy: zone containing offset 256GB and the following zone

These zones are reserved for superblock logging and never used for data or
metadata blocks. Zones containing the offsets used to store superblocks in
a regular btrfs volume (the non-zoned case) are also reserved to avoid
confusion.

The first copy position is much higher than for a regular btrfs volume
(64M). This increase is to avoid overlapping with the log zones for the
primary superblock. The higher location is arbitrary, but it allows
supporting devices with very large zone sizes, up to 32GB. For now,
however, we only allow zone sizes up to 8GB.
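
As an illustration of the placement rule, the sketch below maps the three
fixed offsets to zone numbers for a given zone size and checks that the
pairs do not overlap. The function names are hypothetical, and the offsets
are assumed to be binary units (64GB = 64 * 2^30 bytes).

```python
GiB = 1 << 30

# Fixed offsets of the primary superblock, first copy and second copy.
SB_LOG_OFFSETS = (0, 64 * GiB, 256 * GiB)


def sb_log_zone_pairs(zone_size):
    """Return the (first, second) zone numbers of each log pair."""
    if zone_size <= 0 or zone_size & (zone_size - 1):
        raise ValueError("zone size must be a power of 2")
    pairs = []
    for offset in SB_LOG_OFFSETS:
        first = offset // zone_size    # zone containing the fixed offset
        pairs.append((first, first + 1))
    return pairs


def pairs_overlap(zone_size):
    """True if any zone would be shared between two log pairs."""
    zones = [z for pair in sb_log_zone_pairs(zone_size) for z in pair]
    return len(zones) != len(set(zones))
```

For a 256MB zone size this gives the zone pairs (0, 1), (256, 257) and
(1024, 1025). With a 32GB zone size the pairs are (0, 1), (2, 3) and
(8, 9), which still do not overlap, while a 64GB zone size would make the
primary pair and the first copy share zone 1.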

## Writing superblock in conventional zones

Conventional zones do not have a write pointer. This zone type thus cannot
be used with superblock logging since determining the position of the
latest copy of the superblock in a zone pair would be impossible.

To address this problem, if either of the zones containing the fixed offset
locations for superblock logging is a conventional zone, superblock updates
are done in-place using the first block of the conventional zone.

## Reading zoned btrfs dump image without zone information

Reading a zoned btrfs image without zone information is challenging but
possible.

We can always find a superblock copy at or after the fixed offset locations
that determine the position of the logging zones. With such a copy, the
superblock incompatible flags indicate whether the volume is zoned. With a
chunk item in the sys_chunk_array, we can determine the zone size from the
size of a device extent, itself determined from the chunk length,
num_stripes, and sub_stripes. With this information, all blocks within the
2 logging zones containing the fixed locations can be inspected to find the
newest superblock copy.
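
The final inspection step could look like the following sketch. It assumes
the usual btrfs on-disk superblock layout (4KiB superblocks, the magic
`_BHRfS_M` at byte 0x40 and the little-endian 64-bit generation at byte
0x48) and skips checksum verification, which a real tool would need.
`newest_super()` is a hypothetical helper, and `image` is any bytes-like
object holding the dump.

```python
import struct

SB_SIZE = 4096              # size of one superblock slot
BTRFS_MAGIC = b"_BHRfS_M"
MAGIC_OFFSET = 0x40         # assumed offset of the magic in the superblock
GENERATION_OFFSET = 0x48    # assumed offset of the generation field


def newest_super(image, start, length):
    """Scan [start, start + length) and return (offset, generation) of the
    superblock with the highest generation, or None if none is found."""
    best = None
    for offset in range(start, start + length, SB_SIZE):
        block = image[offset:offset + SB_SIZE]
        if len(block) < SB_SIZE:
            break   # ran past the end of the image
        if block[MAGIC_OFFSET:MAGIC_OFFSET + 8] != BTRFS_MAGIC:
            continue   # not a superblock (no checksum check in this sketch)
        (generation,) = struct.unpack_from("<Q", block, GENERATION_OFFSET)
        if best is None or generation > best[1]:
            best = (offset, generation)
    return best
```

Here `start` would be the beginning of the first logging zone of a pair
and `length` twice the zone size, so that both zones are covered.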

The first zone of a log pair may be empty and have no superblock copy. This
can happen if a system crashes after resetting the first zone of a pair and
before writing out a new superblock. In this case, a superblock copy can be
found in the second zone of the log pair. The start of this second zone can
be found by inspecting the blocks located at the fixed offset of the log
pair plus each possible zone size (4M [1], 8M, 16M, 32M, 64M, 128M, 256M,
512M, 1G, 2G, 4G, 8G [2])[3]. Once we find a superblock, we can follow the
same procedure as above to find the latest superblock copy within the zone
log pair.

[1] 4M = BTRFS_MKFS_SYSTEM_GROUP_SIZE. We cannot mkfs on a device with a
zone size less than 4MB because we cannot create the initial temporary
system chunk of that size.
[2] The maximum size we support for now.
[3] The zone size is limited to these 12 cases, as it must be a power of 2.
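
The candidate locations for the start of the second zone can be
enumerated as in the sketch below. The helper names are hypothetical; the
minimum and maximum zone sizes are the 4M and 8G limits given above.

```python
MiB = 1 << 20
GiB = 1 << 30

MIN_ZONE_SIZE = 4 * MiB    # BTRFS_MKFS_SYSTEM_GROUP_SIZE
MAX_ZONE_SIZE = 8 * GiB    # largest zone size supported for now


def possible_zone_sizes():
    """All power-of-2 zone sizes between the minimum and the maximum."""
    sizes = []
    size = MIN_ZONE_SIZE
    while size <= MAX_ZONE_SIZE:
        sizes.append(size)
        size *= 2
    return sizes


def second_zone_candidates(fixed_offset):
    """Offsets where the second zone of a log pair may start."""
    return [fixed_offset + size for size in possible_zone_sizes()]
```

Since every candidate zone size is a power of 2 that divides 64GB, the
fixed offsets of the copies are zone-aligned, so adding the zone size to a
fixed offset lands exactly on the start of the following zone.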

Once we find the latest superblock, it is no different than reading a
regular btrfs image. You can further confirm the determined zone size by
comparing it with the size of a device extent because it is the same as the
zone size.

In addition, since the write offset within the logging buffer differs
between the primary and the copies [4], the first zone of each pair is
reset at a different time. So, if the superblock at offset 0 is missing, we
can also try reading the head of the buffer of a copy.

[4] Because mkfs updates the primary superblock during its initial
process, advancing only the write pointer of the primary log buffer.

## Superblock writing on an emulated zoned device

When a regular device is mounted in zoned mode, btrfs emulates conventional
zones by slicing the device into fixed-size zones. In this case, however,
we do not follow the above rule of writing superblocks at the head of the
logging zones if they are conventional. Doing so would introduce a
chicken-and-egg problem. To know that a given btrfs is zoned, we need to
read a superblock to see the incompatible flags. But to read a superblock
properly from a zoned position, we need to know a priori that the
file-system is zoned (e.g. that it resides on a zoned device), leading to a
recursive dependency.

We can use the regular superblock update method on an emulated zoned
device to break the recursion. Since the zones containing the regular
locations are always reserved, it is safe to do so. Then, we can naturally
read a regular superblock on a regular device and determine whether the
file-system is zoned.

Naohiro Aota (1):
  btrfs: zoned: move superblock logging zone location

 fs/btrfs/zoned.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

-- 
2.30.2
