[ANNOUNCE] Reiser4 Logical Volumes. Mirrors and Failover

From: Edward Shishkin @ 2016-09-24 22:47 UTC
  To: ReiserFS development mailing list

                       Logical Volumes

Reiser4 will support logical (compound) volumes. For now we have
implemented the simplest ones - mirrors. As a supplement to the
existing checksums, they provide failover - an important feature,
which reduces the number of cases when your volume needs to be
repaired by fsck.

A Reiser4 subvolume is a component of a logical volume. A subvolume is
always associated with a physical or logical (built by means of RAID,
LVM, etc.) block device. Every subvolume possesses:

. volume ID;
. subvolume ID;
. mirror ID;
. number of replicas.

The mirror ID is a serial number from 0 to 65535. The subvolume with
mirror ID 0 has a special name - the original. The other ones are
called replicas. We say "original A has a replica B" (or "B replicates
A", which is the same) iff A and B possess the same subvolume ID.
The original together with all its replicas are called "mirrors".
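
The terminology above can be sketched as a small data model. This is
only an illustration: the field names and types are my assumptions,
not the on-disk layout, and the extra check on the volume ID is an
assumption as well (the text above only requires equal subvolume IDs).

```python
from dataclasses import dataclass

MAX_MIRROR_ID = 65535  # mirror IDs run from 0 to 65535


@dataclass(frozen=True)
class Subvolume:
    # Hypothetical field names; they only mirror the four attributes
    # listed in the announcement.
    volume_id: str
    subvolume_id: int
    mirror_id: int
    num_replicas: int

    def __post_init__(self):
        if not 0 <= self.mirror_id <= MAX_MIRROR_ID:
            raise ValueError("mirror ID out of range")

    @property
    def is_original(self) -> bool:
        # The subvolume with mirror ID 0 is the original.
        return self.mirror_id == 0


def replicates(b: Subvolume, a: Subvolume) -> bool:
    """True iff B replicates A (A is an original, B is one of its replicas)."""
    return (
        a.is_original
        and not b.is_original
        and a.volume_id == b.volume_id        # assumption, see lead-in
        and a.subvolume_id == b.subvolume_id  # the condition from the text
    )
```

With a = Subvolume("vol", 0, 0, 1) and b = Subvolume("vol", 0, 1, 1),
replicates(b, a) holds, while replicates(a, b) does not.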

For subvolumes we have introduced a special disk format plugin
"format41". In accordance with the Reiser4 development model, it means
forward incompatibility. We have introduced it intentionally, for
protection: for clear reasons, users must not be able to RW-mount
separate replicas (without originals).
The multi-device extension is backward compatible: all volumes of the
old format (format40) are supported as logical volumes composed of
only one (original) subvolume.

            Registration and activation of subvolumes

For now every Reiser4 logical volume has only one original subvolume.
The number of replicas can be 0 or more. A logical volume can be
mounted with the usual mount command. Simply specify any of its
subvolumes (the original, or one of its replicas). The only condition
is that the original and all its replicas must be registered in the
system. If the original or any of its replicas is not registered, then
mount will fail with a respective kernel message.

Currently there is no tool to register a specified subvolume (TBD).
However, the mount command always tries to register the specified
device. The registration policy is "sticky": your device won't be
unregistered after umount, or after a failed mount. (You will be able
to unregister it forcibly with a special tool - TBD).

The registration procedure reads the master super-block of the
subvolume and puts the subvolume header on a special list of
registered subvolumes.

Mounting a logical volume activates all its registered components.
The activation procedure reads the format super-block of the subvolume
and performs other actions such as initialization of space maps,
transaction replay, etc., as specified by the ->init_format() method
of the respective disk format plugin. A pointer to an activated
subvolume is placed in a special table of active subvolumes.
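
A toy model of the sticky registration and of the mount-time check may
help here. It is a sketch under assumptions, not the kernel code: the
DISKS table stands in for the master super-blocks, and the names
(register, mount, umount, MountError) and the "all mirrors present"
test are hypothetical.

```python
class MountError(Exception):
    """Raised when some mirror of the volume is not registered yet."""


# Hypothetical stand-in for master super-blocks: device -> header fields.
DISKS = {
    "/dev/sda7": {"subvol_id": 0, "mirror_id": 0, "num_replicas": 1},
    "/dev/sda8": {"subvol_id": 0, "mirror_id": 1, "num_replicas": 1},
}

registered = {}  # list of registered subvolumes (device -> header)
active = {}      # table of active subvolumes


def register(dev):
    # Registration only reads the master super-block and keeps the header.
    registered[dev] = DISKS[dev]


def mount(dev):
    register(dev)  # sticky: stays registered even if the mount fails
    hdr = registered[dev]
    mirrors = {d: h for d, h in registered.items()
               if h["subvol_id"] == hdr["subvol_id"]}
    ids = {h["mirror_id"] for h in mirrors.values()}
    # Assumed check: the original plus num_replicas replicas are present.
    if 0 not in ids or len(ids) != hdr["num_replicas"] + 1:
        raise MountError(f"{dev} requires mirrors which are not registered")
    # Activation: the real code reads the format super-block, initializes
    # space maps, replays transactions, etc. Here we only fill the table.
    active.update(mirrors)


def umount():
    active.clear()  # registration is deliberately left untouched
```

In this model mount("/dev/sda7") raises MountError but leaves
/dev/sda7 registered, so a following mount("/dev/sda8") succeeds and
activates both subvolumes.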

                        Mirror operations

So the original and its replicas actually represent RAID1 on the
filesystem level.

COMMENT. We aren't engaged in the marketing fraud of collecting all
the features of the block layer's RAID and LVM. Reiser4 mirrors
implement a failover that the block layer's RAID1 is not able to
provide.

It will be possible to "upgrade" or "downgrade" a reiser4 array of
mirrors by attaching / detaching one or more replicas online with
special user-space tools (mirror.reiser4, TBD). With those tools it
will also be possible to swap the original with any of its replicas,
or to make a new original from any replica if the old one is lost for
some reason.

Fsck will refuse to check/repair a replica. Fsck is supposed to work
only with original subvolumes. After mounting an fsck-ed original, the
kernel will automatically run a special online background procedure
(scrub) in order to synchronize the repaired original with all its
replicas.

Once in a while the user has to check the array of mirrors by running
scrub in background mode.

WARNING: Bear in mind once and forever: Replica is not a backup!!!

                        Technical Notes

1. The Reiser4 Transaction Design document carries over to logical
volumes without any modifications, but with a small addition: an atom
is now composed of per-subvolume components.

2. By design, all mirrors differ only in their mirror IDs, which are
stored in the master super-block. The format super-blocks of mirrors
are identical. This approach provides the best performance and full
parallelism in issuing IO requests to mirrors. The downside is a small
compromise in design: the master super-block doesn't participate in
transactions. It means that mirror operations (upgrading, downgrading,
swapping) cannot spawn the usual transactions, which can be committed
and (re)played using the existing transaction manager. That is, mirror
operations won't survive a system crash. If a system crash happens
during a mirror operation, then the mirror structure should be
checked/fixed offline by the mirror tools (the kernel will refuse to
mount an unchecked array of mirrors). Fortunately, all critical mirror
operations issue a small number of IO requests, so the probability of
their interruption is close to zero.

3. We don't commit transactions on all mirrors, only on the original
subvolume (this is the only functional difference between the original
and its replicas). Transaction (re)play, of course, is performed on
all mirrors using the wandering maps/blocks of the original subvolume.

                    How to test the new features

Check out the "format41" branch of the upstream reiser4 and
reiser4progs git repos at https://github.com/edward6 and build and
install as usual.

Mirrors can be created with the mkfs.reiser4 option -m. If this option
is specified, then the first listed device will be the original, and
the other ones will be replicas. All devices of an array should have
the same size. We'll lift that restriction later.

IMPORTANT: when creating mirrors, specify the node41 plugin (with
checksum support). Otherwise, your mirrors won't be more useful than
the block layer's RAID1.

Register all your mirrors by trying to "mount" them one by one in any
order. If you have N mirrors (i.e. one original and N-1 replicas),
then the first N-1 mount commands will fail. Of course, it is not too
graceful, but this is a temporary solution. The N-th "attempt" should
succeed. Have fun. Unmount as usual.

                            Example

Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal
size. Let's create an array of 2 mirrors:

# mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8

Take a look at original subvolume:

# debugfs.reiser4 /dev/sda7

Take a look at replica:

# debugfs.reiser4 /dev/sda8

Find differences ;)

Register the original subvolume

# mount /dev/sda7 /mnt
mount: wrong fs type, bad option, bad superblock blablabla....
# dmesg
reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]:
WARNING: /dev/sda7 requires replicas, which are not registered.

Register the replica and mount the array:

# mount /dev/sda8 /mnt
# dmesg

reiser4: registered subvolume (/dev/sda8)
reiser4 (sda8): found disk format 4.0.1.
reiser4 (/dev/sda7): using Hybrid Transaction Model.

Let's copy a file /etc/services to our array of mirrors:

# cp /etc/services /mnt/.

Unmount the array:

# umount /mnt

Find the root block: it comes first in the tree dump:

# debugfs.reiser4 -t /dev/sda7

In our case the root block has block number 79.

Let's now take a look at how our failover works. The death-defying
act: we erase the root block of the original subvolume:

# dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79
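
For clarity, the arithmetic behind that dd command: with bs=4096 and
seek=79 it writes one 4096-byte block of zeros at byte offset
79 * 4096 = 323584. Below is a minimal sketch of the same operation
that works on a throwaway image file instead of a real device. The
helper names zero_block and demo are hypothetical.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # bs= from the dd command above
ROOT_BLOCK = 79    # seek= from the dd command above


def zero_block(path, blocknr, block_size=BLOCK_SIZE):
    """Overwrite one block of `path` with zeros; return the byte offset."""
    offset = blocknr * block_size
    with open(path, "r+b") as f:  # r+b: update in place, no truncation
        f.seek(offset)
        f.write(bytes(block_size))
    return offset


def demo():
    # Work on a temporary 100-block image, never on a real device.
    with tempfile.NamedTemporaryFile(delete=False) as img:
        img.write(b"\xff" * (BLOCK_SIZE * 100))
        path = img.name
    try:
        offset = zero_block(path, ROOT_BLOCK)
        with open(path, "rb") as f:
            data = f.read()
    finally:
        os.unlink(path)
    return offset, data
```

demo() returns the offset 323584 together with the image contents, in
which only block 79 is zeroed.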

We know that the mount procedure loads the root block. Let's try to
mount our array with the corrupted root block:

# mount /dev/sda8 /mnt

Everything works.
Take a look at kernel messages:

# dmesg
reiser4[mount(21224)]: parse_node41 (fs/reiser4/plugin/node/node41.c:79)[edward-1645]:
WARNING: block 79 (/dev/sda7): bad checksum. Please, scrub the volume.
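
The behaviour seen above (a bad checksum on the original, the data
served anyway) can be modelled roughly as follows. This is a sketch
under assumptions: zlib's CRC32 and the "checksum in the first 4
bytes" layout are stand-ins, since the actual node41 checksum
algorithm and layout are not described in this announcement, and
read_node is a hypothetical name.

```python
import zlib

BLOCK_SIZE = 4096


def make_node(payload: bytes) -> bytes:
    """Build a block: 4-byte CRC32 of the body, then the zero-padded body."""
    body = payload.ljust(BLOCK_SIZE - 4, b"\0")
    return zlib.crc32(body).to_bytes(4, "little") + body


def checksum_ok(block: bytes) -> bool:
    """Compare the stored checksum with the one computed over the body."""
    return int.from_bytes(block[:4], "little") == zlib.crc32(block[4:])


def read_node(blocknr, mirrors, log):
    """Try the original first, then each replica; return the first good copy.

    `mirrors` is a list of (device name, {block number: block bytes}).
    """
    for dev, blocks in mirrors:
        block = blocks[blocknr]
        if checksum_ok(block):
            return block
        # Illustrative message, not the kernel's exact wording.
        log.append(f"block {blocknr} ({dev}): bad checksum")
    raise IOError(f"block {blocknr}: no valid copy on any mirror")
```

If block 79 of the original is zeroed while the replica still holds a
valid copy, read_node returns the replica's copy and records one
warning for the original. Without checksums there would be no way to
tell which copy is the good one, which is why node41 matters here.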

                              TODO

1) Mirror tools (upgrade/downgrade a mirror array, swap original and
     specified replica, convert replica to an original, visualization
     of mirror arrays, etc);
2) Scrub (online background checking and synchronization of mirrors);
3) Checksumming format super-block;
4) Issuing discard requests for replicas on SSD devices.

All items are very simple to implement. If anyone cares, then I'll
provide details.

Thanks,
Edward.
