linux-fsdevel.vger.kernel.org archive mirror
From: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: RE: [LSF/MM TOPIC] online filesystem repair
Date: Fri, 27 Jan 2017 22:06:32 +0000	[thread overview]
Message-ID: <SN2PR04MB2191780E898A5439D7F1585B88760@SN2PR04MB2191.namprd04.prod.outlook.com> (raw)
In-Reply-To: <20170125084133.GC5726@birch.djwong.org>


 
> > Let's imagine that file system will register every metadata structure 
> > in generic online file checking subsystem. Then the file system will
>
> That sounds pretty harsh.  XFS (and ext4) hide quite a /lot/ of metadata. 
> We don't expose the superblocks, the free space header, the inode header,
> the free space btrees, the inode btrees, the reverse mapping btrees,
> the refcount btrees, the journal, or the rtdev space data.  I don't think
> we ought to expose any of that except to xfsprogs.
> For another thing, there are dependencies between those pieces of metadata,
> (e.g. the AGI has to work before we can check the inobt) and one has to take
> those into account when scrubbing.
>
> ext4 has a different set of internal metadata, but the same applies there too.

I didn't suggest exposing the metadata in the pure sense of the word. The key point of this
discussion is elaborating a vision of how generic online file system checking/recovery
can be done. It means the VFS has to represent a file system as some generic set of
items (for example, as a sequence of iterators). The VFS is the layer of generalized
management of any file system, and it interacts with concrete file systems by means of
specialized callbacks (file_operations, inode_operations, and so on) that provide the
opportunity to implement a special way of file system volume management. So, as far as
I can see, the online file system check/recovery subsystem has to look the same way: it works
in a generalized manner, while specialized callbacks realize the specialized elementary operations,
and the concrete file system driver provides the specialization of those methods.

The really important point is the possible mode(s) of the online file system
check/recovery subsystem. I see two principal cases: (1) post-corruption check/recovery;
(2) preventive check.

We could consider the mount and/or unmount operations as the main points of the online
file system check/recovery subsystem's activity for the post-corruption case. In this case
struct super_operations could contain check_method() and recovery_method() that realize
all the specialized logic of checking/recovering a file system volume. All of a file system's
peculiarities in metadata layout and in checking/recovery algorithms would be hidden in these
specialized methods. So, the online check/recovery subsystem could be used for the
post-corruption case at these points:
(1) mount operation -> this is usually where we discover file system corruption;
(2) remount in RO mode -> after some internal error in the file system driver;
(3) a special set of file system errors that trigger check/recovery subsystem activity;
(4) unmount operation -> check the file system volume's consistency at the end of unmount.

It is also possible to consider checking a file system volume's state, or the state of
some metadata structure, while the volume is mounted. But, as far as I can see, we would need
to introduce new syscalls or special ioctl commands for that case, and I am not sure
such requests will be easy to support.

Another possible mode could be preventive checking of a file system volume's state
before a flush operation. In this case, the VFS would treat a file system volume as an abstract
sequence of metadata structures and call specialized methods (registered by the file system
driver) in a generic way. Imagine that the VFS had a generic method for preventively checking
flush operations. Every metadata structure is, anyway, split into nodes, logical blocks, and
so on. Usually such a node contains a header, and the file system driver is able to check the
node's consistency before the flush operation. Of course, such a check can degrade flush
performance, but it could be the user's decision whether or not to use the preventive mode.
Also, we cannot check the relations between different nodes this way; the complete check
can be done in the post-corruption check/recovery mode.

> > need to register some set of checking methods or checking events for 
> > every registered metadata structure. For example:
> >
> > (1) check_access_metadata();
> > (2) check_metadata_modification();
> > (3) check_metadata_node();
> > (4) check_metadata_node_flush();
> > (5) check_metadata_nodes_relation().
> 
> How does the VFS know to invoke these methods on a piece of internal metadata that
> the FS owns and updates at its pleasure?  The only place we encode all the relationships
> between pieces of metadata is in the fs driver itself, and that's where scrubbing needs
> to take place.  The VFS only serves to multiplex the subset of operations that are common
> across all filesystems; everything else take the form of (semi-private) ioctls.

First of all, it's clear that we discover a file system volume's corrupted state during the
mount operation, so the VFS can easily invoke a generic check/recovery method at mount
time. Also, the file system volume is unavailable for any file system operations until the
mount operation finishes, so the file system driver can do anything it wants with the
volume's metadata at that point, at its complete pleasure. The unmount operation can be
used just as easily, in a generic manner, for the same purpose.

Secondly, the VFS represents a file system's hierarchy by means of inodes, dentries, the page
cache, and so on. That is the abstraction the VFS uses in memory on the OS side, but the file
system volume may not contain such items at all. For example, HFS+ has no inodes or dentries
in the file system volume; it uses btrees that contain file records and folder records,
and the HFS+ representation of metadata has to be converted into the VFS's internal
representation during any operation that retrieves or stores metadata in the volume.
When I talk about check/recovery methods (check_metadata_node(), for example),
I mean that we could have some abstract representation of any metadata
in the file system volume; it could be a simple sequence of metadata nodes, for example.
And if a sync operation has been requested, then the whole metadata
structure should be consistent for the flush operation. So, when the VFS executes
sync_fs(), it could walk all the abstract sequences of
metadata nodes and apply the check_metadata_node() callbacks, which are
executed by the concrete file system driver. So, the real check/recovery work would be
done by the fs driver itself, but in a generic manner.

> > I think that it is possible to consider several possible level of 
> > generic online file system checking subsystem's activity: (1) light 
> > check mode; (2) regular check mode; (3) strict check mode.
> >
> > The "light check mode" can  be resulted in "fast" metadata nodes'
> > check on write operation with generation of error messages in the 
> > syslog with the request to check/recover file system volume by means 
> > of fsck tool.
> > 
> > The "regular check mode" can be resulted in: (1) the checking of any 
> > metadata modification with trying to correct the operation in the 
> > modification place; (2) metadata nodes' check on write operation with 
> > generation of error messages in the syslog.
> >
> > The "strict check mode" can be resulted in: (1) check mount operation 
> > with trying to recover the affected metadata structures; (2) the 
> > checking of any metadata modification with trying to correct the 
> > operation in the modification place; (3) check and recover metadata 
> > nodes on flush operation; (4) check/recover during unmount operation.
>
> I'm a little unclear about where you're going with all three of these things;
> the XFS metadata verifiers already do limited spot-checking of all metadata
> reads and writes without the VFS being directly involved.
> The ioctl performs more intense checking and cross-checking of metadata
> that would be too expensive to do on every access.

We are trying to talk about more than just XFS. If we are discussing a generic online
check/recovery subsystem, then it has to be good for all other file systems too.
Again, if you believe that all check/recovery activity should be hidden from the
VFS, then it's not clear to me why you raised this topic.

XFS and other file systems already do some metadata verification behind the VFS's back.
Excellent... It sounds to me like we simply need to generalize
this activity at the VFS level. At a minimum, we could consider the mount/unmount
operations for the online check/recovery subsystem. The VFS could also
be involved in some preventive metadata checking on a generalized basis.

When I talk about different checking modes, I mean that a user should
have the opportunity to select among different modes of the online check/recovery
subsystem with different overheads. It doesn't have to be the set of modes
I mentioned, but different users have different priorities: some users need
performance, others need reliability. Right now we cannot manage what a
file system does with metadata checking in the background, but a user would be
able to opt for a proper policy of online check/recovery activity if the VFS
supported a generalized way of metadata checking with different modes.

> > What do you like to expose to VFS level as generalized methods for 
> > your implementation?
> 
> Nothing.  A theoretical ext4 interface could look similar to XFS's,
> but the metadata-type codes would be different.  btrfs seems so much
> different structurally there's little point in trying.

So, why did you raise this topic? If nothing, then there is no topic. :)

> I also looked at ocfs2's online filecheck.  It's pretty clear they had
> different goals and ended up with a much different interface.

If we would like to talk about a generic, VFS-based online check/recovery
subsystem, then we need to find some common points, and I think that's possible.
Do you mean that you don't see a way to generalize? What's the point
of this discussion in that case?

Thanks,
Vyacheslav Dubeyko.
 

      reply	other threads:[~2017-01-27 22:06 UTC|newest]

Thread overview: 7+ messages
2017-01-14  7:54 [LSF/MM TOPIC] online filesystem repair Darrick J. Wong
2017-01-16  0:01 ` Viacheslav Dubeyko
2017-01-17  6:24   ` Darrick J. Wong
2017-01-17 20:45     ` Andreas Dilger
2017-01-18  0:37     ` Slava Dubeyko
2017-01-25  8:41       ` Darrick J. Wong
2017-01-27 22:06         ` Slava Dubeyko [this message]
