* [LSF/MM TOPIC] online filesystem repair
@ 2017-01-14  7:54 Darrick J. Wong
  2017-01-16  0:01 ` Viacheslav Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2017-01-14  7:54 UTC (permalink / raw)
  To: lsf-pc, darrick.wong; +Cc: linux-fsdevel, linux-xfs

Hi,

I've been working on implementing online metadata scrubbing and repair
in XFS.  Most of the code is self contained inside XFS, but there's a
small amount of interaction with the VFS freezer code that has to happen
in order to shut down the filesystem to rebuild the extent backref
records.  It might be interesting to discuss the (fairly slight)
requirements upon the VFS to support repairs, and/or have a BoF to
discuss how to build an online checker if any of the other filesystems
are interested in this.

Concurrent with development of online scrubbing, I've also been working
on a fuzz test suite for xfstests that fuzzes every field of every
metadata object on the filesystem and then tries to crash the kernel,
the offline repair tool (xfs_repair), or the online repair tool
(xfs_scrub).  I could talk about that as a follow-up to last year's
AFL presentation, and about the kinds of bugs it's uncovered.

--D


* Re: [LSF/MM TOPIC] online filesystem repair
  2017-01-14  7:54 [LSF/MM TOPIC] online filesystem repair Darrick J. Wong
@ 2017-01-16  0:01 ` Viacheslav Dubeyko
  2017-01-17  6:24     ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Viacheslav Dubeyko @ 2017-01-16  0:01 UTC (permalink / raw)
  To: Darrick J. Wong, lsf-pc; +Cc: linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko

On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote:
> Hi,
> 
> I've been working on implementing online metadata scrubbing and
> repair
> in XFS.  Most of the code is self contained inside XFS, but there's a
> small amount of interaction with the VFS freezer code that has to
> happen
> in order to shut down the filesystem to rebuild the extent backref
> records.  It might be interesting to discuss the (fairly slight)
> requirements upon the VFS to support repairs, and/or have a BoF to
> discuss how to build an online checker if any of the other
> filesystems
> are interested in this.
> 

How do you imagine a generic way to support repairs for different file
systems? On the one hand, a generic online file system repair facility
could be a really great subsystem. On the other hand, every file system
has its own architecture, its own set of metadata, and its own way of
doing fsck checking/recovery. As far as I can judge, there has been a
significant amount of research effort in this direction (Recon [1],
[2], for example), but we still don't have any real general online file
system repair subsystem in the Linux kernel. Do you have some new
insight? How does your vision differ? And if we had an online file
system repair subsystem, how would a file system driver need to
interact with it to do its internal repairs?

Thanks,
Vyacheslav Dubeyko.

[1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf
[2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel



* Re: [LSF/MM TOPIC] online filesystem repair
  2017-01-16  0:01 ` Viacheslav Dubeyko
@ 2017-01-17  6:24     ` Darrick J. Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2017-01-17  6:24 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: lsf-pc, linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko

On Sun, Jan 15, 2017 at 04:01:30PM -0800, Viacheslav Dubeyko wrote:
> On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote:
> > Hi,
> > 
> > I've been working on implementing online metadata scrubbing and
> > repair
> > in XFS.  Most of the code is self contained inside XFS, but there's a
> > small amount of interaction with the VFS freezer code that has to
> > happen
> > in order to shut down the filesystem to rebuild the extent backref
> > records.  It might be interesting to discuss the (fairly slight)
> > requirements upon the VFS to support repairs, and/or have a BoF to
> > discuss how to build an online checker if any of the other
> > filesystems
> > are interested in this.
> > 
> 
> How do you imagine a generic way to support repairs for different file
> systems? From one point of view, to have generic way of the online file
> system repairing could be the really great subsystem.

I don't, sadly.  There's not even a way to /check/ all fs metadata in a
"generic" manner -- we can use the standard VFS interfaces to read
all metadata, but this is fraught.  Even if we assume the fs can
spot-check obviously garbage values, that's still not the appropriate
place for a full scan.

> But, from another point of view, every file system has own
> architecture, own set of metadata and own way to do fsck
> check/recovering.

Yes, and this wouldn't change.  The particular mechanism for fixing a
piece of metadata will always be fs-dependent, but what I'm interested
in discussing is how we avoid having these kinds of things interact
badly with the VFS.

> As far as I can judge, there are significant amount of research
> efforts in this direction (Recon [1], [2], for example).

Yes, I remember Recon.  I appreciated the insight that while it's
impossible to block everything for a full scan, it /is/ possible to
check a single object and its relation to other metadata items.  The xfs
scrubber also takes an incremental approach to verifying a filesystem;
we'll lock each metadata object and verify that its relationships with
the other metadata make sense.  That works well so long as we aren't
bombarding the fs with heavy metadata update workloads, of course.
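
Roughly, each step of that incremental scan looks like the sketch
below; every helper name is invented for illustration (the real code
lives in the branches linked at the end of this mail):

/* Sketch of one incremental scrub step; all names are illustrative. */
int
scrub_one_object(
	struct scrub_context	*sc)
{
	int			error;

	lock_metadata_object(sc);		/* pin this one object */
	error = check_object_internals(sc);	/* are its fields sane? */
	if (!error)
		error = crosscheck_neighbors(sc); /* do its relations hold? */
	unlock_metadata_object(sc);
	return error;
}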

On the repair side of things xfs added reverse-mapping records, which
the repair code uses to regenerate damaged primary metadata.  After we
land inode parent pointers we'll be able to do the same reconstructions
that we can now do for block allocations...

...but there are some sticky problems with repairing the reverse
mappings.  The normal locking order for that part of xfs is sb_writers
-> inode -> ag header -> rmap btree blocks, but to repair we have to
freeze the filesystem against writes so that we can scan all the inodes.
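
A rough sketch of that freeze-and-rebuild sequence, with error
handling trimmed; xfs_repair_rmapbt() is a made-up stand-in for the
actual rebuild code:

/* Sketch only: block all writers, rebuild rmaps from a full inode scan. */
STATIC int
xfs_repair_rmap_frozen(
	struct xfs_mount	*mp)
{
	int			error;

	error = freeze_super(mp->m_super);	/* blocks all writers */
	if (error)
		return error;

	error = xfs_repair_rmapbt(mp);		/* scan every inode */

	thaw_super(mp->m_super);
	return error;
}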

> But we still haven't any real general online file system repair
> subsystem in the Linux kernel.

I think the ocfs2 developers have added some ability to repair
metadata over the past year, though it seems limited to fixing some
parts of inodes.  btrfs stores duplicate copies and restores when
necessary, I think.  Unfortunately, fixing disk corruption is something
that's not easily genericized, which means that I don't think we'll ever
achieve a general subsystem.

But we could at least figure out what in the VFS has to change (if
anything) to support this type of usage.

> Do you have some new insight? What's difference of your
> vision? If we have online file system repair subsystem then how file
> system driver will need to interact with the goal to make internal
> repairing?

It's pretty much all private xfs userspace ioctls[1] with a driver
program[2].

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel

> 
> Thanks,
> Vyacheslav Dubeyko.
> 
> [1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf
> [2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel
> 

* Re: [LSF/MM TOPIC] online filesystem repair
  2017-01-17  6:24     ` Darrick J. Wong
@ 2017-01-17 20:45     ` Andreas Dilger
  -1 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2017-01-17 20:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko

On Jan 16, 2017, at 11:24 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Sun, Jan 15, 2017 at 04:01:30PM -0800, Viacheslav Dubeyko wrote:
>> On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote:
>>> Hi,
>>> 
>>> I've been working on implementing online metadata scrubbing and
>>> repair
>>> in XFS.  Most of the code is self contained inside XFS, but there's a
>>> small amount of interaction with the VFS freezer code that has to
>>> happen
>>> in order to shut down the filesystem to rebuild the extent backref
>>> records.  It might be interesting to discuss the (fairly slight)
>>> requirements upon the VFS to support repairs, and/or have a BoF to
>>> discuss how to build an online checker if any of the other
>>> filesystems
>>> are interested in this.
>>> 
>> 
>> How do you imagine a generic way to support repairs for different file
>> systems? From one point of view, to have generic way of the online file
>> system repairing could be the really great subsystem.
> 
> I don't, sadly.  There's not even a way to /check/ all fs metadata in a
> "generic" manner -- we can use the standard VFS interfaces to read
> all metadata, but this is fraught.  Even if we assume the fs can spot
> check obviously garbage values, that's still not the appropriate place
> for a full scan.
> 
>> But, from another point of view, every file system has own
>> architecture, own set of metadata and own way to do fsck
>> check/recovering.
> 
> Yes, and this wouldn't change.  The particular mechanism of fixing a
> piece of metadata will always be fs-dependent, but the thing that I'm
> interested in discussing is how do we avoid having these kinds of things
> interact badly with the VFS?
> 
>> As far as I can judge, there are significant amount of research
>> efforts in this direction (Recon [1], [2], for example).
> 
> Yes, I remember Recon.  I appreciated the insight that while it's
> impossible to block everything for a full scan, it /is/ possible to
> check a single object and its relation to other metadata items.  The xfs
> scrubber also takes an incremental approach to verifying a filesystem;
> we'll lock each metadata object and verify that its relationships with
> the other metadata make sense.  So long as we aren't bombarding the fs
> with heavy metadata update workloads, of course.

It is worthwhile to note that Lustre has a distributed online filesystem
checker (LFSCK) that works in a similar incremental manner, checking the
status of each object w.r.t. other objects it is related to.  This can
be done reasonably well because there is extra Lustre metadata that has
backpointers from data objects to inodes and from inodes to the parent
directory (including hard links).

That said, we depend on the local filesystem to be internally consistent,
and LFSCK is only verifying/repairing Lustre-specific metadata that
describes cross-server object relationships.

Cheers, Andreas

> On the repair side of things xfs added reverse-mapping records, which
> the repair code uses to regenerate damaged primary metadata.  After we
> land inode parent pointers we'll be able to do the same reconstructions
> that we can now do for block allocations...
> 
> ...but there are some sticky problems with repairing the reverse
> mappings.  The normal locking order for that part of xfs is sb_writers
> -> inode -> ag header -> rmap btree blocks, but to repair we have to
> freeze the filesystem against writes so that we can scan all the inodes.
> 
>> But we still haven't any real general online file system repair
>> subsystem in the Linux kernel.
> 
> I think the ocfs2 developers have encoded some ability to repair
> metadata over the past year, though it seems limited to fixing some
> parts of inodes.  btrfs stores duplicate copies and restores when
> necessary, I think.  Unfortunately, fixing disk corruption is something
> that's not easily genericized, which means that I don't think we'll ever
> achieve a general subsystem.
> 
> But we could at least figure out what in the VFS has to change (if
> anything) to support this type of usage.
> 
>> Do you have some new insight? What's difference of your
>> vision? If we have online file system repair subsystem then how file
>> system driver will need to interact with the goal to make internal
>> repairing?
> 
> It's pretty much all private xfs userspace ioctls[1] with a driver
> program[2].
> 
> --D
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
> [2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
> 
>> 
>> Thanks,
>> Vyacheslav Dubeyko.
>> 
>> [1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf
>> [2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel
>> 




* RE: [LSF/MM TOPIC] online filesystem repair
  2017-01-17  6:24     ` Darrick J. Wong
@ 2017-01-18  0:37     ` Slava Dubeyko
  2017-01-25  8:41       ` Darrick J. Wong
  -1 siblings, 1 reply; 8+ messages in thread
From: Slava Dubeyko @ 2017-01-18  0:37 UTC (permalink / raw)
  To: Darrick J. Wong, Viacheslav Dubeyko; +Cc: lsf-pc, linux-fsdevel, linux-xfs


-----Original Message-----
From: Darrick J. Wong [mailto:darrick.wong@oracle.com] 
Sent: Monday, January 16, 2017 10:25 PM
To: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Subject: Re: [LSF/MM TOPIC] online filesystem repair
 
> > How do you imagine a generic way to support repairs for different file 
> > systems? From one point of view, to have generic way of the online 
> > file system repairing could be the really great subsystem.
>
> I don't, sadly.  There's not even a way to /check/ all fs metadata
> in a "generic" manner -- we can use the standard VFS interfaces to read
> all metadata, but this is fraught.  Even if we assume the fs can spot check obviously
> garbage values, that's still not the appropriate place for a full scan.

Let's try to imagine a possible way of generalization. I can see these critical points:
(1) mount operation;
(2) unmount/fsync operation;
(3) readpage;
(4) writepage;
(5) read metadata block/node;
(6) write/flush metadata block/node;
(7) metadata item modification/access.

Let's imagine that a file system registers every metadata structure with a generic
online file-checking subsystem. Then the file system would need to register a set
of checking methods or checking events for every registered metadata structure,
for example (see the sketch below):
(1) check_access_metadata();
(2) check_metadata_modification();
(3) check_metadata_node();
(4) check_metadata_node_flush();
(5) check_metadata_nodes_relation().
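
A minimal sketch of what such a registration might look like; every name
below is invented for illustration, and nothing like it exists in the
kernel today:

/* Hypothetical per-structure checking methods, registered with the VFS. */
struct metadata_check_ops {
	/* validate a read/lookup of an item in this structure */
	int (*check_access_metadata)(struct super_block *sb, void *item);
	/* validate an in-memory modification before it is committed */
	int (*check_metadata_modification)(struct super_block *sb, void *item);
	/* validate one node of the structure */
	int (*check_metadata_node)(struct super_block *sb, void *node);
	/* validate a node as it is being flushed to disk */
	int (*check_metadata_node_flush)(struct super_block *sb, void *node);
	/* cross-check the relations between two nodes */
	int (*check_metadata_nodes_relation)(struct super_block *sb,
					     void *node_a, void *node_b);
};

int register_metadata_structure(struct super_block *sb, const char *name,
				const struct metadata_check_ops *ops);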

I think it is possible to consider several levels of activity for a generic online
file system checking subsystem: (1) light check mode;
(2) regular check mode; (3) strict check mode.

The "light check mode" can  be resulted in "fast" metadata nodes' check on write operation with
generation of error messages in the syslog with the request to check/recover
file system volume by means of fsck tool.

The "regular check mode" can be resulted in: (1) the checking of any metadata modification
with trying to correct the operation in the modification place; (2) metadata nodes' check
on write operation with generation of error messages in the syslog. 

The "strict check mode" can be resulted in: (1) check mount operation with trying to recover
the affected metadata structures; (2) the checking of any metadata modification
with trying to correct the operation in the modification place; (3) check and recover
metadata nodes on flush operation; (4) check/recover during unmount operation.
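
As a sketch, the user-visible knob could be as simple as this
hypothetical enum (names invented):

/* Hypothetical per-mount setting for the three modes described above. */
enum online_check_mode {
	ONLINE_CHECK_LIGHT,	/* spot-check nodes on write, complain in syslog */
	ONLINE_CHECK_REGULAR,	/* also try to correct at the modification site */
	ONLINE_CHECK_STRICT,	/* also check/recover at mount, flush and unmount */
};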

What would you like to expose at the VFS level as generalized methods for your implementation?

> > But, from another point of view, every file system has own 
> > architecture, own set of metadata and own way to do fsck 
> > check/recovering.
>
> Yes, and this wouldn't change.  The particular mechanism of fixing a piece of
> metadata will always be fs-dependent, but the thing that I'm interested in
> discussing is how do we avoid having these kinds of things interact badly with the VFS?

Let's start from the simplest case: you have the current implementation.
How do you see delegating some of its activity to the VFS in the form of
generalized methods? Let's imagine that the VFS had some callbacks into the
file system side. What could they be?

> > As far as I can judge, there are significant amount of research 
> > efforts in this direction (Recon [1], [2], for example).
>
> Yes, I remember Recon.  I appreciated the insight that while it's impossible
> to block everything for a full scan, it /is/ possible to check a single object and
> its relation to other metadata items.  The xfs scrubber also takes an incremental
> approach to verifying a filesystem; we'll lock each metadata object and verify that
> its relationships with the other metadata make sense.  So long as we aren't bombarding
> the fs with heavy metadata update workloads, of course.
>
> On the repair side of things xfs added reverse-mapping records, which the repair code
> uses to regenerate damaged primary metadata.  After we land inode parent pointers
> we'll be able to do the same reconstructions that we can now do for block allocations...
>
> ...but there are some sticky problems with repairing the reverse mappings.
> The normal locking order for that part of xfs is sb_writers
> -> inode -> ag header -> rmap btree blocks, but to repair we have to
> freeze the filesystem against writes so that we can scan all the inodes.

Yes, the necessary freezing of the file system is a really tricky point. On the one hand,
it is possible to use a "light checking mode" that simply checks and complains
about possible troubles at the proper time (maybe with a remount in RO mode).
On the other hand, we would need a special file system architecture or/and a special
way of VFS functioning. Let's imagine that a file system volume is split into some
groups/aggregations/objects with dedicated metadata. Then, theoretically, the VFS
would be able to freeze such a group/aggregation/object for checking and recovery
without affecting the availability of the whole file system volume. It means that file
system operations would be redirected into active (not frozen) groups/aggregations/objects.
 
Thanks,
Vyacheslav Dubeyko.



* Re: [LSF/MM TOPIC] online filesystem repair
  2017-01-18  0:37     ` Slava Dubeyko
@ 2017-01-25  8:41       ` Darrick J. Wong
  2017-01-27 22:06         ` Slava Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2017-01-25  8:41 UTC (permalink / raw)
  To: Slava Dubeyko; +Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs

On Wed, Jan 18, 2017 at 12:37:02AM +0000, Slava Dubeyko wrote:
> 
> -----Original Message-----
> From: Darrick J. Wong [mailto:darrick.wong@oracle.com] 
> Sent: Monday, January 16, 2017 10:25 PM
> To: Viacheslav Dubeyko <slava@dubeyko.com>
> Cc: lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> Subject: Re: [LSF/MM TOPIC] online filesystem repair
>  
> > > How do you imagine a generic way to support repairs for different file 
> > > systems? From one point of view, to have generic way of the online 
> > > file system repairing could be the really great subsystem.
> >
> > I don't, sadly.  There's not even a way to /check/ all fs metadata
> > in a "generic" manner -- we can use the standard VFS interfaces to read
> > all metadata, but this is fraught.  Even if we assume the fs can spot check obviously
> > garbage values, that's still not the appropriate place for a full scan.
> 
> Let's try to imagine a possible way of generalization. I can see such
> critical points:
> (1) mount operation;
> (2) unmount/fsync operation;
> (3) readpage;
> (4) writepage;
> (5) read metadata block/node;
> (6) write/flush metadata block/node.
> (7) metadata's item modification/access.
> 
> Let's imagine that file system will register every metadata structure
> in generic online file checking subsystem. Then the file system will

That sounds pretty harsh.  XFS (and ext4) hide quite a /lot/ of
metadata.  We don't expose the superblocks, the free space header, the
inode header, the free space btrees, the inode btrees, the reverse
mapping btrees, the refcount btrees, the journal, or the rtdev space
data.  I don't think we ought to expose any of that except to xfsprogs.
For another thing, there are dependencies between those pieces of
metadata (e.g. the AGI has to work before we can check the inobt), and
one has to take those into account when scrubbing.

ext4 has a different set of internal metadata, but the same applies
there too.

> need to register some set of checking methods or checking events for
> every registered metadata structure. For example:
>
> (1) check_access_metadata();
> (2) check_metadata_modification();
> (3) check_metadata_node();
> (4) check_metadata_node_flush();
> (5) check_metadata_nodes_relation().

How does the VFS know to invoke these methods on a piece of internal
metadata that the FS owns and updates at its pleasure?  The only place
we encode all the relationships between pieces of metadata is in the fs
driver itself, and that's where scrubbing needs to take place.  The VFS
only serves to multiplex the subset of operations that are common across
all filesystems; everything else takes the form of (semi-private) ioctls.
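
For illustration, such a per-object scrub ioctl has roughly this shape;
the field and ioctl names below approximate the in-development XFS
interface but are not a settled ABI:

/* Rough shape of a per-object metadata scrub request. */
struct scrub_metadata {
	__u32	sm_type;	/* in: which class of metadata to check */
	__u32	sm_flags;	/* in: e.g. "try repair"; out: corruption seen */
	__u64	sm_ino;		/* in: inode number, for per-inode checkers */
	__u32	sm_gen;		/* in: inode generation */
	__u32	sm_agno;	/* in: AG number, for per-AG checkers */
};

/* hypothetical: #define FS_IOC_SCRUB_METADATA \
		_IOWR('X', 60, struct scrub_metadata) */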

> I think that it is possible to consider several possible level of
> generic online file system checking subsystem's activity: (1) light
> check mode; (2) regular check mode; (3) strict check mode.
>
> The "light check mode" can  be resulted in "fast" metadata nodes'
> check on write operation with generation of error messages in the
> syslog with the request to check/recover file system volume by means
> of fsck tool.
> 
> The "regular check mode" can be resulted in: (1) the checking of any
> metadata modification with trying to correct the operation in the
> modification place; (2) metadata nodes' check on write operation with
> generation of error messages in the syslog. 
>
> The "strict check mode" can be resulted in: (1) check mount operation
> with trying to recover the affected metadata structures; (2) the
> checking of any metadata modification with trying to correct the
> operation in the modification place; (3) check and recover metadata
> nodes on flush operation; (4) check/recover during unmount operation.

I'm a little unclear about where you're going with all three of these
things; the XFS metadata verifiers already do limited spot-checking of
all metadata reads and writes without the VFS being directly involved.
The ioctl performs more intense checking and cross-checking of metadata
that would be too expensive to do on every access.

> What do you like to expose to VFS level as generalized methods for
> your implementation?

Nothing.  A theoretical ext4 interface could look similar to XFS's, but
the metadata-type codes would be different.  btrfs is structurally so
different that there's little point in trying.

I also looked at ocfs2's online filecheck.  It's pretty clear they had
different goals and ended up with a much different interface.

> > > But, from another point of view, every file system has own 
> > > architecture, own set of metadata and own way to do fsck 
> > > check/recovering.
> >
> > Yes, and this wouldn't change.  The particular mechanism of fixing a piece of
> > metadata will always be fs-dependent, but the thing that I'm interested in
> > discussing is how do we avoid having these kinds of things interact badly with the VFS?
> 
> Let's start from the simplest case. You have the current
> implementation.  How do you see the way to delegate to VFS some
> activity in your implementation in the form of generalized methods?
> Let's imagine that VFS will have some callbacks from file system side.
> What could it be?
> 
> > > As far as I can judge, there are significant amount of research 
> > > efforts in this direction (Recon [1], [2], for example).
> >
> > Yes, I remember Recon.  I appreciated the insight that while it's impossible
> > to block everything for a full scan, it /is/ possible to check a single object and
> > its relation to other metadata items.  The xfs scrubber also takes an incremental
> > approach to verifying a filesystem; we'll lock each metadata object and verify that
> > its relationships with the other metadata make sense.  So long as we aren't bombarding
> > the fs with heavy metadata update workloads, of course.
> >
> > On the repair side of things xfs added reverse-mapping records, which the repair code
> > uses to regenerate damaged primary metadata.  After we land inode parent pointers
> > we'll be able to do the same reconstructions that we can now do for block allocations...
> >
> > ...but there are some sticky problems with repairing the reverse mappings.
> > The normal locking order for that part of xfs is sb_writers
> > -> inode -> ag header -> rmap btree blocks, but to repair we have to
> > freeze the filesystem against writes so that we can scan all the inodes.
> 
> Yes, the necessary freezing of file system is really tricky point.
> From one point of view, it is possible to use "light checking mode"
> that will simply check and complain about possible troubles at proper
> time (maybe with remount in RO mode).

Yes, scrub does this fairly lightweight checking -- no freezing, no
remounting, etc.  If checking something would mean violating locking
rules (which would require a quiesced fs) then we simply hope that the
scan process eventually checks it via the normative locking paths.
For example, the rmapbt scrubber doesn't cross-reference inode extent
records with the inode block maps because we lock inode -> agf ->
rmapbt; it relies on the scrub program eventually locking the inode to
check the block map and then cross-referencing with the rmap data.

For repair (of only the rmap data) we have to be able to access
arbitrary files, so that requires a sync and then shutting down the
filesystem while we do it, so that nothing else can lock inodes.  I
don't know if this is 100% bulletproof; the fsync might take the fs down
before we even get to repairing.

Really what this comes down to is a discussion of how to suspend user
IOs temporarily and how to reinitialize the mm/vfs view of a part of the
world if the filesystem wants to do that.

> Otherwise, from another point of view, we need in special file system
> architecture or/and special way of VFS functioning. Let's imagine that
> file system volume will be split on some groups/aggregations/objects
> with dedicated metadata.  Then, theoretically, VFS is able to freeze
> such group/aggregation/object for check and recovering without
> affection the availability of the whole file system volume. It means
> that file system operations should be redirected into active (not
> frozen) groups/aggregations/objects.

One could in theory teach XFS how to shut down AGs, which would redirect
block/inode allocations elsewhere.  Freeing would be a mess though.  My
goal is to make scrub & repair fast enough that we don't need that.

--D

>  
> Thanks,
> Vyacheslav Dubeyko.
> 


* RE: [LSF/MM TOPIC] online filesystem repair
  2017-01-25  8:41       ` Darrick J. Wong
@ 2017-01-27 22:06         ` Slava Dubeyko
  0 siblings, 0 replies; 8+ messages in thread
From: Slava Dubeyko @ 2017-01-27 22:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs


-----Original Message-----
From: Darrick J. Wong [mailto:darrick.wong@oracle.com] 
Sent: Wednesday, January 25, 2017 12:42 AM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] online filesystem repair
 
> > Let's imagine that file system will register every metadata structure 
> > in generic online file checking subsystem. Then the file system will
>
> That sounds pretty harsh.  XFS (and ext4) hide quite a /lot/ of metadata. 
> We don't expose the superblocks, the free space header, the inode header,
> the free space btrees, the inode btrees, the reverse mapping btrees,
> the refcount btrees, the journal, or the rtdev space data.  I don't think
> we ought to expose any of that except to xfsprogs.
> For another thing, there are dependencies between those pieces of metadata,
> (e.g. the AGI has to work before we can check the inobt) and one has to take
> those into account when scrubbing.
>
> ext4 has a different set of internal metadata, but the same applies there too.

I didn't suggest exposing the metadata in the pure sense of the word. The key point of this
discussion is elaborating a vision of how generic online file system checking/recovery
could be done. It means that the VFS would have to represent a file system as some generic
set of items (for example, as a sequence of iterators). The VFS is the layer of generalized
management of any file system, and it interacts with concrete file systems by means of
specialized callbacks (file_operations, inode_operations and so on) that provide the
opportunity to implement a special way of managing a file system volume. So, as far as I can
see, an online file system check/recovery subsystem would have to look the same way. It would
work in a generalized manner, while specialized callbacks realize the specialized elementary
operations, and the concrete file system driver provides the specialization of those methods.

The really important point is the possible mode(s) of the online file system
check/recovery subsystem. I see two principal cases: (1) post-corruption check/recovery;
(2) preventive check.

We could consider the mount and/or unmount operations as the main point(s) of activity
for the online file system check/recovery subsystem in the post-corruption case. In this case,
struct super_operations could contain a check_method() and a recovery_method() that realize
all the specialized logic of checking/recovering a file system volume (see the sketch below).
All of the file system's peculiarities in metadata implementation and in the checking/recovery
algorithm would be hidden in these specialized methods. So, these are the possible cases in
which the online file system check/recovery subsystem could be used in the post-corruption
scenario:
(1) mount operation -> this is usually where we discover the file system corruption;
(2) remount in RO mode -> if we had some internal error in the file system driver;
(3) a special set of file system errors that initiate the check/recovery subsystem's activity;
(4) unmount operation -> check the file system volume's consistency at the end of the unmount operation.
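
A minimal sketch of those two methods; both names are invented here
and neither exists in struct super_operations today:

/* Hypothetical additions to struct super_operations. */
struct super_operations {
	/* ... existing methods ... */

	/* return 0 if the volume looks consistent, an error code if not */
	int (*check_method)(struct super_block *sb);
	/* try to repair whatever check_method() complained about */
	int (*recovery_method)(struct super_block *sb);
};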

It is also possible to consider checking a file system volume's state, or the state of some
metadata structure, while the volume is mounted. But, as far as I can see, we would need to
introduce new syscalls or special ioctl commands for that case, and I am not sure that
support for such requests would be easy to implement.

Another possible mode could be a preventive mode that checks the file system volume's state
before flush operations. In this case, the VFS should consider a file system volume as some
abstract sequence of metadata structures, and it would need to use some specialized methods
(registered by the file system driver) in a generic way. Let's imagine that the VFS had some
generic method of preventive checking for flush operations. Every metadata structure is,
anyway, split into nodes, logical blocks and so on. Usually such a node contains some header,
and the file system driver is able to check the consistency of such a node before the flush
operation. Of course, such a check can degrade the performance of flush operations, but it
could be the user's decision whether or not to use the preventive mode. Also, we cannot check
the relations between different nodes this way; the complete check can be done in the
post-corruption check/recovery mode.

> > need to register some set of checking methods or checking events for 
> > every registered metadata structure. For example:
> >
> > (1) check_access_metadata();
> > (2) check_metadata_modification();
> > (3) check_metadata_node();
> > (4) check_metadata_node_flush();
> > (5) check_metadata_nodes_relation().
> 
> How does the VFS know to invoke these methods on a piece of internal metadata that
> the FS owns and updates at its pleasure?  The only place we encode all the relationships
> between pieces of metadata is in the fs driver itself, and that's where scrubbing needs
> to take place.  The VFS only serves to multiplex the subset of operations that are common
> across all filesystems; everything else take the form of (semi-private) ioctls.

First of all, it's clear that we usually discover a file system volume's corrupted state
during the mount operation, so the VFS can easily invoke a method to check/recover a file
system volume in a generic manner at mount time. Also, the file system volume is unavailable
for any file system operations until the mount operation finishes, so the file system driver
can do anything it likes with the volume's metadata at that point. The unmount operation can
likewise be used in a generic manner for the same purpose.

Secondly, the VFS represents a file system's hierarchy by means of inodes, dentries, the page
cache and so on. That is the abstraction the VFS uses in memory on the OS side, but the file
system volume may not contain such items at all. For example, HFS+ has no inodes or dentries
in the file system volume; it uses btrees that contain file records and folder records, and
the HFS+ representation of metadata has to be converted into the VFS internal representation
during any operation that retrieves or stores metadata in the file system volume. When I talk
about check/recovery methods (check_metadata_node(), for example), I mean that we could have
some abstract representation of any metadata in the file system volume. It could be a simple
sequence of metadata nodes, for example. And if a sync operation was requested, then the whole
metadata structure should be consistent for the flush operation. So, when the VFS is about to
execute the sync_fs() operation, it could pass through all the abstract sequences of metadata
nodes and apply the check_metadata_node() callbacks, which would be executed by the concrete
file system driver. So the real check/recovery operation would be done by the fs driver
itself, but in a generic manner (see the sketch below).
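
A sketch of that generic walk; everything here is invented for
illustration, and no such hooks exist in the VFS today:

/*
 * Hypothetical VFS-side walk: before sync_fs() flushes anything, pass
 * through every metadata sequence the driver registered and let the
 * driver validate each node.
 */
static int check_metadata_before_sync(struct super_block *sb)
{
	struct metadata_seq *seq;
	struct metadata_node *node;
	int err;

	list_for_each_entry(seq, &sb->s_metadata_seqs, list) {
		metadata_seq_for_each_node(seq, node) {
			err = seq->ops->check_metadata_node(sb, node);
			if (err)
				return err;	/* driver complains or repairs */
		}
	}
	return 0;
}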

> > I think that it is possible to consider several possible level of 
> > generic online file system checking subsystem's activity: (1) light 
> > check mode; (2) regular check mode; (3) strict check mode.
> >
> > The "light check mode" can  be resulted in "fast" metadata nodes'
> > check on write operation with generation of error messages in the 
> > syslog with the request to check/recover file system volume by means 
> > of fsck tool.
> > 
> > The "regular check mode" can be resulted in: (1) the checking of any 
> > metadata modification with trying to correct the operation in the 
> > modification place; (2) metadata nodes' check on write operation with 
> > generation of error messages in the syslog.
> >
> > The "strict check mode" can be resulted in: (1) check mount operation 
> > with trying to recover the affected metadata structures; (2) the 
> > checking of any metadata modification with trying to correct the 
> > operation in the modification place; (3) check and recover metadata 
> > nodes on flush operation; (4) check/recover during unmount operation.
>
> I'm a little unclear about where you're going with all three of these things;
> the XFS metadata verifiers already do limited spot-checking of all metadata
> reads and writes without the VFS being directly involved.
> The ioctl performs more intense checking and cross-checking of metadata
> that would be too expensive to do on every access.

We are trying to talk about more than just XFS. If we are talking about a generic
online check/recovery subsystem, then it has to be good for all the other file systems
too. Again, if you believe that all check/recovery activity should be hidden from
the VFS, then it's not clear to me why you raised this topic.

XFS and other file systems do some metadata verification in the background of
VFS activity. Excellent... It sounds to me as if we simply need to generalize this
activity at the VFS level. As a minimum, we could consider the mount/unmount
operations for the online check/recovery subsystem. The VFS could also be
involved in some preventive metadata checking on a generalized basis.

When I talk about different checking modes, I mean that a user should have the
opportunity to select among different possible modes of the online check/recovery
subsystem, with different overheads. It does not have to be the set of modes that I
mentioned, but different users have different priorities: some users need performance,
others need reliability. Right now we cannot manage what a file system does with
metadata checking in the background, but a user would be able to opt for a proper
mode of online check/recovery activity if the VFS supported a generalized way of
metadata checking with different modes.

> > What do you like to expose to VFS level as generalized methods for 
> > your implementation?
> 
> Nothing.  A theoretical ext4 interface could look similar to XFS's,
> but the metadata-type codes would be different.  btrfs seems so much
> different structurally there's little point in trying.

So why did you raise this topic? If there is nothing to expose, then there is no topic. :)

> I also looked at ocfs2's online filecheck.  It's pretty clear they had
> different goals and ended up with a much different interface.

If we would like to talk about a generic VFS-based online check/recovery
subsystem, then we need to find some common points, and I think that is
possible. Do you mean that you don't see a way to generalize? What would be
the point of this discussion in that case?

Thanks,
Vyacheslav Dubeyko.
 

