* [LSF/MM TOPIC] online filesystem repair @ 2017-01-14 7:54 Darrick J. Wong 2017-01-16 0:01 ` Viacheslav Dubeyko 0 siblings, 1 reply; 8+ messages in thread From: Darrick J. Wong @ 2017-01-14 7:54 UTC (permalink / raw) To: lsf-pc, darrick.wong; +Cc: linux-fsdevel, linux-xfs Hi, I've been working on implementing online metadata scrubbing and repair in XFS. Most of the code is self contained inside XFS, but there's a small amount of interaction with the VFS freezer code that has to happen in order to shut down the filesystem to rebuild the extent backref records. It might be interesting to discuss the (fairly slight) requirements upon the VFS to support repairs, and/or have a BoF to discuss how to build an online checker if any of the other filesystems are interested in this. Concurrent with development of online scrubbing, I've also been working on a fuzz test suite for xfstests that fuzzes every field of every metadata object on the filesystem and then tries to crash the kernel, the offline repair tool (xfs_repair), or the online repair tool (xfs_scrub). I could talk about that as kind of a follow up to last year's AFL presentation, and what kinds of bugs it's uncovered. --D ^ permalink raw reply [flat|nested] 8+ messages in thread
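The per-field fuzz loop described above can be sketched roughly as follows. This is a hedged illustration, not the actual xfstests harness: the record layout, the mutation set, and the checker are all invented for the example; the real fuzzers operate on on-disk XFS structures via xfs_db.

```python
import copy

# Mutations loosely modeled on common metadata fuzz strategies:
# zero the field, saturate it, flip the low bit.
def mutations(value, width=32):
    yield 0
    yield (1 << width) - 1
    yield value ^ 1

def fuzz_record(record, checker):
    """Corrupt each field in turn; return the corruptions the checker missed."""
    missed = []
    for field, value in record.items():
        for bad in mutations(value):
            if bad == value:
                continue            # not actually a corruption
            corrupt = copy.copy(record)
            corrupt[field] = bad
            if checker(corrupt):    # checker still says "ok" -> a miss
                missed.append((field, bad))
    return missed

# A toy "inode core" and a deliberately weak checker that only validates
# the magic number, so mode/nlink corruption slips through undetected.
inode = {"magic": 0x494E, "mode": 0o100644, "nlink": 1}
weak_checker = lambda rec: rec["magic"] == 0x494E
```

Running `fuzz_record(inode, weak_checker)` flags every `mode` and `nlink` mutation as a missed corruption, which is exactly the kind of gap such a fuzz suite exists to find.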
* Re: [LSF/MM TOPIC] online filesystem repair 2017-01-14 7:54 [LSF/MM TOPIC] online filesystem repair Darrick J. Wong @ 2017-01-16 0:01 ` Viacheslav Dubeyko 2017-01-17 6:24 ` Darrick J. Wong 0 siblings, 1 reply; 8+ messages in thread From: Viacheslav Dubeyko @ 2017-01-16 0:01 UTC (permalink / raw) To: Darrick J. Wong, lsf-pc; +Cc: linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote: > Hi, > > I've been working on implementing online metadata scrubbing and > repair > in XFS. Most of the code is self contained inside XFS, but there's a > small amount of interaction with the VFS freezer code that has to > happen > in order to shut down the filesystem to rebuild the extent backref > records. It might be interesting to discuss the (fairly slight) > requirements upon the VFS to support repairs, and/or have a BoF to > discuss how to build an online checker if any of the other > filesystems > are interested in this. > How do you imagine a generic way to support repairs for different file systems? From one point of view, a generic online file system repair facility could be a really great subsystem. But, from another point of view, every file system has its own architecture, its own set of metadata, and its own way of doing fsck checking/recovery. As far as I can judge, there has been a significant amount of research effort in this direction (Recon [1], [2], for example). But we still don't have any real general online file system repair subsystem in the Linux kernel. Do you have some new insight? How does your vision differ? And if we had an online file system repair subsystem, how would a file system driver need to interact with it to perform internal repairs? Thanks, Vyacheslav Dubeyko. [1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf [2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM TOPIC] online filesystem repair 2017-01-16 0:01 ` Viacheslav Dubeyko @ 2017-01-17 6:24 ` Darrick J. Wong 0 siblings, 0 replies; 8+ messages in thread From: Darrick J. Wong @ 2017-01-17 6:24 UTC (permalink / raw) To: Viacheslav Dubeyko; +Cc: lsf-pc, linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko On Sun, Jan 15, 2017 at 04:01:30PM -0800, Viacheslav Dubeyko wrote: > On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote: > > Hi, > > > > I've been working on implementing online metadata scrubbing and > > repair > > in XFS. Most of the code is self contained inside XFS, but there's a > > small amount of interaction with the VFS freezer code that has to > > happen > > in order to shut down the filesystem to rebuild the extent backref > > records. It might be interesting to discuss the (fairly slight) > > requirements upon the VFS to support repairs, and/or have a BoF to > > discuss how to build an online checker if any of the other > > filesystems > > are interested in this. > > > > How do you imagine a generic way to support repairs for different file > systems? From one point of view, to have generic way of the online file > system repairing could be the really great subsystem. I don't, sadly. There's not even a way to /check/ all fs metadata in a "generic" manner -- we can use the standard VFS interfaces to read all metadata, but this is fraught. Even if we assume the fs can spot check obviously garbage values, that's still not the appropriate place for a full scan. > But, from another point of view, every file system has own > architecture, own set of metadata and own way to do fsck > check/recovering. Yes, and this wouldn't change. The particular mechanism of fixing a piece of metadata will always be fs-dependent, but the thing that I'm interested in discussing is how do we avoid having these kinds of things interact badly with the VFS? 
> As far as I can judge, there are significant amount of research > efforts in this direction (Recon [1], [2], for example). Yes, I remember Recon. I appreciated the insight that while it's impossible to block everything for a full scan, it /is/ possible to check a single object and its relation to other metadata items. The xfs scrubber also takes an incremental approach to verifying a filesystem; we'll lock each metadata object and verify that its relationships with the other metadata make sense. So long as we aren't bombarding the fs with heavy metadata update workloads, of course. On the repair side of things xfs added reverse-mapping records, which the repair code uses to regenerate damaged primary metadata. After we land inode parent pointers we'll be able to do the same reconstructions that we can now do for block allocations... ...but there are some sticky problems with repairing the reverse mappings. The normal locking order for that part of xfs is sb_writers -> inode -> ag header -> rmap btree blocks, but to repair we have to freeze the filesystem against writes so that we can scan all the inodes. > But we still haven't any real general online file system repair > subsystem in the Linux kernel. I think the ocfs2 developers have encoded some ability to repair metadata over the past year, though it seems limited to fixing some parts of inodes. btrfs stores duplicate copies and restores when necessary, I think. Unfortunately, fixing disk corruption is something that's not easily genericized, which means that I don't think we'll ever achieve a general subsystem. But we could at least figure out what in the VFS has to change (if anything) to support this type of usage. > Do you have some new insight? What's difference of your > vision? If we have online file system repair subsystem then how file > system driver will need to interact with the goal to make internal > repairing? It's pretty much all private xfs userspace ioctls[1] with a driver program[2]. 
--D [1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel [2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel > > Thanks, > Vyacheslav Dubeyko. > > [1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf > [2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
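The incremental, lock-one-object-at-a-time approach described in this message can be sketched as follows. This is a toy simulation under invented names (not XFS code): each metadata object carries its own lock and a set of outgoing references, and the scrubber verifies one object's cross-references while holding only that object's lock.

```python
import threading

class MetaObject:
    """A toy metadata object: a name, outgoing references, and a lock."""
    def __init__(self, name, refs):
        self.name = name
        self.refs = set(refs)
        self.lock = threading.Lock()

def scrub_one(obj, universe):
    """Check one object's cross-references while holding only its lock."""
    with obj.lock:
        # A reference is consistent if the target exists and points back.
        return [r for r in obj.refs
                if r not in universe or obj.name not in universe[r].refs]

# Toy filesystem: an inode extent record and the reverse-mapping record
# that should mirror it -- here the back-reference is missing.
fs = {
    "inode:42": MetaObject("inode:42", {"rmap:7"}),
    "rmap:7": MetaObject("rmap:7", set()),
}
problems = {name: bad
            for name, obj in fs.items()
            if (bad := scrub_one(obj, fs))}
```

The point of the sketch is the shape, not the detail: no global freeze is needed to notice that `inode:42` points at an rmap record that does not point back; only a full rebuild of the rmap data requires quiescing everything.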
* Re: [LSF/MM TOPIC] online filesystem repair 2017-01-17 6:24 ` Darrick J. Wong @ 2017-01-17 20:45 ` Andreas Dilger -1 siblings, 0 replies; 8+ messages in thread From: Andreas Dilger @ 2017-01-17 20:45 UTC (permalink / raw) To: Darrick J. Wong Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs, Vyacheslav.Dubeyko On Jan 16, 2017, at 11:24 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Sun, Jan 15, 2017 at 04:01:30PM -0800, Viacheslav Dubeyko wrote: >> On Fri, 2017-01-13 at 23:54 -0800, Darrick J. Wong wrote: >>> Hi, >>> >>> I've been working on implementing online metadata scrubbing and >>> repair >>> in XFS. Most of the code is self contained inside XFS, but there's a >>> small amount of interaction with the VFS freezer code that has to >>> happen >>> in order to shut down the filesystem to rebuild the extent backref >>> records. It might be interesting to discuss the (fairly slight) >>> requirements upon the VFS to support repairs, and/or have a BoF to >>> discuss how to build an online checker if any of the other >>> filesystems >>> are interested in this. >>> >> >> How do you imagine a generic way to support repairs for different file >> systems? From one point of view, to have generic way of the online file >> system repairing could be the really great subsystem. > > I don't, sadly. There's not even a way to /check/ all fs metadata in a > "generic" manner -- we can use the standard VFS interfaces to read > all metadata, but this is fraught. Even if we assume the fs can spot > check obviously garbage values, that's still not the appropriate place > for a full scan. > >> But, from another point of view, every file system has own >> architecture, own set of metadata and own way to do fsck >> check/recovering. > > Yes, and this wouldn't change. 
The particular mechanism of fixing a > piece of metadata will always be fs-dependent, but the thing that I'm > interested in discussing is how do we avoid having these kinds of things > interact badly with the VFS? > >> As far as I can judge, there are significant amount of research >> efforts in this direction (Recon [1], [2], for example). > > Yes, I remember Recon. I appreciated the insight that while it's > impossible to block everything for a full scan, it /is/ possible to > check a single object and its relation to other metadata items. The xfs > scrubber also takes an incremental approach to verifying a filesystem; > we'll lock each metadata object and verify that its relationships with > the other metadata make sense. So long as we aren't bombarding the fs > with heavy metadata update workloads, of course. It is worthwhile to note that Lustre has a distributed online filesystem checker (LFSCK) that works in a similar incremental manner, checking the status of each object w.r.t. other objects it is related to. This can be done reasonably well because there is extra Lustre metadata that has backpointers from data objects to inodes and from inodes to the parent directory (including hard links). That said, we depend on the local filesystem to be internally consistent, and LFSCK is only verifying/repairing Lustre-specific metadata that describes cross-server object relationships. Cheers, Andreas > On the repair side of things xfs added reverse-mapping records, which > the repair code uses to regenerate damaged primary metadata. After we > land inode parent pointers we'll be able to do the same reconstructions > that we can now do for block allocations... > > ...but there are some sticky problems with repairing the reverse > mappings. The normal locking order for that part of xfs is sb_writers > -> inode -> ag header -> rmap btree blocks, but to repair we have to > freeze the filesystem against writes so that we can scan all the inodes. 
> >> But we still haven't any real general online file system repair >> subsystem in the Linux kernel. > > I think the ocfs2 developers have encoded some ability to repair > metadata over the past year, though it seems limited to fixing some > parts of inodes. btrfs stores duplicate copies and restores when > necessary, I think. Unfortunately, fixing disk corruption is something > that's not easily genericized, which means that I don't think we'll ever > achieve a general subsystem. > > But we could at least figure out what in the VFS has to change (if > anything) to support this type of usage. > >> Do you have some new insight? What's difference of your >> vision? If we have online file system repair subsystem then how file >> system driver will need to interact with the goal to make internal >> repairing? > > It's pretty much all private xfs userspace ioctls[1] with a driver > program[2]. > > --D > > [1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel > [2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel > >> >> Thanks, >> Vyacheslav Dubeyko. >> >> [1] http://www.eecg.toronto.edu/~ashvin/publications/recon-fs-consistency-runtime.pdf >> [2] https://www.researchgate.net/publication/269300836_Managing_the_file_system_from_the_kernel >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [LSF/MM TOPIC] online filesystem repair 2017-01-17 6:24 ` Darrick J. Wong @ 2017-01-18 0:37 ` Slava Dubeyko 2017-01-25 8:41 ` Darrick J. Wong -1 siblings, 1 reply; 8+ messages in thread From: Slava Dubeyko @ 2017-01-18 0:37 UTC (permalink / raw) To: Darrick J. Wong, Viacheslav Dubeyko; +Cc: lsf-pc, linux-fsdevel, linux-xfs -----Original Message----- From: Darrick J. Wong [mailto:darrick.wong@oracle.com] Sent: Monday, January 16, 2017 10:25 PM To: Viacheslav Dubeyko <slava@dubeyko.com> Cc: lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> Subject: Re: [LSF/MM TOPIC] online filesystem repair > > How do you imagine a generic way to support repairs for different file > > systems? From one point of view, to have generic way of the online > > file system repairing could be the really great subsystem. > > I don't, sadly. There's not even a way to /check/ all fs metadata > in a "generic" manner -- we can use the standard VFS interfaces to read > all metadata, but this is fraught. Even if we assume the fs can spot check obviously > garbage values, that's still not the appropriate place for a full scan. Let's try to imagine a possible way of generalization. I can see the following critical points: (1) mount operation; (2) unmount/fsync operation; (3) readpage; (4) writepage; (5) reading a metadata block/node; (6) writing/flushing a metadata block/node; (7) modification/access of a metadata item. Let's imagine that a file system registers every metadata structure with a generic online file checking subsystem. Then the file system would need to register a set of checking methods or checking events for every registered metadata structure. For example: (1) check_access_metadata(); (2) check_metadata_modification(); (3) check_metadata_node(); (4) check_metadata_node_flush(); (5) check_metadata_nodes_relation(). 
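The registration scheme proposed above could be sketched as a table of per-structure hooks keyed by event. Everything here is hypothetical -- the hook names come from the list in the message, but no such kernel subsystem exists, and a real implementation would be C function pointers in an ops table rather than Python:

```python
# Registry mapping (filesystem, structure) -> {event name: check hook}.
checks_registry = {}

def register_metadata(fs_name, struct_name, hooks):
    """A filesystem driver registers per-structure checking hooks."""
    checks_registry[(fs_name, struct_name)] = hooks

def run_check(fs_name, struct_name, event, obj):
    """The generic subsystem fires an event against a metadata object."""
    hook = checks_registry.get((fs_name, struct_name), {}).get(event)
    if hook is None:
        return True   # no hook registered for this event: treat as passing
    return hook(obj)

# Example: a made-up filesystem registers a node check for its inode
# btree records that validates a free-count range.
register_metadata("examplefs", "inode_btree", {
    "check_metadata_node": lambda rec: 0 <= rec["freecount"] <= 64,
})
```

The generic layer only dispatches; all knowledge of what a valid record looks like stays inside the registered hooks, which is the division of labor the message is arguing for.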
I think it is possible to consider several levels of activity for a generic online file system checking subsystem: (1) light check mode; (2) regular check mode; (3) strict check mode. The "light check mode" would perform a fast check of metadata nodes on write, generating error messages in the syslog that request a check/recovery of the file system volume by means of the fsck tool. The "regular check mode" would add: (1) checking every metadata modification and trying to correct the operation at the modification site; (2) checking metadata nodes on write, with error messages in the syslog. The "strict check mode" would add: (1) checking at mount time and trying to recover the affected metadata structures; (2) checking every metadata modification and trying to correct the operation at the modification site; (3) checking and recovering metadata nodes on flush; (4) checking/recovering during the unmount operation. What would you like to expose at the VFS level as generalized methods for your implementation? > > But, from another point of view, every file system has own > > architecture, own set of metadata and own way to do fsck > > check/recovering. > > Yes, and this wouldn't change. The particular mechanism of fixing a piece of > metadata will always be fs-dependent, but the thing that I'm interested in > discussing is how do we avoid having these kinds of things interact badly with the VFS? Let's start from the simplest case. You have the current implementation. How do you see delegating some activity in your implementation to the VFS in the form of generalized methods? Let's imagine that the VFS has some callbacks into the file system side. What would they be? > > As far as I can judge, there are significant amount of research > > efforts in this direction (Recon [1], [2], for example). > > Yes, I remember Recon. 
I appreciated the insight that while it's impossible > to block everything for a full scan, it /is/ possible to check a single object and > its relation to other metadata items. The xfs scrubber also takes an incremental > approach to verifying a filesystem; we'll lock each metadata object and verify that > its relationships with the other metadata make sense. So long as we aren't bombarding > the fs with heavy metadata update workloads, of course. > > On the repair side of things xfs added reverse-mapping records, which the repair code > uses to regenerate damaged primary metadata. After we land inode parent pointers > we'll be able to do the same reconstructions that we can now do for block allocations... > > ...but there are some sticky problems with repairing the reverse mappings. > The normal locking order for that part of xfs is sb_writers > -> inode -> ag header -> rmap btree blocks, but to repair we have to > freeze the filesystem against writes so that we can scan all the inodes. Yes, the necessary freezing of the file system is a really tricky point. From one point of view, it is possible to use a "light checking mode" that simply checks and complains about possible trouble at the proper time (maybe with a remount in RO mode). From another point of view, we would need a special file system architecture and/or a special mode of VFS operation. Let's imagine that the file system volume is split into groups/aggregations/objects with dedicated metadata. Then, theoretically, the VFS would be able to freeze such a group/aggregation/object for checking and recovery without affecting the availability of the whole file system volume. File system operations would then be redirected into the active (not frozen) groups/aggregations/objects. Thanks, Vyacheslav Dubeyko. ^ permalink raw reply [flat|nested] 8+ messages in thread
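The three policy levels proposed in this message map naturally onto a mode-to-enabled-actions table. A sketch with hypothetical hook names (nothing like this exists in the VFS today):

```python
from enum import Enum

class CheckMode(Enum):
    LIGHT = 1    # verify on write, complain in syslog, defer to fsck
    REGULAR = 2  # additionally verify each modification in place
    STRICT = 3   # additionally check/repair at mount, flush, and unmount

# Which checking actions each policy level enables; the action names
# are invented for the example.
ACTIONS = {
    CheckMode.LIGHT: {"verify_on_write"},
    CheckMode.REGULAR: {"verify_on_write", "verify_on_modify"},
    CheckMode.STRICT: {"verify_on_write", "verify_on_modify",
                       "repair_at_mount", "repair_at_flush",
                       "repair_at_unmount"},
}

def should_run(mode, action):
    """Would this policy level trigger the given checking action?"""
    return action in ACTIONS[mode]
```

Encoding the levels as strictly nested sets makes the trade-off explicit: each step up buys more coverage at more hooks, and therefore more overhead on the hot paths.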
* Re: [LSF/MM TOPIC] online filesystem repair 2017-01-18 0:37 ` Slava Dubeyko @ 2017-01-25 8:41 ` Darrick J. Wong 2017-01-27 22:06 ` Slava Dubeyko 0 siblings, 1 reply; 8+ messages in thread From: Darrick J. Wong @ 2017-01-25 8:41 UTC (permalink / raw) To: Slava Dubeyko; +Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs On Wed, Jan 18, 2017 at 12:37:02AM +0000, Slava Dubeyko wrote: > > -----Original Message----- > From: Darrick J. Wong [mailto:darrick.wong@oracle.com] > Sent: Monday, January 16, 2017 10:25 PM > To: Viacheslav Dubeyko <slava@dubeyko.com> > Cc: lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org; Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> > Subject: Re: [LSF/MM TOPIC] online filesystem repair > > > > How do you imagine a generic way to support repairs for different file > > > systems? From one point of view, to have generic way of the online > > > file system repairing could be the really great subsystem. > > > > I don't, sadly. There's not even a way to /check/ all fs metadata > > in a "generic" manner -- we can use the standard VFS interfaces to read > > all metadata, but this is fraught. Even if we assume the fs can spot check obviously > > garbage values, that's still not the appropriate place for a full scan. > > Let's try to imagine a possible way of generalization. I can see such > critical points: > (1) mount operation; > (2) unmount/fsync operation; > (3) readpage; > (4) writepage; > (5) read metadata block/node; > (6) write/flush metadata block/node. > (7) metadata's item modification/access. > > Let's imagine that file system will register every metadata structure > in generic online file checking subsystem. Then the file system will That sounds pretty harsh. XFS (and ext4) hide quite a /lot/ of metadata. 
We don't expose the superblocks, the free space header, the inode header, the free space btrees, the inode btrees, the reverse mapping btrees, the refcount btrees, the journal, or the rtdev space data. I don't think we ought to expose any of that except to xfsprogs. For another thing, there are dependencies between those pieces of metadata, (e.g. the AGI has to work before we can check the inobt) and one has to take those into account when scrubbing. ext4 has a different set of internal metadata, but the same applies there too. > need to register some set of checking methods or checking events for > every registered metadata structure. For example: > > (1) check_access_metadata(); > (2) check_metadata_modification(); > (3) check_metadata_node(); > (4) check_metadata_node_flush(); > (5) check_metadata_nodes_relation(). How does the VFS know to invoke these methods on a piece of internal metadata that the FS owns and updates at its pleasure? The only place we encode all the relationships between pieces of metadata is in the fs driver itself, and that's where scrubbing needs to take place. The VFS only serves to multiplex the subset of operations that are common across all filesystems; everything else take the form of (semi-private) ioctls. > I think that it is possible to consider several possible level of > generic online file system checking subsystem's activity: (1) light > check mode; (2) regular check mode; (3) strict check mode. > > The "light check mode" can be resulted in "fast" metadata nodes' > check on write operation with generation of error messages in the > syslog with the request to check/recover file system volume by means > of fsck tool. > > The "regular check mode" can be resulted in: (1) the checking of any > metadata modification with trying to correct the operation in the > modification place; (2) metadata nodes' check on write operation with > generation of error messages in the syslog. 
> > The "strict check mode" can be resulted in: (1) check mount operation > with trying to recover the affected metadata structures; (2) the > checking of any metadata modification with trying to correct the > operation in the modification place; (3) check and recover metadata > nodes on flush operation; (4) check/recover during unmount operation. I'm a little unclear about where you're going with all three of these things; the XFS metadata verifiers already do limited spot-checking of all metadata reads and writes without the VFS being directly involved. The ioctl performs more intense checking and cross-checking of metadata that would be too expensive to do on every access. > What do you like to expose to VFS level as generalized methods for > your implementation? Nothing. A theoretical ext4 interface could look similar to XFS's, but the metadata-type codes would be different. btrfs seems so much different structurally there's little point in trying. I also looked at ocfs2's online filecheck. It's pretty clear they had different goals and ended up with a much different interface. > > > But, from another point of view, every file system has own > > > architecture, own set of metadata and own way to do fsck > > > check/recovering. > > > > Yes, and this wouldn't change. The particular mechanism of fixing a piece of > > metadata will always be fs-dependent, but the thing that I'm interested in > > discussing is how do we avoid having these kinds of things interact badly with the VFS? > > Let's start from the simplest case. You have the current > implementation. How do you see the way to delegate to VFS some > activity in your implementation in the form of generalized methods? > Let's imagine that VFS will have some callbacks from file system side. > What could it be? > > > > As far as I can judge, there are significant amount of research > > > efforts in this direction (Recon [1], [2], for example). > > > > Yes, I remember Recon. 
I appreciated the insight that while it's impossible > > to block everything for a full scan, it /is/ possible to check a single object and > > its relation to other metadata items. The xfs scrubber also takes an incremental > > approach to verifying a filesystem; we'll lock each metadata object and verify that > > its relationships with the other metadata make sense. So long as we aren't bombarding > > the fs with heavy metadata update workloads, of course. > > > > On the repair side of things xfs added reverse-mapping records, which the repair code > > uses to regenerate damaged primary metadata. After we land inode parent pointers > > we'll be able to do the same reconstructions that we can now do for block allocations... > > > > ...but there are some sticky problems with repairing the reverse mappings. > > The normal locking order for that part of xfs is sb_writers > > -> inode -> ag header -> rmap btree blocks, but to repair we have to > > freeze the filesystem against writes so that we can scan all the inodes. > > Yes, the necessary freezing of file system is really tricky point. > From one point of view, it is possible to use "light checking mode" > that will simply check and complain about possible troubles at proper > time (maybe with remount in RO mode). Yes, scrub does this fairly lightweight checking -- no freezing, no remounting, etc. If checking something would mean violating locking rules (which would require a quiesced fs) then we simply hope that the scan process eventually checks it via the normative locking paths. For example, the rmapbt scrubber doesn't cross reference inode extent records with the inode block maps because we lock inode -> agf -> rmapbt; it relies on the scrub program eventually locking the inode to check the block map and then cross-referencing with the rmap data. 
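The locking order quoted above (sb_writers -> inode -> ag header -> rmap btree blocks) is the kind of invariant a simple rank check can express. A sketch, not the kernel's lockdep; the lock names follow the text:

```python
# Ranks follow the documented order: a lock may only be taken after
# locks that rank strictly below it.
ORDER = ["sb_writers", "inode", "ag_header", "rmap_btree"]
RANK = {name: i for i, name in enumerate(ORDER)}

def acquire(held, lock):
    """Take `lock`; refuse if that would invert the documented order."""
    if held and RANK[lock] <= max(RANK[h] for h in held):
        raise RuntimeError("lock order violation: %s after %r" % (lock, held))
    return held + [lock]

# The normal scrub path obeys the order...
held = acquire([], "sb_writers")
held = acquire(held, "inode")
held = acquire(held, "ag_header")
# ...but a repair path that needed an inode lock *after* the AG header
# would invert it -- which is why rmap repair freezes the filesystem
# rather than locking inodes underneath AG locks.
```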
For repair (of only the rmap data) we have to be able to access arbitrary files, so that requires a sync and then shutting down the filesystem while we do it, so that nothing else can lock inodes. I don't know if this is 100% bulletproof; the fsync might take the fs down before we even get to repairing. Really what this comes down to is a discussion of how to suspend user IOs temporarily and how to reinitialize the mm/vfs view of a part of the world if the filesystem wants to do that. > Otherwise, from another point of view, we need in special file system > architecture or/and special way of VFS functioning. Let's imagine that > file system volume will be split on some groups/aggregations/objects > with dedicated metadata. Then, theoretically, VFS is able to freeze > such group/aggregation/object for check and recovering without > affection the availability of the whole file system volume. It means > that file system operations should be redirected into active (not > frozen) groups/aggregations/objects. One could in theory teach XFS how to shut down AGs, which would redirect block/inode allocations elsewhere. Freeing would be a mess though. My goal is to make scrub & repair fast enough that we don't need that. --D > > Thanks, > Vyacheslav Dubeyko. > > Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: > > This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. > ^ permalink raw reply [flat|nested] 8+ messages in thread
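The freeze-scan-thaw sequence discussed in this message can be sketched as a writer-blocking gate: user writes pass through it in the common case, and the repair scan closes it and waits for in-flight writes to drain before touching anything. A toy model only; the kernel's freeze_super/thaw_super machinery is far more involved.

```python
import threading

class FreezeGate:
    """User writes hold the gate shared; freeze excludes them entirely."""
    def __init__(self):
        self._cond = threading.Condition()
        self._writers = 0
        self.frozen = False

    def write_begin(self):
        with self._cond:
            while self.frozen:          # new writes wait out the freeze
                self._cond.wait()
            self._writers += 1

    def write_end(self):
        with self._cond:
            self._writers -= 1
            self._cond.notify_all()

    def freeze(self):
        with self._cond:
            self.frozen = True
            while self._writers:        # drain in-flight writes
                self._cond.wait()

    def thaw(self):
        with self._cond:
            self.frozen = False
            self._cond.notify_all()

gate = FreezeGate()
gate.write_begin(); gate.write_end()    # an ordinary user write
gate.freeze()                           # quiesce, as before an rmap rebuild
scanned = ["inode:%d" % i for i in range(3)]  # scan with writers excluded
gate.thaw()
```

The open question the message raises sits exactly at `freeze()`: whether the fs can get there reliably when the preceding sync might itself shut the filesystem down.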
* RE: [LSF/MM TOPIC] online filesystem repair 2017-01-25 8:41 ` Darrick J. Wong @ 2017-01-27 22:06 ` Slava Dubeyko 0 siblings, 0 replies; 8+ messages in thread From: Slava Dubeyko @ 2017-01-27 22:06 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Viacheslav Dubeyko, lsf-pc, linux-fsdevel, linux-xfs

-----Original Message-----
From: Darrick J. Wong [mailto:darrick.wong@oracle.com]
Sent: Wednesday, January 25, 2017 12:42 AM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>; lsf-pc@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; linux-xfs@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] online filesystem repair

> > Let's imagine that the file system will register every metadata
> > structure in a generic online file checking subsystem. Then the
> > file system will
>
> That sounds pretty harsh. XFS (and ext4) hide quite a /lot/ of
> metadata. We don't expose the superblocks, the free space header, the
> inode header, the free space btrees, the inode btrees, the reverse
> mapping btrees, the refcount btrees, the journal, or the rtdev space
> data. I don't think we ought to expose any of that except to
> xfsprogs. For another thing, there are dependencies between those
> pieces of metadata (e.g. the AGI has to work before we can check the
> inobt) and one has to take those into account when scrubbing.
>
> ext4 has a different set of internal metadata, but the same applies
> there too.

I didn't suggest exposing the metadata in the pure sense of the word.
The key point of this discussion is the elaboration of a vision of how
generic online file system checking/recovery can be done. It means
that VFS has to represent a file system as some generic set of items
(for example, as a sequence of iterators). The VFS is the layer of
generalized management of any file system.
And VFS interacts with concrete file systems by means of specialized
callbacks (file_operations, inode_operations and so on) that provide
the opportunity to implement some special way of file system volume
management. So, as far as I can see, the online file system
check/recovery subsystem has to work in the same way. It needs to work
in a generalized manner, but specialized callbacks will realize the
specialized elementary operations. And the concrete file system driver
should provide the specialization of such methods.

The really important point is the possible mode(s) of the online file
system check/recovery subsystem. I see two principal cases: (1)
post-corruption check/recovery; (2) preventive check.

We could consider the mount and/or unmount operations as the main
point(s) of the online file system check/recovery subsystem's activity
for the post-corruption case. In this case struct super_operations
could contain check_method() and recovery_method() that will realize
all the specialized logic of checking/recovering a file system volume.
All of a file system's peculiarities in metadata implementation and
checking/recovering algorithms will be hidden in these specialized
methods. So, the online file system check/recovery subsystem could be
used for the post-corruption case in the following situations: (1)
mount operation -> we usually discover the file system corruption
here; (2) remount in RO mode -> if we had some internal error in the
file system driver; (3) a special set of file system errors that
initiates check/recovery subsystem activity; (4) unmount operation ->
check file system volume consistency at the end of the unmount
operation.

Also, it is possible to consider the opportunity to check a file
system volume's state, or the state of some metadata structure, while
the volume is mounted. But, as far as I can see, we would need to
introduce new syscalls or special ioctl commands for such a case.
And I am not sure that it will be easy to implement support for such
requests.

Another possible mode could be a preventive mode of checking a file
system volume's state before a flush operation. In this case, VFS
should consider a file system volume as some abstract sequence of
metadata structures. It means that VFS needs to use some specialized
methods (registered by the file system driver) in a generic way. Let's
imagine that VFS has some generic method of preventive checking of
flush operations. I mean here that, anyway, every metadata structure
is split between nodes, logical blocks and so on. Usually such a node
contains some header, and the file system driver is able to check the
consistency of such a node before the flush operation. Of course, such
a check can degrade the performance of the flush operation. But it
could be the decision of a user to use or not to use the preventive
mode. Also, we cannot check the relations between different nodes this
way. The complete check can be done during the post-corruption
check/recovery mode.

> > need to register some set of checking methods or checking events
> > for every registered metadata structure. For example:
> >
> > (1) check_access_metadata();
> > (2) check_metadata_modification();
> > (3) check_metadata_node();
> > (4) check_metadata_node_flush();
> > (5) check_metadata_nodes_relation().
>
> How does the VFS know to invoke these methods on a piece of internal
> metadata that the FS owns and updates at its pleasure? The only place
> we encode all the relationships between pieces of metadata is in the
> fs driver itself, and that's where scrubbing needs to take place. The
> VFS only serves to multiplex the subset of operations that are common
> across all filesystems; everything else takes the form of
> (semi-private) ioctls.

First of all, it's clear that we usually discover a file system
volume's corrupted state during the mount operation.
So, VFS can easily invoke the method of checking/recovering a file
system volume in a generic manner during the mount operation. Also,
the file system volume is unavailable for any file system operations
until the mount operation finishes. So, the file system driver can do
anything with the metadata of the file system volume at its pleasure.
Secondly, the unmount operation can be used just as easily in a
generic manner for the same purpose.

Thirdly, VFS represents a file system's hierarchy by means of inodes,
dentries, the page cache and so on. It is an abstraction that VFS uses
in memory on the OS side. But the file system volume may not contain
such items at all. For example, HFS+ has no inodes or dentries in the
file system volume. It uses b-trees that contain file records and
folder records. And it needs to convert the HFS+ representation of
metadata into the VFS internal representation during any operation
that retrieves or stores metadata in the file system volume. When I am
talking about some check/recovery methods (check_metadata_node(), for
example), I mean that we could have some abstract representation of
any metadata in the file system volume. It could be a simple sequence
of metadata nodes, for example. And if a sync operation was requested,
it means that the whole metadata structure should be consistent for
the flush operation. So, if VFS is trying to execute the sync_fs()
operation, then it is possible to pass through all the abstract
sequences of metadata nodes and apply the check_metadata_node()
callbacks that will be executed by the concrete file system driver.
So, the real check/recovery operation will be done by the fs driver
itself, but in a generic manner.

> > I think that it is possible to consider several possible levels of
> > the generic online file system checking subsystem's activity: (1)
> > light check mode; (2) regular check mode; (3) strict check mode.
> > The "light check mode" can result in a "fast" metadata node check
> > on the write operation, with generation of error messages in the
> > syslog requesting a check/recovery of the file system volume by
> > means of the fsck tool.
> >
> > The "regular check mode" can result in: (1) the checking of any
> > metadata modification, with an attempt to correct the operation in
> > the modification place; (2) a metadata node check on the write
> > operation, with generation of error messages in the syslog.
> >
> > The "strict check mode" can result in: (1) a check on the mount
> > operation, with an attempt to recover the affected metadata
> > structures; (2) the checking of any metadata modification, with an
> > attempt to correct the operation in the modification place; (3)
> > checking and recovering metadata nodes on the flush operation; (4)
> > check/recovery during the unmount operation.
>
> I'm a little unclear about where you're going with all three of these
> things; the XFS metadata verifiers already do limited spot-checking
> of all metadata reads and writes without the VFS being directly
> involved. The ioctl performs more intense checking and cross-checking
> of metadata that would be too expensive to do on every access.

We are not talking about XFS only. If we talk about a generic online
check/recovery subsystem, then it has to be good for all the other
file systems too. Again, if you believe that all check/recovery
activity should be hidden from VFS, then it's not clear to me why you
raised this topic. XFS and other file systems do some metadata
verification in the background of VFS activity. Excellent... It sounds
to me like we simply need to generalize this activity at the VFS
level. At minimum, we could consider the mount/unmount operations for
the case of an online check/recovery subsystem. Also, VFS is able to
be involved in some preventive metadata checking on a generalized
basis.
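[Editor's note: the "preventive" per-node check discussed in this
thread -- verify a node's header against its contents before it is
flushed -- could look something like the userspace sketch below. The
on-disk layout and the check_metadata_node_flush() name are invented
for illustration.]

```python
import zlib

def make_node(payload: bytes) -> bytes:
    """Prepend a 4-byte little-endian CRC32 header to a node payload."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload

def check_metadata_node_flush(node: bytes) -> bool:
    """Return True if the node's header checksum matches its payload."""
    stored = int.from_bytes(node[:4], "little")
    return stored == zlib.crc32(node[4:])

good = make_node(b"btree node contents")
bad = bytearray(good)
bad[10] ^= 0xFF               # simulate a bit flip in the payload
print(check_metadata_node_flush(good), check_metadata_node_flush(bytes(bad)))
# -> True False
```

As the thread notes, this per-node check catches a damaged node before
writeback but says nothing about relations between nodes; that part
still needs a full cross-referencing pass.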
When I am talking about different checking modes, I mean that a user
should have the opportunity to select among different possible modes
of the online check/recovery subsystem, with different overheads. It
does not have to be the set of modes that I mentioned. But different
users have different priorities. Some users need performance, others
need reliability. Right now we cannot manage what a file system does
with metadata checking in the background. But a user will be able to
opt for a proper mode of online check/recovery subsystem activity if
the VFS supports a generalized way of metadata checking with different
modes.

> > What do you like to expose to the VFS level as generalized methods
> > for your implementation?
>
> Nothing. A theoretical ext4 interface could look similar to XFS's,
> but the metadata-type codes would be different. btrfs seems so much
> different structurally there's little point in trying.

So, why did you raise this topic? If nothing, then no topic. :)

> I also looked at ocfs2's online filecheck. It's pretty clear they had
> different goals and ended up with a much different interface.

If we would like to talk about a generic VFS-based online
check/recovery subsystem, then we need to find some common points. I
think it's possible. Do you mean that you don't see a way toward
generalization? What's the point of this discussion in such a case?

Thanks,
Vyacheslav Dubeyko.
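[Editor's note: to make the proposed generalization concrete, here is a
toy model of the hypothetical check_method()/recovery_method()
callbacks from earlier in the thread, with a generic mount path that
knows only those two callbacks. None of these names exist in the real
kernel; the "driver" below is a stand-in for any concrete filesystem.]

```python
# The generic layer dispatches through two driver-supplied callbacks,
# mirroring how struct super_operations multiplexes real operations.
class SuperOperations:
    def __init__(self, check_method, recovery_method):
        self.check_method = check_method
        self.recovery_method = recovery_method

def generic_mount(volume, s_op):
    """Generic mount path: check first, attempt recovery on failure."""
    if s_op.check_method(volume):
        return "mounted"
    if s_op.recovery_method(volume):
        return "mounted after recovery"
    return "mount refused"

# A pretend driver whose "check" is a magic-number test and whose
# "recovery" restores the magic from a backup field.
def myfs_check(vol):
    return vol.get("magic") == 0x58465342

def myfs_recover(vol):
    vol["magic"] = vol.get("backup_magic", 0)
    return myfs_check(vol)

ops = SuperOperations(myfs_check, myfs_recover)
print(generic_mount({"magic": 0, "backup_magic": 0x58465342}, ops))
# -> mounted after recovery
```

This is the crux of the disagreement in the thread: the dispatch shell
is trivially generic, but everything of substance lives inside the
driver callbacks, which is where Darrick argues it already is today.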