archive mirror
 help / color / mirror / Atom feed
From: "Theodore Ts'o" <>
To: Gabriel Krisman Bertazi <>
Subject: Re: [RFC] Filesystem error notifications proposal
Date: Tue, 2 Feb 2021 17:34:34 -0500	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> Thanks for the explanation.  That makes sense to me.  For corruptions
> where it is impossible to map to a mountpoint, I thought they could be
> considered global filesystem errors, being exposed only to someone
> watching the entire filesystem (like FAN_MARK_FILESYSTEM).

At least for ext4, there are only 3 ext4_error_*() that we could map
to a subtree without having to make changes to the call points:

% grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
% grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
% grep -i ext4_error\( fs/ext4/*.c  | wc -l

So in practice, unless we want to make a lot of changes to ext4, most
of them will be global file system errors....

> But, as you mentioned regarding the google use case, the entire idea of
> watching a subtree is a bit beyond the scope of my use-case, and was
> only added given the feedback on the previous proposal of this feature.
> While nice to have, I don't have the need to watch different mountpoints
> for errors, only the entire filesystem.

I suspect that for most use cases, the most interesting thing is the
first error.  We already record this in the ext4 superblock, because
unfortunately, I can't guarantee that system administrators have
correctly configured their system logs, so when handling upstream bug
reports, I can just ask them to run dumpe2fs -h on the file system:

FS Error count:           2
First error time:         Tue Feb  2 16:27:42 2021
First error function:     ext4_lookup
First error line #:       1704
First error inode #:      12
First error err:          EFSCORRUPTED
Last error time:          Tue Feb  2 16:27:59 2021
Last error function:      ext4_lookup
Last error line #:        1704
Last error inode #:       12
Last error err:           EFSCORRUPTED

So it's not just the Google case.  I'd argue for most system
administrator, one of the most useful things is when the file system
was first found to be corrupted, so they can try correlating file
system corruptions, with, say, reports of I/O errors, or OOM kils,
etc.  This can also be useful for correlating the start of file system
problems with problems at the application layer --- say, MongoDB,
MySQL, etc.

The reason why a notification system useful is because if you are
using database some kind of high-availability replication system, and
if there are problems detected in the file system of the primary MySQL
server, you'd want to have the system fail over to the secondary MySQL
server.  Sure, you *could* do this by polling the superblock, but
that's not the most efficient way to do things.

> There was a concern raised against my original submission which did
> superblock watching, due to the fact that a superblock is an internal
> kernel structure that must not be visible to the userspace.  It was
> suggested we should be speaking in terms of paths and mountpoint.  But,
> considering the existence of FAN_MARK_FILESYSTEM, I think it is the exact
> same object I was looking for when proposing the watch_sb syscall.

Sure, using the owner of the root directory (the mountpoint), might be
one way of doing things.  I think CAP_SYS_ADMIN is also fine, since
for many of the use cases, such as shutting down the file system so
the database server can fail over to the secondary service, you're
going to need root anyway.  (Especially for ext4, the vast majority of
the errors are going to be FAN_MARK_FILESYSTEM anyway.)

> The main use case is, as you said, corruption detection with enough
> information to allow us to trigger automatic recovery and data rebuilding
> tools.  I understand now I can drop most of the debug info, as you
> mentioned.  In this sense, the feature looks more like netoops.

Things like the bad block number, and the more structured information
is a nice to have.  It might also be that using the structured
information might be a more efficient way to get the information to
userspace.  So one could imagine using, say, an 32 character error
text code, sasy, "ext4_bad_dir_block_csum", followed by a 64-bit inode
number, a 64-bit containing directory inode number, a 64-bit block
number, and 8-bits of filename length, followed the first 7 characters
of the filename, followed by the last 8 characters of the filename.
That's 72 bytes, which is quite compact.  Adding something like 16
bytes of function names and 2 bytes of line number would net 90 bytes.

> But there are other uses that interests us, like pattern analysis of
> error locations, such that we can detect and perhaps predict errors.
> One idea that was mentioned is if an error occurs frequently enough
> in a specific function, there might be a bug in that function.  This is
> one of the reasons we are pushing to include function:line in the error
> message.

I agree.  But if this is causing too much resistance, we can just
encode some the error information as a file system specific text or
byte string.  Even we are restricted to, say, 96 or 80 bytes, there's
a lot we can do, even if we can't do notification continuations.


						- Ted

  reply	other threads:[~2021-02-02 22:36 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
2021-01-21  4:01 ` Viacheslav Dubeyko
2021-01-21 11:44 ` Jan Kara
2021-01-21 13:27   ` Amir Goldstein
2021-01-21 18:56   ` Gabriel Krisman Bertazi
2021-01-21 22:44 ` Theodore Ts'o
2021-01-22  0:44   ` Gabriel Krisman Bertazi
2021-01-22  7:36     ` Amir Goldstein
2021-02-02 20:51       ` Gabriel Krisman Bertazi
2021-01-28 22:28     ` Theodore Ts'o
2021-02-02 20:26       ` Gabriel Krisman Bertazi
2021-02-02 22:34         ` Theodore Ts'o [this message]
2021-02-08 18:49           ` Gabriel Krisman Bertazi
2021-02-08 22:19             ` Dave Chinner
2021-02-09  1:08               ` Theodore Ts'o
2021-02-09  5:12                 ` Khazhismel Kumykov
2021-02-09  8:55                 ` Dave Chinner
2021-02-09 17:57                   ` Theodore Ts'o
2021-02-10  0:52                     ` Darrick J. Wong
2021-02-10  2:21                       ` Theodore Ts'o
2021-02-10  2:32                         ` Darrick J. Wong
2021-02-09 17:35               ` Jan Kara
2021-02-10  0:22                 ` Darrick J. Wong
2021-02-10  7:46                 ` Dave Chinner
2021-02-10  0:49               ` Darrick J. Wong
2021-02-10  0:09 ` Darrick J. Wong
2021-02-10  7:23   ` Amir Goldstein
2021-02-10 23:29   ` Gabriel Krisman Bertazi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).