linux-fsdevel.vger.kernel.org archive mirror
* [RFC] Filesystem error notifications proposal
@ 2021-01-20 20:13 Gabriel Krisman Bertazi
  2021-01-21  4:01 ` Viacheslav Dubeyko
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-01-20 20:13 UTC (permalink / raw)
  To: tytso, jack, viro, amir73il, dhowells, david, darrick.wong, khazhy
  Cc: linux-fsdevel, kernel


My apologies for the long email.

Please let me know your thoughts.

1 Summary
=========

  I'm looking for a filesystem-agnostic mechanism to report filesystem
  errors to a monitoring tool in userspace.  I experimented first with
  the watch_queue API but decided to move to fanotify for a few reasons.


2 Background
============

  I submitted a first set of patches, based on David Howells' original
  superblock notifications patchset, that I expanded into error
  reporting and had an example implementation for ext4.  Upstream review
  has revealed a few design problems:

  - Including the "function:line" tuple in the notification allows
    unique identification of the error origin, but it also ties the
    decoding of the error to the source code, i.e. <function:line> is
    expected to change between releases.

  - Useful debug data (inode number, block group) have formats specific
    to each filesystem, so my design wouldn't be extensible to
    filesystems other than ext4.

  - The implementation only allowed an error description string
    (equivalent to what would be printed to dmesg) that is too short,
    since it needs to fit in a single notification.

  - How the user sees the filesystem.  The original patch points to a
    mountpoint but uses the term superblock, in order to differentiate
    it from another mechanism under development for watching mount
    operations.

  - Visibility of objects.  A bind mount of a subtree shouldn't receive
    notifications of objects outside of that bind mount.


2.1 Other constraints of the watch_queue API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  watch_queue is a fairly new framework, with one upstream user in the
  keyring subsystem.  watch_queue is designed to quickly submit very
  short (at most 128 bytes) atomic notifications to userspace through a
  ring buffer.  There is currently no mechanism to link notifications
  that require more than one slot, and such a mechanism wouldn't be
  trivial to implement, since buffer overruns could overwrite the
  beginning/end of a multi-part notification.  In addition, watch_queue
  requires an out-of-band overflow notification mechanism, which would
  need to be implemented apart from the system call, in a separate API.


2.2 fanotify vs watch_queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~

  watch_queue is designed for efficiency and fast submission of a large
  number of notifications.  It doesn't require memory allocation in the
  submission path, but at the cost of flexibility, such as the limited
  notification size.  fanotify is more flexible, allowing larger
  notifications and better filtering, but it requires allocations on
  the submission path.

  On the other hand, fanotify already supports the visibility semantics
  we are looking for: fsnotify allows an inode to notify only its
  subtree, a mountpoint, or the entire filesystem, depending on the
  watcher flags, while equivalent support would need to be written from
  scratch for watch_queue.  fanotify also has in-band overflow handling
  already implemented.  Finally, since fanotify supports much larger
  notifications, there is no need to link separate notifications, which
  prevents buffer overruns from erasing parts of a notification chain.

  fanotify is based on fsnotify, and the infrastructure for the
  visibility semantics is mostly implemented by fsnotify itself.  It
  would be possible to make error notifications a new mechanism on top
  of fsnotify, without modifying fanotify, but that would require
  exposing a very similar interface to userspace and new system calls,
  which doesn't seem justifiable since we can extend the fanotify
  syscalls without ABI breakage.


3 Proposal
==========

  The error notification framework is based on fanotify instead of
  watch_queue.  It is exposed through a new set of FAN_ERROR_* marks,
  set through the fanotify_mark(2) API.

  fanotify (fsnotify-based) has the infrastructure in place to link
  notifications happening on filesystem objects to mountpoints and to
  filesystems, and it already exposes an interface with well-defined
  semantics for how those are exposed to watchers in different
  mountpoints or different subtrees.

  A new message format is exposed if the user passes the
  FAN_REPORT_DETAILED_ERROR flag to fanotify_init(2).  FAN_ERROR
  messages don't have FID/DFID records.
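
  A minimal sketch of how a watcher might subscribe, assuming the
  proposed FAN_REPORT_DETAILED_ERROR and FAN_ERROR flags (neither
  exists in current kernels; the values below are mere placeholders):

  #include <err.h>
  #include <fcntl.h>
  #include <sys/fanotify.h>

  #define FAN_REPORT_DETAILED_ERROR 0x00000800  /* proposed; placeholder value */
  #define FAN_ERROR                 0x00100000  /* proposed; placeholder value */

  int watch_fs_errors(const char *mountpoint)
  {
          int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_DETAILED_ERROR,
                                 O_RDONLY);

          if (fd < 0)
                  err(1, "fanotify_init");

          /* Watch the whole filesystem backing this mountpoint. */
          if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                            FAN_ERROR, AT_FDCWD, mountpoint) < 0)
                  err(1, "fanotify_mark");

          return fd;      /* read(2) this fd for error events */
  }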

  A FAN_REPORT_DETAILED_ERROR record has the same struct
  fanotify_event_metadata header, but it is followed by one or more
  additional information records as follows:

  struct fanotify_event_error_hdr {
          struct fanotify_event_info_header hdr;
          __u32 error;
          __u64 inode;
          __u64 offset;
  };

  error is a generic VFS error number that can report generic
  conditions like EFSCORRUPT.  If hdr.len is larger than sizeof(struct
  fanotify_event_error_hdr), this structure is followed by an optional
  filesystem-specific record that further specifies the error, the
  originating object, and debug data.  This record has a generic
  header:

  struct fanotify_event_fs_error_hdr {
          struct fanotify_event_error_hdr hdr;
          __kernel_fsid_t fsid;
          __u32 fs_error;
  };

  fs_error is a filesystem-specific error code, potentially more
  detailed than fanotify_event_error_hdr.error.  Each type of
  filesystem error has its own record type, which is used to report the
  information useful for that kind of error.  For instance, an ext4
  lookup error caused by an invalid inode type read from disk produces
  the following record:

  struct fanotify_event_ext4_inode_error {
          struct fanotify_event_fs_error_hdr hdr;
          __u64 block;
          __u32 inode_type;
  };
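
  To illustrate how a watcher would consume these nested records, here
  is a sketch that assumes the proposed structures above (none of them
  is part of any released kernel ABI):

  #include <stdio.h>

  static void handle_error_record(const struct fanotify_event_error_hdr *info)
  {
          printf("error=%u inode=%llu offset=%llu\n", info->error,
                 (unsigned long long)info->inode,
                 (unsigned long long)info->offset);

          /* A hdr.len larger than the generic record means a
           * filesystem-specific record wraps it. */
          if (info->hdr.len > sizeof(*info)) {
                  const struct fanotify_event_fs_error_hdr *fs =
                          (const struct fanotify_event_fs_error_hdr *)info;

                  printf("fsid=%x:%x fs_error=%u\n",
                         fs->fsid.val[0], fs->fsid.val[1], fs->fs_error);
          }
  }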

  The separation between VFS and filesystem-specific error messages has
  the benefit of always providing some indication that an error has
  occurred, regardless of filesystem-specific support, while allowing
  capable filesystems to expose specific internal data to further
  document the issue.  This scheme leaves it to the filesystem to
  expose more meaningful information as needed.  For instance,
  multi-disk filesystems can single out the problematic disk, and
  network filesystems can expose a network error encountered while
  accessing the server.


3.1 Visibility semantics
~~~~~~~~~~~~~~~~~~~~~~~~

  Error reporting follows the same semantics as other fanotify events.
  Therefore, a watcher can request to watch the entire filesystem, a
  single mountpoint, or a subtree.


3.2 Security implications
~~~~~~~~~~~~~~~~~~~~~~~~~

  fanotify requires CAP_SYS_ADMIN.  My understanding is that this
  requirement suffices for most use cases but, according to the
  fanotify documentation, it could be relaxed without issues for the
  existing fanotify API.  For instance, watching a subtree could be
  allowed for a user who owns that subtree.


3.3 Error location
~~~~~~~~~~~~~~~~~~

  While exposing the exact line of code that triggered the notification
  ties that notification to a specific kernel version, it is important
  information for those who completely control their environment and
  kernel builds, such as cloud providers.  Therefore, this proposal
  includes a mechanism to optionally include in the notification the
  line of code where the error occurred.

  A watcher who passes the flag FAN_REPORT_ERROR_LOCATION to
  fanotify_init(2) receives an extra record for FAN_ERROR events:

  struct fanotify_event_fs_error_location {
          struct fanotify_event_info_header hdr;
          __u32 line;
          char function[];
  };

  This record identifies the place where the error occurred.  function
  is a VLA whose size extends to the end of the region delimited by
  hdr.len.  The VLA text-encodes the name of the function where the
  error occurred.
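
  A watcher could decode it with something like the following sketch
  (again assuming the hypothetical structure above):

  #include <stddef.h>
  #include <stdio.h>

  static void print_error_location(const struct fanotify_event_fs_error_location *loc)
  {
          int name_len = loc->hdr.len -
                  offsetof(struct fanotify_event_fs_error_location, function);

          /* The name may not be NUL-terminated, so bound it by hdr.len. */
          printf("%.*s:%u\n", name_len, loc->function, loc->line);
  }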

What do you think about this interface?  Would it be acceptable as part
of fanotify, or should it be a new fsnotify mode?

Regarding semantics, I believe fanotify should solve the visibility
problem of a subtree watcher not being able to see notifications from
other branches.  Do you think this would suffice?

Thanks,

-- 
Gabriel Krisman Bertazi


* Re: [RFC] Filesystem error notifications proposal
  2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
@ 2021-01-21  4:01 ` Viacheslav Dubeyko
  2021-01-21 11:44 ` Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Viacheslav Dubeyko @ 2021-01-21  4:01 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Theodore Y. Ts'o, jack, viro, amir73il, dhowells, david,
	darrick.wong, khazhy, Linux FS devel list, kernel



> On Jan 20, 2021, at 12:13 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> 
> 
> My apologies for the long email.
> 
> Please let me know your thoughts.
> 
> 1 Summary
> =========
> 
>  I'm looking for a filesystem-agnostic mechanism to report filesystem
>  errors to a monitoring tool in userspace.  I experimented first with
>  the watch_queue API but decided to move to fanotify for a few reasons.
> 

I don't quite follow the point of such a user-space tool, because anybody can take a look at the syslog. It is even possible to grep for error messages related to a particular file system. What's the point of such a tool? I can see some point in reporting file system corruption, but you are talking about any error message. Moreover, it can be non-trivial to track corruption even on the file system driver's side.

> 
> 2 Background
> ============
> 
>  I submitted a first set of patches, based on David Howells' original
>  superblock notifications patchset, that I expanded into error
>  reporting and had an example implementation for ext4.  Upstream review
>  has revealed a few design problems:
> 
>  - Including the "function:line" tuple in the notification allows the
>    uniquely identification of the error origin but it also ties the
>    decodification of the error to the source code, i.e. <function:line>
>    is expected to change between releases.
> 

The error message itself could identify the location without the need to know <function:line>. Only a specially generated UID would be a good solution, I suppose.

>  - Useful debug data (inode number, block group) have formats specific
>    to the filesystems, and my design wouldn't be expansible to
>    filesystems other than ext4.
> 

Different file system volumes associate inode IDs with completely different objects. So, if you don't have access to the same volume, this information could be useless. What's the point of sharing these details? Sometimes, tracking down an issue in a file system driver requires knowing many more details about the particular volume.

Finally, what are the concrete use cases for this user-space tool? Currently, I don't see any point in using it. The syslog is a much more productive way to debug and investigate issues in a file system driver.

Thanks,
Viacheslav Dubeyko.



* Re: [RFC] Filesystem error notifications proposal
  2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
  2021-01-21  4:01 ` Viacheslav Dubeyko
@ 2021-01-21 11:44 ` Jan Kara
  2021-01-21 13:27   ` Amir Goldstein
  2021-01-21 18:56   ` Gabriel Krisman Bertazi
  2021-01-21 22:44 ` Theodore Ts'o
  2021-02-10  0:09 ` Darrick J. Wong
  3 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2021-01-21 11:44 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: tytso, jack, viro, amir73il, dhowells, david, darrick.wong,
	khazhy, linux-fsdevel, kernel

Hello Gabriel,

On Wed 20-01-21 17:13:15, Gabriel Krisman Bertazi wrote:
> 1 Summary
> =========
> 
>   I'm looking for a filesystem-agnostic mechanism to report filesystem
>   errors to a monitoring tool in userspace.  I experimented first with
>   the watch_queue API but decided to move to fanotify for a few reasons.
> 
> 
> 2 Background
> ============
> 
>   I submitted a first set of patches, based on David Howells' original
>   superblock notifications patchset, that I expanded into error
>   reporting and had an example implementation for ext4.  Upstream review
>   has revealed a few design problems:
> 
>   - Including the "function:line" tuple in the notification allows the
>     uniquely identification of the error origin but it also ties the
>     decodification of the error to the source code, i.e. <function:line>
>     is expected to change between releases.
> 
>   - Useful debug data (inode number, block group) have formats specific
>     to the filesystems, and my design wouldn't be expansible to
>     filesystems other than ext4.
> 
>   - The implementation allowed an error string description (equivalent
>     to what would be thrown in dmesg) that is too short, as it needs to
>     fit in a single notification.
> 
>   - How the user sees the filesystem.  The original patch points to a
>     mountpoint but uses the term superblock.  This is to differentiate
>     from another mechanism in development to watch mounting operations.
> 
>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
>     notifications of objects outside of that bind mount.
> 
> 
> 2.1 Other constraints of the watch_queue API
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   watch_queue is a fairly new framework, which has one upstream user in
>   the keyring subsystem.  watch_queue is designed to submit very short
>   (max of 128 bytes) atomic notifications to userspace in a fast manner
>   through a ring buffer.  There is no current mechanism to link
>   notifications that require more than one slot and such mechanism
>   wouldn't be trivial to implement, since buffer overruns could
>   overwrite the beginning/end of a multi part notification.  In
>   addition, watch_queue requires an out-of-band overflow notification
>   mechanism, which would need to be implemented aside from the system
>   call, in a separate API.
> 
> 
> 2.2 fanotify vs watch_queue
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   watch_queue is designed for efficiency and fast submission of a large
>   number of notifications.  It doesn't require memory allocation in the
>   submission path, but with a flexibility cost, such as limited
>   notification size.  fanotify is more flexible, allows for larger
>   notifications and better filtering, but requires allocations on the
>   submission path.
> 
>   On the other hand, fanotify already has support for the visibility
>   semantics we are looking for. fsnotify allows an inode to notify only
>   its subtree, mountpoints, or the entire filesystem, depending on the
>   watcher flags, while an equivalent support would need to be written
>   from scratch for watch_queue.  fanotify also has in-band overflow
>   handling, already implemented.  Finally, since fanotify supports much
>   larger notifications, there is no need to link separate notifications,
>   preventing buffer overruns from erasing parts of a notification chain.
> 
>   fanotify is based on fsnotify, and the infrastructure for the
>   visibility semantics is mostly implemented by fsnotify itself.  It
>   would be possible to make error notifications a new mechanism on top
>   of fsnotify, without modifying fanotify, but that would require
>   exposing a very similar interface to userspace, new system calls, and
>   that doesn't seem justifiable since we can extend fanotify syscalls
>   without ABI breakage.

I agree fanotify could be used for propagating the error information from
kernel to userspace. But before we go and design how exactly the events
should look, I'd like to hear a use case (or use cases) for this
feature. Because the intended use very much determines which information we
need to propagate to userspace and what flexibility we need there.
 
> 3 Proposal
> ==========
> 
>   The error notification framework is based on fanotify instead of
>   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
>   exposed through the fanotify_mark(2) API.
> 
>   fanotify (fsnotify-based) has the infrastructure in-place to link
>   notifications happening at filesystem objects to mountpoints and to
>   filesystems, and already exposes an interface with well defined
>   semantics of how those are exposed to watchers in different
>   mountpoints or different subtrees.
> 
>   A new message format is exposed, if the user passed
>   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
>   don't have FID/DFID records.

So this depends on the use case. I can imagine that we could introduce a new
event type FAN_FS_ERROR (like we currently have FAN_MODIFY, FAN_ACCESS,
FAN_CREATE, ...). This event type would be generated when an error happens on
an fs object - we can generate it associated with a struct file, struct
dentry, or only the superblock, depending on what information is available at
the place where the fs spots the error, similarly to how we generate other
fsnotify events. But here's the use case question - is there really a need
for fine granularity of error reporting (i.e., someone watching errors only
on a particular file or directory)?

FAN_FS_ERROR could be supported in all types of fanotify notification
groups (although I'm not sure supporting normal fanotify mode really makes
sense for errors, since that requires opening the file where the error
occurs, which is likely to fail anyway) and, depending on other FAN_REPORT_
flags in the notification group, we'd generate the appropriate information
in the notification event (like FID, DFID, name, ...).

Then we can have FAN_REPORT_DETAILED_ERROR which would add more information
about the error to the event, like you write below - although identification
of the inode / fsid is IMO redundant - you can get all that information
(and more) from the FID / DFID event info entries if you want it.
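
For reference, with the existing FAN_REPORT_FID reporting that would look
roughly like this (a sketch, error handling omitted; 'metadata' points at an
event just read from the fanotify fd and 'mount_fd' is an fd on the watched
mount):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/fanotify.h>

  static void show_fid(struct fanotify_event_metadata *metadata, int mount_fd)
  {
          struct fanotify_event_info_fid *fid =
                  (struct fanotify_event_info_fid *)(metadata + 1);
          struct file_handle *fh = (struct file_handle *)fid->handle;

          printf("fsid=%x:%x handle_type=%d\n",
                 fid->fsid.val[0], fid->fsid.val[1], fh->handle_type);

          /* If the object can still be opened, it can also be fstat()ed: */
          int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
          if (fd >= 0)
                  close(fd);
  }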

>   A FAN_REPORT_DETAILED_ERROR record has the same struct
>   fanotify_event_metadata header, but it is followed by one or more
>   additional information record as follows:
> 
>   struct fanotify_event_error_hdr {
>   	struct fanotify_event_info_header hdr;
>   	__u32 error;
>         __u64 inode;
>         __u64 offset;
>   }
> 
>   error is a VFS generic error number that can notify generic conditions
>   like EFSCORRUPT. If hdr.len is larger than sizeof(struct
>   fanotify_event_error_hdr), this structure is followed by an optional
>   filesystem specific record that further specifies the error,
>   originating object and debug data. This record has a generic header:
> 
>   struct fanotify_event_fs_error_hdr {
>   	struct fanotify_event_error_hdr hdr;
>         __kernel_fsid_t fsid;
>         __u32 fs_error;
>   }
> 
>   fs_error is a filesystem specific error record, potentially more
>   detailed than fanotify_event_error.hdr.error . Each type of filesystem
>   error has its own record type, that is used to report different
>   information useful for each type of error.  For instance, an ext4
>   lookup error, caused by an invalid inode type read from disk, produces
>   the following record:
> 
>   struct fanotify_event_ext4_inode_error {
>   	struct fanotify_event_fs_error_hdr hdr;
>         __u64 block;
>         __u32 inode_type;
>   }
> 
>   The separation between VFS and filesystem-specific error messages has
>   the benefit of always providing some information that an error has
>   occurred, regardless of filesystem-specific support, while allowing
>   capable filesystems to expose specific internal data to further
>   document the issue.  This scheme leaves to the filesystem to expose
>   more meaningful information as needed.  For instance, multi-disk
>   filesystems can single out the problematic disk, network filesystems
>   can expose a network error while accessing the server.

And here I suspect we are already overengineering it and I'd really like to
see the use cases that warrant all the extensible information. Not just "it
might be useful for somebody" kind of thing but really concrete examples.
Adding new info is easy, removing it is impossible, so we have to be really
careful when adding it... Also I'm opposed to having fs-dependent event
information. It will either get used and then we'll experience an explosion
of "this peculiar filesystem needs this additional bit of information" and
thus of configuration options, or it will fade into obscurity with only a
few filesystems / tools using it and then we have a mostly pointless
maintenance burden.

> 3.2 security implications
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   fanotify requires CAP_SYS_ADMIN.  My understanding is this requirement
>   suffices for most use cases but, according to fanotify documentation,
>   it could be relaxed without issues for the existing fanotify API.  For
>   instance, watching a subtree could be a allowed for a user who owns
>   that subtree.

Note that both 'subtree' watches and 'unprivileged' watches are not yet
supported by fanotify. And it's uncertain in which form they'll get merged.
I don't think your proposal really relies on them so everything is fine but
I'd like to set the expectations properly.

> 3.3 error location
> ~~~~~~~~~~~~~~~~~~
> 
>   While exposing the exact line of code that triggered the notification
>   ties that notification to a specific kernel version, it is an
>   important information for those who completely control their
>   environment and kernel builds, such as cloud providers.  Therefore,
>   this proposal includes a mechanism to optionally include in the
>   notification the line of code where the error occurred
> 
>   A watcher who passes the flag FAN_REPORT_ERROR_LOCATION to
>   fanotify_init(2) receives an extra record for FAN_ERROR events:
> 
>   struct fanotify_event_fs_error_location {
>   	struct fanotify_event_info_header hdr;
>         u32 line;
>         char function[];
>   }
> 
>   This record identifies the place where the error occured.  function is
>   a VLA whose size extend to the end of the region delimited by hdr.len.
>   This VLA text-encodes the function name where the error occurred.

Again, here I'd be interested in the use cases. We can always implement
things without this feature and later add it when there are real users
with clear requirements...

> What do you think about his interface?  Would it be acceptable as part
> of fanotify, or should it be a new fsnotify mode?

I think expanding fanotify makes more sense.

> Regarding semantics, I believe fanotify should solve the visibility
> problem for a subtree watcher not being able to see other branches
> notifications.  Do you think this would suffice?

Currently fanotify requires CAP_SYS_ADMIN as you noted above so visibility
is a non-issue (only a possible performance hurdle due to too many events,
if anything). If we add support to fanotify for less privileged watchers,
then yes, we'll have to solve visibility for other types of events anyway;
error reporting is not special in that regard...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC] Filesystem error notifications proposal
  2021-01-21 11:44 ` Jan Kara
@ 2021-01-21 13:27   ` Amir Goldstein
  2021-01-21 18:56   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 28+ messages in thread
From: Amir Goldstein @ 2021-01-21 13:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Gabriel Krisman Bertazi, Theodore Tso, Jan Kara, Al Viro,
	David Howells, Dave Chinner, Darrick J. Wong, khazhy,
	linux-fsdevel, kernel, Jeff Layton

On Thu, Jan 21, 2021 at 1:44 PM Jan Kara <jack@suse.cz> wrote:
>
> Hello Gabriel,
>
> On Wed 20-01-21 17:13:15, Gabriel Krisman Bertazi wrote:
> > 1 Summary
> > =========
> >
> >   I'm looking for a filesystem-agnostic mechanism to report filesystem
> >   errors to a monitoring tool in userspace.  I experimented first with
> >   the watch_queue API but decided to move to fanotify for a few reasons.
> >
> >
> > 2 Background
> > ============
> >
> >   I submitted a first set of patches, based on David Howells' original
> >   superblock notifications patchset, that I expanded into error
> >   reporting and had an example implementation for ext4.  Upstream review
> >   has revealed a few design problems:
> >
> >   - Including the "function:line" tuple in the notification allows the
> >     uniquely identification of the error origin but it also ties the
> >     decodification of the error to the source code, i.e. <function:line>
> >     is expected to change between releases.
> >
> >   - Useful debug data (inode number, block group) have formats specific
> >     to the filesystems, and my design wouldn't be expansible to
> >     filesystems other than ext4.
> >
> >   - The implementation allowed an error string description (equivalent
> >     to what would be thrown in dmesg) that is too short, as it needs to
> >     fit in a single notification.
> >
> >   - How the user sees the filesystem.  The original patch points to a
> >     mountpoint but uses the term superblock.  This is to differentiate
> >     from another mechanism in development to watch mounting operations.
> >
> >   - Visibility of objects.  A bind mount of a subtree shouldn't receive
> >     notifications of objects outside of that bind mount.
> >
> >
> > 2.1 Other constraints of the watch_queue API
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   watch_queue is a fairly new framework, which has one upstream user in
> >   the keyring subsystem.  watch_queue is designed to submit very short
> >   (max of 128 bytes) atomic notifications to userspace in a fast manner
> >   through a ring buffer.  There is no current mechanism to link
> >   notifications that require more than one slot and such mechanism
> >   wouldn't be trivial to implement, since buffer overruns could
> >   overwrite the beginning/end of a multi part notification.  In
> >   addition, watch_queue requires an out-of-band overflow notification
> >   mechanism, which would need to be implemented aside from the system
> >   call, in a separate API.
> >
> >
> > 2.2 fanotify vs watch_queue
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   watch_queue is designed for efficiency and fast submission of a large
> >   number of notifications.  It doesn't require memory allocation in the
> >   submission path, but with a flexibility cost, such as limited
> >   notification size.  fanotify is more flexible, allows for larger
> >   notifications and better filtering, but requires allocations on the
> >   submission path.
> >
> >   On the other hand, fanotify already has support for the visibility
> >   semantics we are looking for. fsnotify allows an inode to notify only
> >   its subtree, mountpoints, or the entire filesystem, depending on the
> >   watcher flags, while an equivalent support would need to be written
> >   from scratch for watch_queue.  fanotify also has in-band overflow
> >   handling, already implemented.  Finally, since fanotify supports much
> >   larger notifications, there is no need to link separate notifications,
> >   preventing buffer overruns from erasing parts of a notification chain.
> >
> >   fanotify is based on fsnotify, and the infrastructure for the
> >   visibility semantics is mostly implemented by fsnotify itself.  It
> >   would be possible to make error notifications a new mechanism on top
> >   of fsnotify, without modifying fanotify, but that would require
> >   exposing a very similar interface to userspace, new system calls, and
> >   that doesn't seem justifiable since we can extend fanotify syscalls
> >   without ABI breakage.
>
> I agree fanotify could be used for propagating the error information from
> kernel to userspace. But before we go to design how exactly the events
> should look like, I'd like to hear a usecase (or usecases) for this
> feature. Because the intended use very much determines which information we
> need to propagate to userspace and what flexibility we need there.
>

I agree with Jan both on using fanotify to report fs errors and on not
overengineering the error details without specific use cases.

FWIW, here is a specific use case - reporting writeback errors.
Writeback errors are recorded for specific files (f_mapping->wb_err)
and on the sb (sb->s_wb_err), and they hold the specific error code
only for the last reported error. A new error type overwrites an
existing error type. Being able to queue writeback errors in an events
queue could be useful for that reason alone.

But notifications for writeback errors are useful for more reasons.
Currently, the only way for users to probe for writeback errors is by
issuing fsync() or syncfs() respectively, so polling for writeback errors
this way is out of the question even if polling would have been considered
sufficient for an application.

Of course we could add an API to poll for writeback errors without issuing
writeback, but a notification API is much better, because you can use it for
polling as well (poll the queue) and because fsnotify merges events in the
queue, so multiple writeback events are not likely to cause queue overflow.
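
To make the current situation concrete, this sketch is roughly all an
application can do today (and syncfs() only started reporting per-sb
writeback errors with Jeff's recent errseq work):

  #include <errno.h>
  #include <unistd.h>

  /* Issue writeback and check the result; this is both the probe and
   * the side effect we would like to avoid. */
  static int probe_wb_error(int fd)
  {
          if (fsync(fd) < 0)
                  return -errno;  /* e.g. -EIO or -ENOSPC */
          return 0;
  }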

> > 3 Proposal
> > ==========
> >
> >   The error notification framework is based on fanotify instead of
> >   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
> >   exposed through the fanotify_mark(2) API.
> >
> >   fanotify (fsnotify-based) has the infrastructure in-place to link
> >   notifications happening at filesystem objects to mountpoints and to
> >   filesystems, and already exposes an interface with well defined
> >   semantics of how those are exposed to watchers in different
> >   mountpoints or different subtrees.
> >
> >   A new message format is exposed, if the user passed
> >   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
> >   don't have FID/DFID records.
>
> So this depends on the use case. I can imagine that we could introduce new
> event type FAN_FS_ERROR (like we currently have FAN_MODIFY, FAN_ACCESS,
> FAN_CREATE, ...). This event type would be generated when error happens on
> fs object - we can generate it associated with struct file, struct dentry,
> or only superblock, depending on what information is available at the place
> where fs spots the error. Similarly to how we generate other fsnotify
> events. But here's the usecase question - is there really need for fine
> granularity of error reporting (i.e., someone watching errors only on a
> particular file, or directory)?
>

For writeback errors, marking a specific file for fs error events may
make sense.
A database application for example, may care about writeback errors to the
db file and not about writeback errors elsewhere in the system, given that
periodic fsync() is out of the question.

> FAN_FS_ERROR could be supported in all types of fanotify notification
> groups (although I'm not sure supporting normal fanotify mode really makes
> sense for errors since that requires opening the file where error occurs
> which is likely to fail anyway) and depending on other FAN_REPORT_ flags in
> the notification group we'd generate appropriate information the
> notification event (like FID, DFID, name, ...)
>
> Then we can have FAN_REPORT_DETAILED_ERROR which would add to event more
> information about the error like you write below - although identification
> of the inode / fsid is IMO redundant - you can get all that information
> (and more) from FID / DFID event info entries if you want it.
>

I second that.
Gabriel, may I also suggest that you take a look at XFS's corruption
reporters, xfs_{buf,inode}_verifier_error().
The common information beyond the error type and inode that is propagated
to the reporters is an xfs_failaddr_t.
This is a much more compact way to report the function:line information.
WRT the security implications of reporting kernel addresses, as long as the
watcher is required to have CAP_SYS_ADMIN it is not an issue, and
function:line resolving can happen in userspace.
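
For example, resolving such an address in userspace could look roughly like
the sketch below, which only finds the containing symbol via /proc/kallsyms
(exact file:line would need the matching vmlinux plus addr2line, and
kptr_restrict must permit reading the addresses):

  #include <stdio.h>

  static int resolve_failaddr(unsigned long long addr, char *name, size_t len)
  {
          FILE *f = fopen("/proc/kallsyms", "r");
          unsigned long long sym, best = 0;
          char line[256], cur[128], type;
          int found = 0;

          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f)) {
                  if (sscanf(line, "%llx %c %127s", &sym, &type, cur) != 3)
                          continue;
                  /* Keep the closest symbol at or below the address. */
                  if (sym <= addr && sym >= best) {
                          best = sym;
                          snprintf(name, len, "%s+%#llx", cur, addr - sym);
                          found = 1;
                  }
          }
          fclose(f);
          return found ? 0 : -1;
  }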

Thanks,
Amir.


* Re: [RFC] Filesystem error notifications proposal
  2021-01-21 11:44 ` Jan Kara
  2021-01-21 13:27   ` Amir Goldstein
@ 2021-01-21 18:56   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-01-21 18:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, jack, viro, amir73il, dhowells, david, darrick.wong,
	khazhy, linux-fsdevel, kernel

Jan Kara <jack@suse.cz> writes:
> On Wed 20-01-21 17:13:15, Gabriel Krisman Bertazi wrote:
>>   fanotify is based on fsnotify, and the infrastructure for the
>>   visibility semantics is mostly implemented by fsnotify itself.  It
>>   would be possible to make error notifications a new mechanism on top
>>   of fsnotify, without modifying fanotify, but that would require
>>   exposing a very similar interface to userspace, new system calls, and
>>   that doesn't seem justifiable since we can extend fanotify syscalls
>>   without ABI breakage.
>
> I agree fanotify could be used for propagating the error information from
> kernel to userspace. But before we go to design how exactly the events
> should look like, I'd like to hear a usecase (or usecases) for this
> feature. Because the intended use very much determines which information we
> need to propagate to userspace and what flexibility we need there.

Hi Jan,

Thanks for the review.

My user is a cloud provider who wants to monitor several machines,
potentially with several volumes each, for metadata and data corruption
that might require attention from a sysadmin.  They would run a watcher
for each filesystem of each machine and consolidate the information.
Right now, they have a solution in-house to monitor ext4, which
basically exports to userspace any kind of condition that would trigger
one of the ext4_error_* functions.

We need to monitor a bit more than just writeback errors, but that is a
big part of the functionality we want.  We would be interested in
finding out about inode corruption, checksum failures, and read errors
as well.

Khazhy, in Cc, is more familiar with the specific requirements, and
might be able to provide more information on their usage of the existing
solution.

>> 3 Proposal
>> ==========
>> 
>>   The error notification framework is based on fanotify instead of
>>   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
>>   exposed through the fanotify_mark(2) API.
>> 
>>   fanotify (fsnotify-based) has the infrastructure in-place to link
>>   notifications happening at filesystem objects to mountpoints and to
>>   filesystems, and already exposes an interface with well defined
>>   semantics of how those are exposed to watchers in different
>>   mountpoints or different subtrees.
>> 
>>   A new message format is exposed, if the user passed
>>   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
>>   don't have FID/DFID records.
>
> So this depends on the use case. I can imagine that we could introduce new
> event type FAN_FS_ERROR (like we currently have FAN_MODIFY, FAN_ACCESS,
> FAN_CREATE, ...). This event type would be generated when error happens on
> fs object - we can generate it associated with struct file, struct dentry,
> or only superblock, depending on what information is available at the place
> where fs spots the error. Similarly to how we generate other fsnotify
> events. But here's the usecase question - is there really need for fine
> granularity of error reporting (i.e., someone watching errors only on a
> particular file, or directory)?

My use-case requires only watching an entire filesystem, and not
specific files.  I can imagine a use-case of someone watching a specific
mount point, like an application running inside a container being
notified only of errors in files/directories within that container.

>
> Then we can have FAN_REPORT_DETAILED_ERROR which would add to event more
> information about the error like you write below - although identification
> of the inode / fsid is IMO redundant - you can get all that information
> (and more) from FID / DFID event info entries if you want it.

When FID provides a handle, I'd need to get information on that handle
without opening it, as the file might be corrupted.  Would we need an
interface to retrieve inode number by handle without opening the file,
i.e. stat_by_handle?
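
(Speculating a bit: for ext4, the exported handle types are
FILEID_INO32_GEN (1) and FILEID_INO32_GEN_PARENT (2), where the first two
32-bit words of f_handle are the inode number and generation, so a watcher
could read them without opening the possibly-corrupt file.  That layout is
filesystem-specific, though, so it wouldn't replace a generic
stat_by_handle.)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>

  static void print_ino_from_handle(const struct file_handle *fh)
  {
          const uint32_t *w = (const void *)fh->f_handle;

          if (fh->handle_type == 1 || fh->handle_type == 2)
                  printf("ino=%u gen=%u\n", w[0], w[1]);
  }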

Thanks,

-- 
Gabriel Krisman Bertazi


* Re: [RFC] Filesystem error notifications proposal
  2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
  2021-01-21  4:01 ` Viacheslav Dubeyko
  2021-01-21 11:44 ` Jan Kara
@ 2021-01-21 22:44 ` Theodore Ts'o
  2021-01-22  0:44   ` Gabriel Krisman Bertazi
  2021-02-10  0:09 ` Darrick J. Wong
  3 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2021-01-21 22:44 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

On Wed, Jan 20, 2021 at 05:13:15PM -0300, Gabriel Krisman Bertazi wrote:
> 
>   I submitted a first set of patches, based on David Howells' original
>   superblock notifications patchset, that I expanded into error
>   reporting and had an example implementation for ext4.  Upstream review
>   has revealed a few design problems:
> 
>   - Including the "function:line" tuple in the notification allows the
>     uniquely identification of the error origin but it also ties the
>     decodification of the error to the source code, i.e. <function:line>
>     is expected to change between releases.

That's true, but it's better than printing the EIP, which is even
harder to decode.  Having the line number is better than the
alternative, which is either (a) nothing, or (b) the EIP.

>   - Useful debug data (inode number, block group) have formats specific
>     to the filesystems, and my design wouldn't be expansible to
>     filesystems other than ext4.

Inode numbers are fairly universal; block groups are certainly
ext2/3/4 specific.  I'll note that in the original ext4_error messages
to syslog and the internal-to-Google netlink notification system, they
were sent as a text string.  We don't necessarily need to send them
to userspace.

>   - The implementation allowed an error string description (equivalent
>     to what would be thrown in dmesg) that is too short, as it needs to
>     fit in a single notification.

... or we need to have some kind of continuation system where the text
string is broken across multiple notifications, with some kind of
notification id number.

>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
>     notifications of objects outside of that bind mount.

So this is scope creep beyond the original goals of the project.  I
understand that there is a desire by folks in the community to support
various containerization use cases where only a portion of the file
system is visible in a container due to a bind mount.

However, we need to recall that ext4_error messages can originate
fairly deep inside the ext4 file system.  They indicate major problems
with the file system --- the kind that might require drastic system
administration reaction.  As such, at the point where we discover a
problem with an inode, that part of the ext4 code might not have
access to the pathname that was used to originally access the inode.

We might be inside a workqueue handler, for example, so we might not
even be running in the same process that had been containerized.  We
might be holding various file system mutexes or even in some cases a
spinlock.

What follows from that is that it's not really going to be possible to
filter notifications to a subtree.  Furthermore, if fanotify requires
memory allocation, that's going to be problematic; we may not be in a
context where memory allocation is possible.  So for that reason, it's
not clear to me that fanotify is going to be a good match for this use
case.

Cheers,

						- Ted


* Re: [RFC] Filesystem error notifications proposal
  2021-01-21 22:44 ` Theodore Ts'o
@ 2021-01-22  0:44   ` Gabriel Krisman Bertazi
  2021-01-22  7:36     ` Amir Goldstein
  2021-01-28 22:28     ` Theodore Ts'o
  0 siblings, 2 replies; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-01-22  0:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Wed, Jan 20, 2021 at 05:13:15PM -0300, Gabriel Krisman Bertazi wrote:
>>   - The implementation allowed an error string description (equivalent
>>     to what would be thrown in dmesg) that is too short, as it needs to
>>     fit in a single notification.
>
> ... or we need to have some kind of contnuation system where the text
> string is broken across multiple notification, with some kind of
> notification id umber.

Hi Ted,

This is something I'd like to avoid.  A notification broken into several
parts is subject to being partially overwritten by another notification
before it is fully read by userspace.  We can work around it, of course,
but the added complexity would be unnecessary if only we had a large
enough notification - or one with a flexible size.  Consider that we
have the entire context of the notification at the moment it happens, so
we can always know the size we need before doing any allocations.

>>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
>>     notifications of objects outside of that bind mount.
>
> So this is scope creep beyond the original goals of the project.  I
> understand that there is a desire by folks in the community to support
> various containerization use cases where only only a portion of file
> system is visible in a container due to a bind mount.
>
> However, we need to recall that ext4_error messages can originate in
> fairly deep inside the ext4 file system.  They indicate major problems
> with the file system --- the kind that might require drastic system
> administration reaction.  As such, at the point where we discover a
> problem with an inode, that part of the ext4 code might not have
> access to the pathname that was used to originally access the inode.
>
> We might be inside a workqueue handler, for example, so we might not
> even running in the same process that had been containerized.  We
> might be holding various file system mutexes or even in some cases a
> spinlock.

I see.  But the visibility is of a watcher who can see an object, not
the application that caused the error.  The fact that the error happened
outside the context of the containerized process should not be a problem
here, right?  As long as the watcher is watching a mountpoint that can
reach the failed inode, that inode should be accessible to the watcher
and it should receive a notification. No?

I don't mean to sound dense.  I understand the memory allocation &
locking argument.  But I don't understand why an error identified in a
workqueue context would be different from an error in a process
context.

> What follows from that is that it's not really going to be possible to
> filter notifications to a subtree.  Furthermore, if fanotify requires
> memory allocation, that's going to be problematic, we may not be in a
> context where memory allocation is possible.  So for that reason, it's
> not clear to me that fanotify is going to be a good match for this use
> case.

I see.  Do you think we would be able to refactor the error handling
code, such that we can drop spinlocks and do some non-reclaiming
allocations at least?  I noticed Google's code seems to survive doing
some allocations with GFP_ATOMIC in their internal-to-Google netlink
notification system, and even GFP_KERNEL in some other scenarios.  I
might not be seeing the full picture, though.

I think changing fanotify to avoid allocations in the submission path
might be just intrusive enough for the patch to be rejected by Jan.  If
we cannot do allocations at all, I would suggest moving this feature out
of fanotify, but sticking with fsnotify, for its ability to link
inodes/mountpoints/superblocks.

-- 
Gabriel Krisman Bertazi


* Re: [RFC] Filesystem error notifications proposal
  2021-01-22  0:44   ` Gabriel Krisman Bertazi
@ 2021-01-22  7:36     ` Amir Goldstein
  2021-02-02 20:51       ` Gabriel Krisman Bertazi
  2021-01-28 22:28     ` Theodore Ts'o
  1 sibling, 1 reply; 28+ messages in thread
From: Amir Goldstein @ 2021-01-22  7:36 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Theodore Ts'o, Jan Kara, Al Viro, David Howells,
	Dave Chinner, Darrick J. Wong, khazhy, linux-fsdevel, kernel

> >>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
> >>     notifications of objects outside of that bind mount.
> >
> > So this is scope creep beyond the original goals of the project.  I
> > understand that there is a desire by folks in the community to support
> > various containerization use cases where only only a portion of file
> > system is visible in a container due to a bind mount.
> >
> > However, we need to recall that ext4_error messages can originate in
> > fairly deep inside the ext4 file system.  They indicate major problems
> > with the file system --- the kind that might require drastic system
> > administration reaction.  As such, at the point where we discover a
> > problem with an inode, that part of the ext4 code might not have
> > access to the pathname that was used to originally access the inode.
> >
> > We might be inside a workqueue handler, for example, so we might not
> > even running in the same process that had been containerized.  We
> > might be holding various file system mutexes or even in some cases a
> > spinlock.
>
> I see.  But the visibility is of a watcher who can see an object, not
> the application that caused the error.  The fact that the error happened
> outside the context of the containerized process should not be a problem
> here, right?  As long as the watcher is watching a mountpoint that can
> reach the failed inode, that inode should be accessible to the watcher
> and it should receive a notification. No?
>

No, because the mount/path is usually not available in file system
internal context. Even in vfs, many operations have no mnt context,
which is the reason that some fanotify event types are available for
FAN_MARK_FILESYSTEM and not for FAN_MARK_MOUNT.

> > What follows from that is that it's not really going to be possible to
> > filter notifications to a subtree.  Furthermore, if fanotify requires
> > memory allocation, that's going to be problematic, we may not be in a
> > context where memory allocation is possible.  So for that reason, it's
> > not clear to me that fanotify is going to be a good match for this use
> > case.
>
> I see.  Do you think we would be able to refactor the error handling
> code, such that we can drop spinlocks and do some non-reclaiming
> allocations at least?  I noticed Google's code seems to survive doing
> some allocations with GFP_ATOMIC in their internal-to-Google netlink
> notification system, and even GFP_KERNEL on some other scenarios.  I
> might not be seeing the full picture though.
>
> I think changing fanotify to avoid allocations in the submission path
> might be just intrusive enough for the patch to be rejected by Jan. If
> we cannot do allocations at all, I would suggest I move this feature out
> of fanotify, but stick with fsnotify, for its ability to link
> inodes/mntpoints/superblock.
>

Gabriel,

I understand the use case of monitoring a fleet of machines to know
when some machine in the fleet has a corruption.
I don't understand why the monitoring messages need to carry all the
debugging info of that corruption.

For the corruption detection use case, it seems more logical to configure
machines in the fleet with errors=remount-ro, and then you'd only ever
need to signal that a corruption was detected on a filesystem, and the
monitoring agent can access that machine to get more debugging
info from dmesg or from the filesystem's recorded first/last error.

You may be able to avoid allocation in fanotify if a group keeps
a pre-allocated "emergency" event, but you won't be able to
avoid taking locks in fanotify. Even fsnotify takes srcu_read_lock
and spin_lock in some cases, so you'd have to be careful with the
context you call fsnotify from.

If you agree with my observation that the filesystem can abort itself
on corruption and keep the details internally, then the notification
of a corrupted state can always be made from a safe context
sometime after the corruption was detected, regardless of the
context in which ext4_error() was called.

IOW, if the real world use cases you have are reporting
writeback errors and signalling that the filesystem entered a corrupted
state, then fanotify might be the right tool for the job and you should
have no need for variable size detailed event info.
If you want a netoops equivalent reporting infrastructure, then
you should probably use a different tool.

Thanks,
Amir.


* Re: [RFC] Filesystem error notifications proposal
  2021-01-22  0:44   ` Gabriel Krisman Bertazi
  2021-01-22  7:36     ` Amir Goldstein
@ 2021-01-28 22:28     ` Theodore Ts'o
  2021-02-02 20:26       ` Gabriel Krisman Bertazi
  1 sibling, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2021-01-28 22:28 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

On Thu, Jan 21, 2021 at 09:44:35PM -0300, Gabriel Krisman Bertazi wrote:
> > We might be inside a workqueue handler, for example, so we might not
> > even running in the same process that had been containerized.  We
> > might be holding various file system mutexes or even in some cases a
> > spinlock.
> 
> I see.  But the visibility is of a watcher who can see an object, not
> the application that caused the error.  The fact that the error happened
> outside the context of the containerized process should not be a problem
> here, right?  As long as the watcher is watching a mountpoint that can
> reach the failed inode, that inode should be accessible to the watcher
> and it should receive a notification. No?

Right.  But the corruption might be an attempt to free a block which
is already freed.  The corruption is in a block allocation bitmap; and
at the point where we notice this (in the journal commit callback
function), ext4 has no idea what inode the block was originally
associated with, let *alone* the directory pathname that the inode was
associated with.  So how do we figure out whether or not the watcher
"can see the object"?

In other words, there is a large set of file system corruptions where
it is impossible to map it to the question "will it be visible to the
watcher"?

> > What follows from that is that it's not really going to be possible to
> > filter notifications to a subtree.  Furthermore, if fanotify requires
> > memory allocation, that's going to be problematic, we may not be in a
> > context where memory allocation is possible.  So for that reason, it's
> > not clear to me that fanotify is going to be a good match for this use
> > case.
> 
> I see.  Do you think we would be able to refactor the error handling
> code, such that we can drop spinlocks and do some non-reclaiming
> allocations at least?  I noticed Google's code seems to survive doing
> some allocations with GFP_ATOMIC in their internal-to-Google netlink
> notification system, and even GFP_KERNEL on some other scenarios.  I
> might not be seeing the full picture though.

It would be ***hard***.  The problem is that the spinlocks may be held
in functions higher up in the callchain from where the error was
detected.  So what you're proposing would involve analyzing and
potentially refactoring *all* of the call sites for the ext4_error*()
functions.

> I think changing fanotify to avoid allocations in the submission path
> might be just intrusive enough for the patch to be rejected by Jan. If
> we cannot do allocations at all, I would suggest I move this feature out
> of fanotify, but stick with fsnotify, for its ability to link
> inodes/mntpoints/superblock.

At least for Google's use case, what we really need to know is that
there has been some kind of file system corruption.  It would be
*nice* if the full ext4 error message were sent to userspace via
fsnotify, but for our specific use case, if we lose parts of the
notification because the file system is so corrupted that fsnotify is
getting flooded with problems, it really isn't the end of the world.
So long as we know which block device the file system error was
associated with, that's actually the most important thing that we need.

In the worst case, we can always ssh to the machine and grab the logs
from /var/log/messages.  And realistically, if the file system is so
sick that we're getting flooded with a gazillion errors, some of the
messages are going to get lost whether they are sent via syslog or the
serial console --- or fsnotify.  So that's no worse than what we all
have today.  If we can get the *first* error, that's actually going to
be the most useful one, anyway.

Cheers,

					- Ted


* Re: [RFC] Filesystem error notifications proposal
  2021-01-28 22:28     ` Theodore Ts'o
@ 2021-02-02 20:26       ` Gabriel Krisman Bertazi
  2021-02-02 22:34         ` Theodore Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-02-02 20:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Thu, Jan 21, 2021 at 09:44:35PM -0300, Gabriel Krisman Bertazi wrote:
>> > We might be inside a workqueue handler, for example, so we might not
>> > even running in the same process that had been containerized.  We
>> > might be holding various file system mutexes or even in some cases a
>> > spinlock.
>> 
>> I see.  But the visibility is of a watcher who can see an object, not
>> the application that caused the error.  The fact that the error happened
>> outside the context of the containerized process should not be a problem
>> here, right?  As long as the watcher is watching a mountpoint that can
>> reach the failed inode, that inode should be accessible to the watcher
>> and it should receive a notification. No?
>
> Right.  But the corruption might be an attempt to free a block which
> is already freed.  The corruption is in a block allocation bitmap; and
> at the point where we notice this (in the journal commit callback
> function), ext4 has no idea what inode the block was originally
> associated with, let *alone* the directory pathname that the inode was
> associated with.  So how do we figure out whether or not the watcher
> "can see the object"?
>
> In other words, there is a large set of file system corruptions where
> it is impossible to map it to the question "will it be visible to the
> watcher"?

Thanks for the explanation.  That makes sense to me.  For corruptions
where it is impossible to map to a mountpoint, I thought they could be
considered global filesystem errors, being exposed only to someone
watching the entire filesystem (like FAN_MARK_FILESYSTEM).

But, as you mentioned regarding the google use case, the entire idea of
watching a subtree is a bit beyond the scope of my use-case, and was
only added given the feedback on the previous proposal of this feature.
While nice to have, I don't have the need to watch different mountpoints
for errors, only the entire filesystem.

There was a concern raised against my original submission, which did
superblock watching, due to the fact that a superblock is an internal
kernel structure that must not be visible to userspace.  It was
suggested we should be speaking in terms of paths and mountpoints.  But,
considering the existence of FAN_MARK_FILESYSTEM, I think it is the
exact same object I was looking for when proposing the watch_sb syscall.

>
>> > What follows from that is that it's not really going to be possible to
>> > filter notifications to a subtree.  Furthermore, if fanotify requires
>> > memory allocation, that's going to be problematic, we may not be in a
>> > context where memory allocation is possible.  So for that reason, it's
>> > not clear to me that fanotify is going to be a good match for this use
>> > case.
>> 
>> I see.  Do you think we would be able to refactor the error handling
>> code, such that we can drop spinlocks and do some non-reclaiming
>> allocations at least?  I noticed Google's code seems to survive doing
>> some allocations with GFP_ATOMIC in their internal-to-Google netlink
>> notification system, and even GFP_KERNEL on some other scenarios.  I
>> might not be seeing the full picture though.
>
> It would be ***hard***.  The problem is that the spinlocks may be held
> in functions higher up in the callchain from where the error was
> detected.  So what you're proposing would involve analyzing and
> potentially refactoring *all* of the call sites for the ext4_error*()
> functions.
>
>> I think changing fanotify to avoid allocations in the submission path
>> might be just intrusive enough for the patch to be rejected by Jan. If
>> we cannot do allocations at all, I would suggest I move this feature out
>> of fanotify, but stick with fsnotify, for its ability to link
>> inodes/mntpoints/superblock.
>
> At least for Google's use case, what we really need to know is the
> there has been some kind of file system corruption.  It would be
> *nice* if the full ext4 error message is sent to userspace via
> fsnotify, but for our specific use case, if we lose parts of the
> notification because the file system is so corrupted that fsnotify is
> getting flooded with problems, it really isn't end of the world.  So
> long as we know which block device the file system error was
> associated with, that's actually the most important thing that we need.
>
> In the worst case, we can always ssh to machine and grab the logs from
> /var/log/messages.  And realistically, if the file system is so sick
> that we're getting flooding with gazillion of errors, some of the
> messages are going to get lost whether they are sent via syslog or the
> serial console --- or fsnotify.  So that's no worse than what we all
> have today.  If we can get the *first* error, that's actually going to
> be the most useful one, anyway.
>

The main use case is, as you said, corruption detection with enough
information to allow us to trigger automatic recovery and data rebuilding
tools.  I understand now I can drop most of the debug info, as you
mentioned.  In this sense, the feature looks more like netoops.

But there are other uses that interest us, like pattern analysis of
error locations, such that we can detect and perhaps predict errors.
One idea that was mentioned is that if an error occurs frequently enough
in a specific function, there might be a bug in that function.  This is
one of the reasons we are pushing to include function:line in the error
message.

I feel I should go back to something closer to the previous proposal
then, either in the shape of a watch_sb or a new fsnotify mechanism (to
avoid the allocation problems) which only tracks the entire
filesystem.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-01-22  7:36     ` Amir Goldstein
@ 2021-02-02 20:51       ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-02-02 20:51 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Ts'o, Jan Kara, Al Viro, David Howells,
	Dave Chinner, Darrick J. Wong, khazhy, linux-fsdevel, kernel

Amir Goldstein <amir73il@gmail.com> writes:

>> I see.  But the visibility is of a watcher who can see an object, not
>> the application that caused the error.  The fact that the error happened
>> outside the context of the containerized process should not be a problem
>> here, right?  As long as the watcher is watching a mountpoint that can
>> reach the failed inode, that inode should be accessible to the watcher
>> and it should receive a notification. No?
>>
>
> No, because the mount/path is usually not available in file system
> internal context. Even in vfs, many operations have no mnt context,
> which is the reason that some fanotify event types are available for
> FAN_MARK_FILESYSTEM and not for FAN_MARK_MOUNT.

Hi Amir, thanks for the explanation.

> I understand the use case of monitoring a fleet of machines to know
> when some machine in the fleet has a corruption.
> I don't understand why the monitoring messages need to carry all the
> debugging info of that corruption.
>
> For corruption detection use case, it seems more logical to configure
> machines in the fleet to errors=remount-ro and then you'd only ever
> need to signal that a corruption was detected on a filesystem and the
> monitoring agent can access that machine to get more debugging
> info from dmesg or from filesystem recorded first/last error.

The main use-case, as Ted mentioned, is corruption detection across a
bunch of machines and, while allowing them to continue to operate if
possible, scheduling the execution of repair tasks and/or data rebuilding
of specific files.  In fact, you are right, we don't need to provide that
much debug information, but the ext4 error message, for instance, would
be useful.  This is more similar to my previous RFC at
https://lwn.net/Articles/839310/

There are other use cases requiring us to provide some more information, in
particular the place where the error was raised in the code and the type
of error, for pattern analysis. So just reporting corruption via sysfs,
for instance, wouldn't suffice.

> You may be able to avoid allocation in fanotify if a group keeps
> a pre-allocated "emergency" event, but you won't be able to
> avoid taking locks in fanotify. Even fsnotify takes srcu_read_lock
> and spin_lock in some cases, so you'd have to be carefull with the
> context you call fsnotify from.
>
> If you agree with my observation that filesystem can abort itself
> on corruption and keep the details internally, then the notification
> of a corrupted state can always be made from a safe context
> sometime after the corruption was detected, regardless of the
> context in which ext4_error() was called.
>
> IOW, if the real world use cases you have are reporting
> writeback errors and signalling that the filesystem entered a corrupted
> state, then fanotify might be the right tool for the job and you should
> have no need for variable size detailed event info.
> If you want a netoops equivalent reporting infrastructure, then
> you should probably use a different tool.

The main reason I was looking at fanotify was the ability to watch
different mountpoints and objects without watching the entire
filesystem.  This was a requirement raised against my previous
submission linked above, which provided only a mechanism based on
watch_queue to watch the entire filesystem.  If we agree to no longer
watch specific subtrees, I think it makes sense to revert to the
previous proposal and drop fanotify altogether for this use case.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-02 20:26       ` Gabriel Krisman Bertazi
@ 2021-02-02 22:34         ` Theodore Ts'o
  2021-02-08 18:49           ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2021-02-02 22:34 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> 
> Thanks for the explanation.  That makes sense to me.  For corruptions
> where it is impossible to map to a mountpoint, I thought they could be
> considered global filesystem errors, being exposed only to someone
> watching the entire filesystem (like FAN_MARK_FILESYSTEM).

At least for ext4, there are only 3 ext4_error_*() call sites that we
could map to a subtree without having to make changes to the call points:

% grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
3
% grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
79
% grep -i ext4_error\( fs/ext4/*.c  | wc -l
42

So in practice, unless we want to make a lot of changes to ext4, most
of them will be global file system errors....

> But, as you mentioned regarding the google use case, the entire idea of
> watching a subtree is a bit beyond the scope of my use-case, and was
> only added given the feedback on the previous proposal of this feature.
> While nice to have, I don't have the need to watch different mountpoints
> for errors, only the entire filesystem.

I suspect that for most use cases, the most interesting thing is the
first error.  We already record this in the ext4 superblock, because
unfortunately, I can't guarantee that system administrators have
correctly configured their system logs, so when handling upstream bug
reports, I can just ask them to run dumpe2fs -h on the file system:

FS Error count:           2
First error time:         Tue Feb  2 16:27:42 2021
First error function:     ext4_lookup
First error line #:       1704
First error inode #:      12
First error err:          EFSCORRUPTED
Last error time:          Tue Feb  2 16:27:59 2021
Last error function:      ext4_lookup
Last error line #:        1704
Last error inode #:       12
Last error err:           EFSCORRUPTED

So it's not just the Google case.  I'd argue that for most system
administrators, one of the most useful things is when the file system
was first found to be corrupted, so they can try correlating file
system corruptions with, say, reports of I/O errors, or OOM kills,
etc.  This can also be useful for correlating the start of file system
problems with problems at the application layer --- say, MongoDB,
MySQL, etc.

The reason why a notification system is useful is that if you are
using a database with some kind of high-availability replication system,
and if there are problems detected in the file system of the primary
MySQL server, you'd want to have the system fail over to the secondary
MySQL server.  Sure, you *could* do this by polling the superblock, but
that's not the most efficient way to do things.

> There was a concern raised against my original submission which did
> superblock watching, due to the fact that a superblock is an internal
> kernel structure that must not be visible to the userspace.  It was
> suggested we should be speaking in terms of paths and mountpoint.  But,
> considering the existence of FAN_MARK_FILESYSTEM, I think it is the exact
> same object I was looking for when proposing the watch_sb syscall.

Sure, using the owner of the root directory (the mountpoint) might be
one way of doing things.  I think CAP_SYS_ADMIN is also fine, since
for many of the use cases, such as shutting down the file system so
the database server can fail over to the secondary service, you're
going to need root anyway.  (Especially for ext4, the vast majority of
the errors are going to be FAN_MARK_FILESYSTEM anyway.)

> The main use case is, as you said, corruption detection with enough
> information to allow us to trigger automatic recovery and data rebuilding
> tools.  I understand now I can drop most of the debug info, as you
> mentioned.  In this sense, the feature looks more like netoops.

Things like the bad block number, and the more structured information,
are nice to have.  Using structured information might also be a more
efficient way to get the information to userspace.  So one could
imagine using, say, a 32-character error text code, say,
"ext4_bad_dir_block_csum", followed by a 64-bit inode number, a 64-bit
containing directory inode number, a 64-bit block number, and 8 bits of
filename length, followed by the first 7 characters of the filename,
followed by the last 8 characters of the filename.  That's 72 bytes,
which is quite compact.  Adding something like 16 bytes of function
name and 2 bytes of line number would net 90 bytes.
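
Purely as an illustration of how compact that is, the record described
above might be laid out roughly like this (a sketch only; the structure
and field names are hypothetical, not a proposed ABI, and __u64/__u8
come from <linux/types.h>):

struct ext4_error_record_sketch {
	char	error_code[32];	/* e.g. "ext4_bad_dir_block_csum" */
	__u64	inode;		/* inode the error applies to */
	__u64	dir_inode;	/* containing directory inode */
	__u64	block;		/* block number involved */
	__u8	name_len;	/* length of the original filename */
	char	name_head[7];	/* first 7 characters of the filename */
	char	name_tail[8];	/* last 8 characters of the filename */
} __attribute__((packed));	/* 32+8+8+8+1+7+8 = 72 bytes */

With the optional 16 bytes of function name and 2 bytes of line number
appended, it grows to the 90 bytes mentioned above.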

> But there are other uses that interests us, like pattern analysis of
> error locations, such that we can detect and perhaps predict errors.
> One idea that was mentioned is if an error occurs frequently enough
> in a specific function, there might be a bug in that function.  This is
> one of the reasons we are pushing to include function:line in the error
> message.

I agree.  But if this is causing too much resistance, we can just
encode some of the error information as a file system specific text or
byte string.  Even if we are restricted to, say, 96 or 80 bytes, there's
a lot we can do, even if we can't do notification continuations.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-02 22:34         ` Theodore Ts'o
@ 2021-02-08 18:49           ` Gabriel Krisman Bertazi
  2021-02-08 22:19             ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-02-08 18:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: jack, viro, amir73il, dhowells, david, darrick.wong, khazhy,
	linux-fsdevel, kernel

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
>> 
>> Thanks for the explanation.  That makes sense to me.  For corruptions
>> where it is impossible to map to a mountpoint, I thought they could be
>> considered global filesystem errors, being exposed only to someone
>> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
>
> At least for ext4, there are only 3 ext4_error_*() that we could map
> to a subtree without having to make changes to the call points:
>
> % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> 3
> % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> 79
> % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> 42
>
> So in practice, unless we want to make a lot of changes to ext4, most
> of them will be global file system errors....
>
>> But, as you mentioned regarding the google use case, the entire idea of
>> watching a subtree is a bit beyond the scope of my use-case, and was
>> only added given the feedback on the previous proposal of this feature.
>> While nice to have, I don't have the need to watch different mountpoints
>> for errors, only the entire filesystem.
>
> I suspect that for most use cases, the most interesting thing is the
> first error.  We already record this in the ext4 superblock, because
> unfortunately, I can't guarantee that system administrators have
> correctly configured their system logs, so when handling upstream bug
> reports, I can just ask them to run dumpe2fs -h on the file system:
>
> FS Error count:           2
> First error time:         Tue Feb  2 16:27:42 2021
> First error function:     ext4_lookup
> First error line #:       1704
> First error inode #:      12
> First error err:          EFSCORRUPTED
> Last error time:          Tue Feb  2 16:27:59 2021
> Last error function:      ext4_lookup
> Last error line #:        1704
> Last error inode #:       12
> Last error err:           EFSCORRUPTED
>
> So it's not just the Google case.  I'd argue for most system
> administrator, one of the most useful things is when the file system
> was first found to be corrupted, so they can try correlating file
> system corruptions, with, say, reports of I/O errors, or OOM kils,
> etc.  This can also be useful for correlating the start of file system
> problems with problems at the application layer --- say, MongoDB,
> MySQL, etc.
>
> The reason why a notification system useful is because if you are
> using database some kind of high-availability replication system, and
> if there are problems detected in the file system of the primary MySQL
> server, you'd want to have the system fail over to the secondary MySQL
> server.  Sure, you *could* do this by polling the superblock, but
> that's not the most efficient way to do things.

Hi Ted,

I think this comes full circle back to my original proposal.  It
doesn't have the complexities of objects other than superblock
notifications, and it doesn't require allocations.  I sent an RFC for
that a while ago [1] which resulted in this discussion and the current
implementation.

For the sake of having a proposal and a way to move forward, I'm not
sure what the next step here would be.  I could revive the previous
implementation, addressing some issues like avoiding the superblock
name, the way we refer to blocks, and using CAP_SYS_ADMIN.  I think that
implementation solves the use case you explained with more simplicity.
But I'm not sure Darrick and Dave (all in cc) will be convinced by this
approach of a global pipe where we send messages for the entire
filesystem, as Dave described it in the previous implementation.

Are you familiar with that implementation?

[1] https://www.spinics.net/lists/linux-ext4/msg75742.html


-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-08 18:49           ` Gabriel Krisman Bertazi
@ 2021-02-08 22:19             ` Dave Chinner
  2021-02-09  1:08               ` Theodore Ts'o
                                 ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Dave Chinner @ 2021-02-08 22:19 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Theodore Ts'o, jack, viro, amir73il, dhowells, darrick.wong,
	khazhy, linux-fsdevel, kernel

On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> "Theodore Ts'o" <tytso@mit.edu> writes:
> 
> > On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> >> 
> >> Thanks for the explanation.  That makes sense to me.  For corruptions
> >> where it is impossible to map to a mountpoint, I thought they could be
> >> considered global filesystem errors, being exposed only to someone
> >> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
> >
> > At least for ext4, there are only 3 ext4_error_*() that we could map
> > to a subtree without having to make changes to the call points:
> >
> > % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> > 3
> > % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> > 79
> > % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> > 42
> >
> > So in practice, unless we want to make a lot of changes to ext4, most
> > of them will be global file system errors....
> >
> >> But, as you mentioned regarding the google use case, the entire idea of
> >> watching a subtree is a bit beyond the scope of my use-case, and was
> >> only added given the feedback on the previous proposal of this feature.
> >> While nice to have, I don't have the need to watch different mountpoints
> >> for errors, only the entire filesystem.
> >
> > I suspect that for most use cases, the most interesting thing is the
> > first error.  We already record this in the ext4 superblock, because
> > unfortunately, I can't guarantee that system administrators have
> > correctly configured their system logs, so when handling upstream bug
> > reports, I can just ask them to run dumpe2fs -h on the file system:
> >
> > FS Error count:           2
> > First error time:         Tue Feb  2 16:27:42 2021
> > First error function:     ext4_lookup
> > First error line #:       1704
> > First error inode #:      12
> > First error err:          EFSCORRUPTED
> > Last error time:          Tue Feb  2 16:27:59 2021
> > Last error function:      ext4_lookup
> > Last error line #:        1704
> > Last error inode #:       12
> > Last error err:           EFSCORRUPTED
> >
> > So it's not just the Google case.  I'd argue for most system
> > administrator, one of the most useful things is when the file system
> > was first found to be corrupted, so they can try correlating file
> > system corruptions, with, say, reports of I/O errors, or OOM kils,
> > etc.  This can also be useful for correlating the start of file system
> > problems with problems at the application layer --- say, MongoDB,
> > MySQL, etc.
> >
> > The reason why a notification system useful is because if you are
> > using database some kind of high-availability replication system, and
> > if there are problems detected in the file system of the primary MySQL
> > server, you'd want to have the system fail over to the secondary MySQL
> > server.  Sure, you *could* do this by polling the superblock, but
> > that's not the most efficient way to do things.
> 
> Hi Ted,
> 
> I think this closes a full circle back to my original proposal.  It
> doesn't have the complexities of objects other than superblock
> notifications, doesn't require allocations.  I sent an RFC for that a
> while ago [1] which resulted in this discussion and the current
> implementation.

Yup, we're back to "Design for Google/ext4 requirements only", and
ignoring that other filesystems and users also have non-trivial
requirements for userspace error notifications.

> For the sake of a having a proposal and a way to move forward, I'm not
> sure what would be the next step here.  I could revive the previous
> implementation, addressing some issues like avoiding the superblock
> name, the way we refer to blocks and using CAP_SYS_ADMIN.  I think that
> implementation solves the usecase you explained with more simplicity.
> But I'm not sure Darrick and Dave (all in cc) will be convinced by this
> approach of global pipe where we send messages for the entire
> filesystem, as Dave described it in the previous implementation.

Nope, not convinced at all. As a generic interface, it cannot be
designed explicitly for the needs of a single filesystem, especially
when there are other filesystems needing to implement similar
functionality.

As Amir pointed out earlier in the thread, XFS already has
extensive per-object verification and error reporting facilities
that we would like to hook up to such a generic error reporting
mechanism. These use generic mechanisms within XFS, and we've
largely standardised the code internally to implement this (e.g.
xfs_failaddr as a mechanism for efficiently encoding the exact check
that failed out of the hundreds of verification checks we do).

If we've already got largely standardised, efficient mechanisms for
doing all of this in a filesystem, then why would we want to throw
that all away when implementing a generic userspace notification
channel? We know exactly what we need to present with userspace, so
even if other filesystems don't need exactly the same detail of
information, they still need to supply a subset of that same
information to userspace.

The ext4-based proposals keep mentioning dumping text strings and
encoded structures that are ext4 error specific, instead of starting
from a generic notification structure that defines the object in the
filesystem and location within the object that the notification is
for, e.g. the {bdev, object, offset, length} object ID tuple I
mention here:

https://lore.kernel.org/linux-ext4/20201210220914.GG4170059@dread.disaster.area/

For XFS, we want to be able to hook up the verifier error reports
to a notification. We want to be able to hook all our corruption
reports to a notification. We want to be able to hook all our
writeback errors to a notification. We want to be able to hook all
our ENOSPC and EDQUOT errors to a notification. And that's just the
obvious stuff that notifications are useful for.

If you want an idea of all the different types of metadata objects
we need to have different notifications for, look at the GETFSMAP
ioctl man page. It lists all the different types of objects we are
likely to emit notifications for from XFS (e.g. free space
btree corruption at record index X to Y) because, well, that's the
sort of information we're already dumping to the kernel log....

Hence from a design perspective, we need to separate the contents of
the notification from the mechanism used to configure, filter and
emit notifications to userspace.  That is, it doesn't matter if we
add a magic new syscall or use fanotify to configure watches and
transfer messages to userspace, the contents of the message are going
to be exactly the same, and the API that the filesystem
implementations are going to call to emit a notification to
userspace is exactly the same.

So a generic message structure looks something like this:

<notification type>		(msg structure type)
<notification location>		(encoded file/line info)
<object type>			(inode, dir, btree, bmap, block, etc)
<object ID>			{bdev, object}
<range>				{offset, length} (range in object)
<notification version>		(notification data version)
<notification data>		(filesystem specific data)

The first six fields are generic and always needed and defined by
the kernel/user ABI (i.e. fixed forever in time). The notification
data is the custom message information from the filesystem, defined
by the filesystem, and is not covered by kernel ABI requirements
(similar to current dmesg output). That message could be a string,
an encoded structure, etc, but it's opaque to the notification layer
and just gets dumped to userspace in the notification. Generic tools
can parse the generic fields to give basic notification information,
debug/diagnostic tools can turn that fs specific information into
something useful for admins and support engineers.

IOWs, there is nothing that needs to be ext4 or XFS specific in the
notification infrastructure, just enough generic information for
generic tools to be useful, and a custom field for filesystem
specific diagnostic information to be included in the notification.
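
A rough C sketch of the layout above might look like this (the name,
field names and field widths here are illustrative only, not a proposed
ABI):

struct fs_notification_sketch {
	__u32	type;		/* msg structure type */
	__u32	location;	/* encoded file/line info */
	__u32	object_type;	/* inode, dir, btree, bmap, block, ... */
	__u32	data_version;	/* version of the fs-specific payload */
	__u64	bdev;		/* object ID: block device */
	__u64	object;		/* object ID: object within that bdev */
	__u64	offset;		/* start of the range within the object */
	__u64	length;		/* length of the range */
	__u32	data_len;	/* size of the fs-specific payload */
	__u8	data[];		/* opaque fs-specific payload */
};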

At this point, we just don't care about where in the filesystem the
notification is generated - the notification infrastructure assigns
the errors according to the scope the filesystem maps the object
type to. e.g. if fanotify is the userspace ABI, then global metadata
corruptions go to FAN_MARK_FILESYSTEM watchers but not FAN_MARK_MOUNT.
The individual operations that get an -EFSCORRUPTED error emit a
notification to FAN_MARK_INODE watchers on that inode. And so on.
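
As a sketch of what that scoping could look like from userspace,
assuming a hypothetical FAN_ERROR event mask were added
(fanotify_init(), fanotify_mark() and the FAN_CLASS_NOTIF /
FAN_MARK_* flags are existing API; FAN_ERROR and its value are made
up here, as is the example path):

#include <sys/fanotify.h>
#include <fcntl.h>
#include <stdio.h>

#define FAN_ERROR	0x00100000	/* hypothetical, not an existing flag */

int main(void)
{
	int fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/* Global metadata errors: watch the whole filesystem backing /mnt. */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_ERROR, AT_FDCWD, "/mnt") < 0)
		perror("fanotify_mark (filesystem)");

	/* Errors scoped to a single file: mark just that inode. */
	if (fanotify_mark(fd, FAN_MARK_ADD, FAN_ERROR,
			  AT_FDCWD, "/mnt/data/db.file") < 0)
		perror("fanotify_mark (inode)");

	/* ... read(fd, ...) and parse the events here ... */
	return 0;
}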

If a new syscall is added, then it also needs to be able to scope
error notifications like this because we would really like to have
per-directory and per-inode notifications supported if at all
possible. But that is a separate discussion to the message contents
and API filesystems will use to create notifications on demand...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-08 22:19             ` Dave Chinner
@ 2021-02-09  1:08               ` Theodore Ts'o
  2021-02-09  5:12                 ` Khazhismel Kumykov
  2021-02-09  8:55                 ` Dave Chinner
  2021-02-09 17:35               ` Jan Kara
  2021-02-10  0:49               ` Darrick J. Wong
  2 siblings, 2 replies; 28+ messages in thread
From: Theodore Ts'o @ 2021-02-09  1:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Gabriel Krisman Bertazi, jack, viro, amir73il, dhowells,
	darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 09:19:16AM +1100, Dave Chinner wrote:
> Nope, not convinced at all. As a generic interface, it cannot be
> designed explicitly for the needs of a single filesystem, especially
> when there are other filesystems needing to implement similar
> functionality.
>
> As Amir pointed up earlier in the thread, XFS already already has
> extensive per-object verification and error reporting facilicities...

Sure, but asking Collabora to design something which works for XFS and
not for ext4 is also not fair.

If we can't design something that makes XFS happy, maybe it will be
better to design something specific for ext4.  Alternatively, perhaps
the only thing that can be made generic is to avoid scope creep, and
just design something which allows telling userspace "something is
wrong with the file system", and leaving it at that.

But asking Collabora to design something for XFS that doesn't work for
ext4 is an absolute non-starter.

> If we've already got largely standardised, efficient mechanisms for
> doing all of this in a filesystem, then why would we want to throw
> that all away when implementing a generic userspace notification
> channel?

You don't.  And if you want to implement something that works
perfectly for xfs, but doesn't work for ext4, feel free.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09  1:08               ` Theodore Ts'o
@ 2021-02-09  5:12                 ` Khazhismel Kumykov
  2021-02-09  8:55                 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Khazhismel Kumykov @ 2021-02-09  5:12 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Gabriel Krisman Bertazi, jack, Al Viro, amir73il,
	dhowells, darrick.wong, linux-fsdevel, kernel

On Mon, Feb 8, 2021 at 5:08 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Feb 09, 2021 at 09:19:16AM +1100, Dave Chinner wrote:
> > Nope, not convinced at all. As a generic interface, it cannot be
> > designed explicitly for the needs of a single filesystem, especially
> > when there are other filesystems needing to implement similar
> > functionality.
> >
> > As Amir pointed up earlier in the thread, XFS already already has
> > extensive per-object verification and error reporting facilicities...
>
> Sure, but asking Collabora to design something which works for XFS and
> not for ext4 is also not fair.
>
> If we can't design something that makes XFS happy, maybe it will be
> better to design something specific for ext4.  Alternatively, perhaps
> the only thing that can be made generic is to avoid scope creep, and
> just design something which allows telling userspace "something is
> wrong with the file system", and leaving it at that.

It sounds like the two asks are pretty compatible, and we are
interested, I think, in some sort of generic reporting infra, rather
than re-inventing it separately for ext4, xfs, etc. (e.g., we've found
ENOSPC useful to get notifications for in tmpfs, and so on...)

ext4 mostly needs filesystem-wide notification, passing
variable-length data, without sleeping, allocs, or unsafe locks? XFS
additionally wants per-mount and per-inode scopes? So, something that
notifies per-fs, and leaves open the possibility of mount & inode
scopes for those filesystems that desire it, w/ a generic "message"
format, seems like the way to go? watch_queue or similar seems nice due
to not allocating. The seeming 128-byte limit per message, though, seems
too short if we want to be able to send strings or lots of metadata, so
an alternative format with larger maximums seems necessary. (IMO this
is preferable to chaining 128-byte notifications w/ 48-byte headers
each)

What to include in the "generic" header then becomes a hot topic... To
my naive eyes the suggested 6 fields don't seem outrageous, but an
alternative could be that it's all filesystem-specific info, leaving
the only generic fields as message type/version/length, since,
regardless of the data you send, you can use the same generic
interface for hooking and preparing/sending the message. A fully
featured monitoring system would want to peek into per-fs data anyhow,
I'd think.
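
A sketch of that minimal alternative, where only type/version/length
are generic and everything else is per-filesystem (all names here are
hypothetical):

struct fs_error_msg_sketch {
	__u32	type;		/* which filesystem / message family */
	__u32	version;	/* format version of the payload below */
	__u32	len;		/* total message length, header included */
	__u8	payload[];	/* opaque filesystem-specific data */
};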

>
> But asking Collabora to design something for XFS, but doesn't work for
> ext4, is an absolute non-starter.
>
> > If we've already got largely standardised, efficient mechanisms for
> > doing all of this in a filesystem, then why would we want to throw
> > that all away when implementing a generic userspace notification
> > channel?
>
> You don't.  And if you want to implement something that works
> perfectly for xfs, but doesn't work for ext4, feel free.
>
> Cheers,
>
>                                         - Ted


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09  1:08               ` Theodore Ts'o
  2021-02-09  5:12                 ` Khazhismel Kumykov
@ 2021-02-09  8:55                 ` Dave Chinner
  2021-02-09 17:57                   ` Theodore Ts'o
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2021-02-09  8:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Gabriel Krisman Bertazi, jack, viro, amir73il, dhowells,
	darrick.wong, khazhy, linux-fsdevel, kernel

On Mon, Feb 08, 2021 at 08:08:33PM -0500, Theodore Ts'o wrote:
> On Tue, Feb 09, 2021 at 09:19:16AM +1100, Dave Chinner wrote:
> > Nope, not convinced at all. As a generic interface, it cannot be
> > designed explicitly for the needs of a single filesystem, especially
> > when there are other filesystems needing to implement similar
> > functionality.
> >
> > As Amir pointed up earlier in the thread, XFS already already has
> > extensive per-object verification and error reporting facilicities...
> 
> Sure, but asking Collabora to design something which works for XFS and
> not for ext4 is also not fair.

<blink>

No, Ted, I did no such thing.

Do you really think that I have so little respect for Gabriel, Amir
or Jan or anyone else on -fsdevel that I would waste their precious
time by behaving like a toxic, selfish, narcissistic asshole who
cares only about a single filesystem to the exclusion of everything
else?

Maybe you haven't seen the big picture problem yet? That is,
multiple actors have spoken of their need for a generic fs
notification functionality and their requirements are not wholly
centered around a single filesystem. The application stacks that
need these notifications don't care what filesystem is being used to
store the data; they just want to know when certain things happen to
whatever filesystem their data is in and they want it in a
single, common, well defined format.

They most definitely do not want to have N different ENOSPC
message formats that they have to understand in userspace. They want
one format that covers ENOSPC, shutdown, emergency remount-ro, inode
corruption, data corruption, bad blocks, writeback failures, etc.,
across all filesystems. This cannot be achieved by re-implementing
the notification wheel repeatedly in every filesystem we need to
provide notifications from.

Hence we need a notification subsystem that uses common messages
for common events across ext4, ceph, btrfs, gfs2, NFS and other
filesystems. There is nothing in what I described that is XFS
specific, and quite frankly I don't give a shit about XFS here - I
just used it as an example to derive a generic message format that
covers a large number of the requirements we already have for
information the notification subsystem needs to provide.

So I'll make this very clear, because it is a fundamental,
non-negotiable requirement:

*WE* *DO* *NOT* *WANT* *COMPETING* *FILESYSTEM* *NOTIFICATION*
*SUBSYSTEMS* *IN* *THE* *KERNEL*.

That means we have to work together to find common ground and a
solution that works for everyone.  What I've suggested allows all
filesystems to supply the same information for the same events.  It
also allows filesystems to include their own private diagnostic
information appended to the generic message, thereby fulfilling
the goals of both David Howells' original patchset and Gabriel's
derived ext4-specific patchset.

It works for everyone - it's a win-win scenario - and it lays the
foundation for further common notifications to be created that are
useful to userspace, as well as supporting customised filesystem
specific notifications, which largely makes it future proof.

The design and message formats can be refined simply by
collaborating to ensure that everyone's requirements are stated and
met.  If you have technical comments on this proposal, then I'm all
ears.

You know what Google's requirements for notifications are, so how
about you go back to my email and respond as to whether the
message format contains enough information for your employer's
needs.  This way we can improve the structure and ensure that the
resulting message format and infrastructure design can do what
everyone needs.

I should not have to remind you of your responsibilities, Ted.
Please try harder to understand what other people say and be
truthful, respectful and constructive in future discussions.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-08 22:19             ` Dave Chinner
  2021-02-09  1:08               ` Theodore Ts'o
@ 2021-02-09 17:35               ` Jan Kara
  2021-02-10  0:22                 ` Darrick J. Wong
  2021-02-10  7:46                 ` Dave Chinner
  2021-02-10  0:49               ` Darrick J. Wong
  2 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2021-02-09 17:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Gabriel Krisman Bertazi, Theodore Ts'o, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

On Tue 09-02-21 09:19:16, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> > "Theodore Ts'o" <tytso@mit.edu> writes:
> > > On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> > >> 
> > >> Thanks for the explanation.  That makes sense to me.  For corruptions
> > >> where it is impossible to map to a mountpoint, I thought they could be
> > >> considered global filesystem errors, being exposed only to someone
> > >> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
> > >
> > > At least for ext4, there are only 3 ext4_error_*() that we could map
> > > to a subtree without having to make changes to the call points:
> > >
> > > % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> > > 3
> > > % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> > > 79
> > > % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> > > 42
> > >
> > > So in practice, unless we want to make a lot of changes to ext4, most
> > > of them will be global file system errors....
> > >
> > >> But, as you mentioned regarding the google use case, the entire idea of
> > >> watching a subtree is a bit beyond the scope of my use-case, and was
> > >> only added given the feedback on the previous proposal of this feature.
> > >> While nice to have, I don't have the need to watch different mountpoints
> > >> for errors, only the entire filesystem.
> > >
> > > I suspect that for most use cases, the most interesting thing is the
> > > first error.  We already record this in the ext4 superblock, because
> > > unfortunately, I can't guarantee that system administrators have
> > > correctly configured their system logs, so when handling upstream bug
> > > reports, I can just ask them to run dumpe2fs -h on the file system:
> > >
> > > FS Error count:           2
> > > First error time:         Tue Feb  2 16:27:42 2021
> > > First error function:     ext4_lookup
> > > First error line #:       1704
> > > First error inode #:      12
> > > First error err:          EFSCORRUPTED
> > > Last error time:          Tue Feb  2 16:27:59 2021
> > > Last error function:      ext4_lookup
> > > Last error line #:        1704
> > > Last error inode #:       12
> > > Last error err:           EFSCORRUPTED
> > >
> > > So it's not just the Google case.  I'd argue for most system
> > > administrator, one of the most useful things is when the file system
> > > was first found to be corrupted, so they can try correlating file
> > > system corruptions, with, say, reports of I/O errors, or OOM kils,
> > > etc.  This can also be useful for correlating the start of file system
> > > problems with problems at the application layer --- say, MongoDB,
> > > MySQL, etc.
> > >
> > > The reason why a notification system useful is because if you are
> > > using database some kind of high-availability replication system, and
> > > if there are problems detected in the file system of the primary MySQL
> > > server, you'd want to have the system fail over to the secondary MySQL
> > > server.  Sure, you *could* do this by polling the superblock, but
> > > that's not the most efficient way to do things.
> > 
> > Hi Ted,
> > 
> > I think this closes a full circle back to my original proposal.  It
> > doesn't have the complexities of objects other than superblock
> > notifications, doesn't require allocations.  I sent an RFC for that a
> > while ago [1] which resulted in this discussion and the current
> > implementation.
> 
> Yup, we're back to "Design for Google/ext4 requirements only", and
> ignoring that other filesystems and users also have non-trivial
> requirements for userspace error notifications.
> 
> > For the sake of a having a proposal and a way to move forward, I'm not
> > sure what would be the next step here.  I could revive the previous
> > implementation, addressing some issues like avoiding the superblock
> > name, the way we refer to blocks and using CAP_SYS_ADMIN.  I think that
> > implementation solves the usecase you explained with more simplicity.
> > But I'm not sure Darrick and Dave (all in cc) will be convinced by this
> > approach of global pipe where we send messages for the entire
> > filesystem, as Dave described it in the previous implementation.
> 
> Nope, not convinced at all. As a generic interface, it cannot be
> designed explicitly for the needs of a single filesystem, especially
> when there are other filesystems needing to implement similar
> functionality.
> 
> As Amir pointed up earlier in the thread, XFS already already has
> extensive per-object verification and error reporting facilicities
> that we would like to hook up to such a generic error reporting
> mechanism. These use generic mechanisms within XFS, and we've
> largely standardised the code interally to implement this (e.g. the
> xfs_failaddr as a mechanism of efficiently encoding the exact check
> that failed out of the hundreds of verification checks we do).
> 
> If we've already got largely standardised, efficient mechanisms for
> doing all of this in a filesystem, then why would we want to throw
> that all away when implementing a generic userspace notification
> channel? We know exactly what we need to present with userspace, so
> even if other filesystems don't need exactly the same detail of
> information, they still need to supply a subset of that same
> information to userspace.
> 
> The ext4-based proposals keep mentioning dumping text strings and
> encoded structures that are ext4 error specific, instead of starting
> from a generic notification structure that defines the object in the
> filesystem and location within the object that the notification is
> for. e.g the {bdev, object, offset, length} object ID tuple I
> mention here:
> 
> https://lore.kernel.org/linux-ext4/20201210220914.GG4170059@dread.disaster.area/
> 
> For XFS, we want to be able to hook up the verifier error reports
> to a notification. We want to be able to hook all our corruption
> reports to a notification. We want to be able to hook all our
> writeback errors to a notification. We want to be able to hook all
> our ENOSPC and EDQUOT errors to a notification. And that's just the
> obvious stuff that notifications are useful for.

I agree with you here but I'd like to get the use cases spelled out to be
able to better evaluate the information we need to pass. I can imagine that
for ENOSPC errors this can be stuff like thin provisioning sending a red
alert to the sysadmin - this would be an fs-wide event. I have a somewhat
hard time coming up with a case where notification of ENOSPC / EDQUOT for a
particular file / dir would be useful.

I can see a use case where an application wishes to monitor all its files /
dirs for any type of fatal error (ENOSPC, EDQUOT, EIO). Here scoping
makes a lot of sense from the application's POV. It may be somewhat tricky
to reliably provide the notification though. If we, say, spot an
inconsistency in a block allocation structure during page writeback (be it
a btree in the XFS case or a bitmap in the ext4 case), we report the error
in the code for that structure, but that code is not necessarily aware of
the inode, so we need to make sure to generate another notification in the
upper layers where we can associate the error with the inode as well. Even
worse, if we spot some error, e.g. during journal commit, we (at least in
the ext4 case) don't have enough information to trace back the affected
inodes anymore. So how do we handle such cases? Do we actively queue error
notifications for all inodes? Or do we lazily wait for some operation
associated with a particular inode to fail to queue the notification? I can
see pros and cons for both...

What use cases did you have in mind?

> If you want an idea of all the different types of metadata objects
> we need to have different notifications for, look at the GETFSMAP
> ioctl man page. It lists all the different types of objects we are
> likely to emit notifications for from XFS (e.g. free space
> btree corruption at record index X to Y) because, well, that's the
> sort of information we're already dumping to the kernel log....
> 
> Hence from a design perspective, we need to separate the contents of
> the notification from the mechanism used to configure, filter and
> emit notifications to userspace.  That is, it doesn't matter if we
> add a magic new syscall or use fanotify to configure watches and
> transfer messages to userspace, the contents of the message is going
> to be the exactly the same, and the API that the filesystem
> implementations are going to call to emit a notification to
> userspace is exactly the same.
> 
> So a generic message structure looks something like this:
> 
> <notification type>		(msg structure type)
> <notification location>		(encoded file/line info)
> <object type>			(inode, dir, btree, bmap, block, etc)
> <object ID>			{bdev, object}
> <range>				{offset, length} (range in object)
> <notification version>		(notification data version)
> <notification data>		(filesystem specific data)

There's a caveat, though, that 'object type' is necessarily filesystem
specific, and with new filesystems wanting to support this we'll likely need
to add more object types. So it is questionable how a "generic error parser"
would be able to use this type of information and whether it shouldn't
rather live in the fs-specific blob.

Also, versioning fs-specific blobs with 'notification version' tends to get
somewhat cumbersome if you need to update the scheme and thus bump the
version, which breaks all existing parsers (and I won't even speak about
the percentage of parsers that won't bother with checking the version and
just blindly try to parse whatever they get, assuming incorrect things ;).
We've been there more than once... But this is more of a side remark - once
other problems are settled I believe we can come up with a reasonably
extensible scheme for blob passing pretty easily.
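
For what it's worth, one common way to keep such a blob extensible
without hard version bumps - offered only as a sketch of the kind of
scheme alluded to above, with hypothetical names - is to carry an
explicit size and only ever append fields:

struct ext4_error_blob_sketch {
	__u16	size;		/* bytes of this blob actually present */
	__u16	flags;
	__u32	error;		/* EFSCORRUPTED, EIO, ... */
	__u64	inode;
	__u64	block;
	/* new fields are only ever appended here */
};

/* Consumer side: only read a field the producer actually sent.
 * (offsetof() is from <stddef.h>.) */
#define BLOB_HAS_FIELD(blob, field) \
	((blob)->size >= offsetof(struct ext4_error_blob_sketch, field) + \
			 sizeof((blob)->field))

Old parsers use 'size' to skip trailing fields they don't understand;
new parsers use it to tell whether an optional trailing field is
present at all.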

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09  8:55                 ` Dave Chinner
@ 2021-02-09 17:57                   ` Theodore Ts'o
  2021-02-10  0:52                     ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2021-02-09 17:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Gabriel Krisman Bertazi, jack, viro, amir73il, dhowells,
	darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 07:55:01PM +1100, Dave Chinner wrote:
> 
> That means we have to work together to find common ground and a
> solution that works for everyone.  What I've suggested allows all
> filesystems to supply the same information for the same events.  It
> also allows filesystems to include their own private diagnostic
> information appended to the generic message, thereby fulfulling both
> the goals of both David Howells' original patchset and Gabriel's
> derived ext4 specific patchset.

So the simple common ground would be a plain text message, which is
what *I* had suggested.  But I saw you requesting some complex object
based system which XFS has.

I think if we want to keep something that is common, it's going to
have to be kept simple.  Do you not agree?

     	   		    	    - Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2021-01-21 22:44 ` Theodore Ts'o
@ 2021-02-10  0:09 ` Darrick J. Wong
  2021-02-10  7:23   ` Amir Goldstein
  2021-02-10 23:29   ` Gabriel Krisman Bertazi
  3 siblings, 2 replies; 28+ messages in thread
From: Darrick J. Wong @ 2021-02-10  0:09 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: tytso, jack, viro, amir73il, dhowells, david, darrick.wong,
	khazhy, linux-fsdevel, kernel

Hi Gabriel,

Sorry that it has taken me nearly three weeks to respond to this; I've
been buried in XFS refactoring for 5.12. :(

I forget if I've ever really laid out the user stories (or whatever)
that I envision for XFS.  I think they're pretty close to what you've
outlined for ext4, downthread discussions notwithstanding. :/

The first usecase is, I think, the common one where a program has an
open file descriptor and wants to know when some kind of error happens
to that open file or its metadata so that it can start application-level
recovery processes.

I think the information exported by struct fanotify_event_error_hdr
corresponds almost exactly to this use case since it only identifies an
error, an inode, and an offset, which are (more or less) general
concepts inside the vfs, and probably could be accomplished without help
from individual filesystems.

I would add a u32 context code so that programs can distinguish between
errors in file data, xattrs, inode metadata, or a general filesystem
error, but otherwise it looks fine to me.  I don't think we need to (or
should) report function names and line numbers here.
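
To make the suggestion concrete, a sketch of the proposal's error
record with such a context code added (the structure layout and the
FAN_ERROR_CTX_* values are hypothetical; only struct
fanotify_event_info_header is existing API):

#define FAN_ERROR_CTX_DATA	0	/* error in file data */
#define FAN_ERROR_CTX_XATTR	1	/* error in extended attributes */
#define FAN_ERROR_CTX_METADATA	2	/* error in inode metadata */
#define FAN_ERROR_CTX_FS	3	/* general filesystem error */

struct fanotify_event_error_hdr_sketch {
	struct fanotify_event_info_header hdr;
	__u32	error;		/* errno-style error code */
	__u32	context;	/* one of FAN_ERROR_CTX_* above */
	__u64	inode;
	__u64	offset;
};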

(Someone would have to make fanotify work for non-admin processes
though, which if that fails, makes this part less generally useful.)

-----

The second usecase is very filesystem specific and administrative in
nature, and I bet this would be where our paths diverge.  But that's
probably ok, since exposing the details of an event requires a client
that's tightly integrated with the fs implementation details, which
pretty much means a program that we ship in {x,e2,*}fsprogs.

Over the last couple of years, XFS has grown ioctl interfaces for
reporting corruption errors to administrative programs and for xfs_scrub
to initiate checks of metadata structures.  Someday we'll be able to
perform repairs online too.

The metadata corruption reporting interfaces are split into three
categories corresponding to principal fs objects.  In other words, we
can report on the state of file metadata, allocation group metadata, or
full filesystem metadata.  So far, each of these three groups has
sufficiently few types of metadata that we can summarize what's wrong
with a u32 bitmap.

For XFS, the following would suffice for a monitoring daemon that could
become part of xfsprogs:

struct notify_xfs_corruption_error {
	__kernel_fsid_t fsid;

	__u32 group; /* FS, AG, or INODE */
	__u32 sick; /* bitset of XFS_{AG,FSOP,BS}_GEOM_SICK_* */
	union {
		struct {
			__u64 inode;
			__u32 gen;
		};
		__u32 agno;
	};
};

(A real implementation would include a flags field and not have a union
in the middle of it, but this is the bare minimum of what I think I
want for xfs_scrubd having implemented zero of this.)

Upon receipt of this structure, the monitoring daemon can translate
those three fields into calls into the [future] online repair ioctls and
fix the filesystem.  Or it can shut down the fs and kill everything
running on it, I dunno. :)
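
A sketch of that translation, just to show the shape of it: the
notification structure above and the NOTIFY_XFS_GROUP_* constants are
hypothetical, and the mapping from sick bits to scrub types is made up
for illustration, but XFS_IOC_SCRUB_METADATA, struct xfs_scrub_metadata
and the XFS_SCRUB_* constants already exist in the scrub ioctl
interface shipped with the xfsprogs headers.

#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* pulls in the scrub ioctl definitions */

/* Hypothetical group codes matching the 'group' field above. */
enum { NOTIFY_XFS_GROUP_FS, NOTIFY_XFS_GROUP_AG, NOTIFY_XFS_GROUP_INODE };

static int repair_one(int mnt_fd, const struct notify_xfs_corruption_error *n)
{
	struct xfs_scrub_metadata sm = { 0 };

	switch (n->group) {
	case NOTIFY_XFS_GROUP_INODE:
		sm.sm_type = XFS_SCRUB_TYPE_INODE;
		sm.sm_ino = n->inode;
		sm.sm_gen = n->gen;
		break;
	case NOTIFY_XFS_GROUP_AG:
		/* a real daemon would pick the scrub type from n->sick */
		sm.sm_type = XFS_SCRUB_TYPE_AGF;
		sm.sm_agno = n->agno;
		break;
	default:
		return -1;
	}
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;	/* once online repair lands */
	return ioctl(mnt_fd, XFS_IOC_SCRUB_METADATA, &sm);
}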

There would only be one monitoring daemon running for the whole xfs
filesystem, so you could require CAP_SYS_ADMIN and FAN_MARK_MOUNT to
prove that the daemon can get to the actual filesystem root directory.
IOWs, the visibility semantics can be "only the sysadmin and only in the
init namespace" initially.

This structure /could/ include an instruction pointer for more advanced
reporting, but it's not a hard requirement to have such a thing.  As far
as xfs is concerned, something decided the fs was bad, and the only
thing to do now is to recover.  I don't think it matters critically
whether the notices are presented via fanotify or watch_queue.

The tricky problem here is (I think?) how to multiplex different
filesystem types wanting to send corruption reports to userspace.  I
suppose you could define the fs metadata error format as:

	[struct fanotify_event_metadata]
	[optional struct fanotify_event_info_fid]
	[u32 magic code corresponding to statvfs.f_type?]
	[fs-specific blob of data here]

And then you'd use fanotify_event_metadata.event_len to figure out the
length of the fs-specific blob.  That way XFS could export the short
structure I outlined above, and ext4 can emit instruction pointer
addresses or strings or whatever else you and Ted settle on.
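
A sketch of how a monitoring daemon might walk such events and dispatch
on the magic code (the fanotify event-walking macros and the
*_SUPER_MAGIC values are existing API; everything after the magic code,
and the simplifying assumption that no fid record precedes it, is
hypothetical):

#include <sys/fanotify.h>
#include <linux/magic.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static void handle_one(const struct fanotify_event_metadata *meta)
{
	/* Sketch only: assumes no fanotify_event_info_fid record precedes
	 * the magic code; a real parser would walk the info records. */
	const char *p = (const char *)meta + meta->metadata_len;
	const char *end = (const char *)meta + meta->event_len;
	uint32_t magic;

	if (end - p < (long)sizeof(magic))
		return;
	memcpy(&magic, p, sizeof(magic));

	switch (magic) {
	case XFS_SUPER_MAGIC:
		/* decode the XFS structure sketched above */
		break;
	case EXT4_SUPER_MAGIC:
		/* decode the ext4-specific blob */
		break;
	}
}

static void read_events(int fanotify_fd)
{
	char buf[4096];
	ssize_t len = read(fanotify_fd, buf, sizeof(buf));
	const struct fanotify_event_metadata *meta =
		(const struct fanotify_event_metadata *)buf;

	while (len > 0 && FAN_EVENT_OK(meta, len)) {
		handle_one(meta);
		meta = FAN_EVENT_NEXT(meta, len);
	}
}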

If that sounds like "Well you go plumb in the fanotify bits with just
enough information for dispatching and then we'll go write our own xfs
specific thing"... yep. :)

To be clear: I'm advocating for cleaving these two use cases completely
apart, and not mixing them at all like what you defined below, because I
now think these are totally separate use cases.

I don't want to get too far into the implementation details, but FWIW
XFS maintains its health state tracking in the incore data structures,
so it's no problem for us to defer the actual fsnotify calls to a
workqueue if we're in an atomic context.

Ok, now I'll go through this and respond point by point.

On Wed, Jan 20, 2021 at 05:13:15PM -0300, Gabriel Krisman Bertazi wrote:
> 
> My apologies for the long email.
> 
> Please let me know your thoughts.
> 
> 1 Summary
> =========
> 
>   I'm looking for a filesystem-agnostic mechanism to report filesystem
>   errors to a monitoring tool in userspace.  I experimented first with
>   the watch_queue API but decided to move to fanotify for a few reasons.
> 
> 
> 2 Background
> ============
> 
>   I submitted a first set of patches, based on David Howells' original
>   superblock notifications patchset, that I expanded into error
>   reporting and had an example implementation for ext4.  Upstream review
>   has revealed a few design problems:
> 
>   - Including the "function:line" tuple in the notification allows the
>     uniquely identification of the error origin but it also ties the
>     decodification of the error to the source code, i.e. <function:line>
>     is expected to change between releases.
> 
>   - Useful debug data (inode number, block group) have formats specific
>     to the filesystems, and my design wouldn't be expansible to
>     filesystems other than ext4.

Yes, hence proposing one set of generic notifications for most user
programs, and a second interface for fs-specific daemons that we can
ship in ${fs}progs.  My opinions have shifted a bit since the last
posting.

>   - The implementation allowed an error string description (equivalent
>     to what would be thrown in dmesg) that is too short, as it needs to
>     fit in a single notification.
> 
>   - How the user sees the filesystem.  The original patch points to a
>     mountpoint but uses the term superblock.  This is to differentiate
>     from another mechanism in development to watch mounting operations.
> 
>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
>     notifications of objects outside of that bind mount.
> 
> 
> 2.1 Other constraints of the watch_queue API
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   watch_queue is a fairly new framework, which has one upstream user in
>   the keyring subsystem.  watch_queue is designed to submit very short
>   (max of 128 bytes) atomic notifications to userspace in a fast manner
>   through a ring buffer.  There is no current mechanism to link
>   notifications that require more than one slot and such mechanism
>   wouldn't be trivial to implement, since buffer overruns could
>   overwrite the beginning/end of a multi part notification.  In

<nod> This second round of iteration in my head showed me that the two
event notification use cases are divergent enough that the tagged
notification scheme that I think I triggered last time isn't necessary
at all.

>   addition, watch_queue requires an out-of-band overflow notification
>   mechanism, which would need to be implemented aside from the system
>   call, in a separate API.

Yikes.

> 2.2 fanotify vs watch_queue
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   watch_queue is designed for efficiency and fast submission of a large
>   number of notifications.  It doesn't require memory allocation in the
>   submission path, but with a flexibility cost, such as limited
>   notification size.  fanotify is more flexible, allows for larger
>   notifications and better filtering, but requires allocations on the
>   submission path.
> 
>   On the other hand, fanotify already has support for the visibility
>   semantics we are looking for. fsnotify allows an inode to notify only
>   its subtree, mountpoints, or the entire filesystem, depending on the
>   watcher flags, while an equivalent support would need to be written
>   from scratch for watch_queue.  fanotify also has in-band overflow
>   handling, already implemented.  Finally, since fanotify supports much
>   larger notifications, there is no need to link separate notifications,
>   preventing buffer overruns from erasing parts of a notification chain.
> 
>   fanotify is based on fsnotify, and the infrastructure for the
>   visibility semantics is mostly implemented by fsnotify itself.  It
>   would be possible to make error notifications a new mechanism on top
>   of fsnotify, without modifying fanotify, but that would require
>   exposing a very similar interface to userspace, new system calls, and
>   that doesn't seem justifiable since we can extend fanotify syscalls
>   without ABI breakage.

<nod> AFAICT it sounds like fanotify is a good fit for that first
usecase I outlined.  It'd probably work for both.

> 3 Proposal
> ==========
> 
>   The error notification framework is based on fanotify instead of
>   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
>   exposed through the fanotify_mark(2) API.
> 
>   fanotify (fsnotify-based) has the infrastructure in-place to link
>   notifications happening at filesystem objects to mountpoints and to
>   filesystems, and already exposes an interface with well defined
>   semantics of how those are exposed to watchers in different
>   mountpoints or different subtrees.
> 
>   A new message format is exposed, if the user passed
>   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
>   don't have FID/DFID records.
> 
>   A FAN_REPORT_DETAILED_ERROR record has the same struct
>   fanotify_event_metadata header, but it is followed by one or more
>   additional information record as follows:
> 
>   struct fanotify_event_error_hdr {
>   	struct fanotify_event_info_header hdr;
>   	__u32 error;
>         __u64 inode;
>         __u64 offset;
>   }
> 
>   error is a VFS generic error number that can notify generic conditions
>   like EFSCORRUPT. If hdr.len is larger than sizeof(struct
>   fanotify_event_error_hdr), this structure is followed by an optional
>   filesystem specific record that further specifies the error,
>   originating object and debug data. This record has a generic header:
> 
>   struct fanotify_event_fs_error_hdr {
>   	struct fanotify_event_error_hdr hdr;
>         __kernel_fsid_t fsid;
>         __u32 fs_error;
>   }
> 
>   fs_error is a filesystem specific error record, potentially more
>   detailed than fanotify_event_error.hdr.error . Each type of filesystem

Er... is fs_error supposed to be a type code that tells the program that
the next byte is the start of a fanotify_event_ext4_inode_error
structure?

>   error has its own record type, that is used to report different
>   information useful for each type of error.  For instance, an ext4
>   lookup error, caused by an invalid inode type read from disk, produces
>   the following record:
> 
>   struct fanotify_event_ext4_inode_error {
>   	struct fanotify_event_fs_error_hdr hdr;
>         __u64 block;
>         __u32 inode_type;
>   }
> 
>   The separation between VFS and filesystem-specific error messages has
>   the benefit of always providing some information that an error has
>   occurred, regardless of filesystem-specific support, while allowing

Ok, so I think we've outlined similar-ish proposals.

>   capable filesystems to expose specific internal data to further
>   document the issue.  This scheme leaves to the filesystem to expose
>   more meaningful information as needed.  For instance, multi-disk
>   filesystems can single out the problematic disk, network filesystems
>   can expose a network error while accessing the server.
> 
> 
> 3.1 Visibility semantics
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   Error reporting follows the same semantics of fanotify events.
>   Therefore, a watcher can request to watch the entire filesystem, a
>   single mountpoint or a subtree.
> 
> 
> 3.2 security implications
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>   fanotify requires CAP_SYS_ADMIN.  My understanding is this requirement
>   suffices for most use cases but, according to fanotify documentation,
>   it could be relaxed without issues for the existing fanotify API.  For
>     instance, watching a subtree could be allowed for a user who owns
>   that subtree.
> 
> 
> 3.3 error location
> ~~~~~~~~~~~~~~~~~~
> 
>   While exposing the exact line of code that triggered the notification
>   ties that notification to a specific kernel version, it is an
>   important information for those who completely control their
>   environment and kernel builds, such as cloud providers.  Therefore,
>   this proposal includes a mechanism to optionally include in the
>   notification the line of code where the error occurred
> 
>   A watcher who passes the flag FAN_REPORT_ERROR_LOCATION to
>   fanotify_init(2) receives an extra record for FAN_ERROR events:
> 
>   struct fanotify_event_fs_error_location {
>   	struct fanotify_event_info_header hdr;
>         u32 line;
>         char function[];
>   }
> 
>   This record identifies the place where the error occurred.  function is
>   a VLA whose size extends to the end of the region delimited by hdr.len.
>   This VLA text-encodes the function name where the error occurred.
> 
> What do you think about this interface?  Would it be acceptable as part
> of fanotify, or should it be a new fsnotify mode?

I would say separate FAN_FILE_ERROR and FAN_FS_ERROR mode bits.
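
FWIW a monitor consuming either bit wouldn't look very different from an
ordinary fanotify reader.  The sketch below assumes the proposed
FAN_REPORT_DETAILED_ERROR / FAN_FS_ERROR bits and the
fanotify_event_error_hdr layout from the RFC above; none of that exists in
a released kernel, so the numeric values are placeholders and the program
is purely illustrative:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/fanotify.h>

/* placeholder values for the bits proposed in this thread */
#ifndef FAN_REPORT_DETAILED_ERROR
#define FAN_REPORT_DETAILED_ERROR	0x00000800
#endif
#ifndef FAN_FS_ERROR
#define FAN_FS_ERROR			0x00008000
#endif

struct fanotify_event_error_hdr {	/* proposed above */
	struct fanotify_event_info_header hdr;
	uint32_t error;			/* generic VFS error code */
	uint64_t inode;
	uint64_t offset;
};

int main(int argc, char *argv[])
{
	char buf[4096];
	int fan = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_DETAILED_ERROR, O_RDONLY);

	if (argc < 2 || fan < 0 ||
	    fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_FS_ERROR, AT_FDCWD, argv[1]) < 0) {
		perror("fanotify setup");
		return 1;
	}

	for (;;) {
		ssize_t len = read(fan, buf, sizeof(buf));
		struct fanotify_event_metadata *md = (void *)buf;

		if (len <= 0)
			break;

		for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
			/* the proposed error record follows the metadata header */
			struct fanotify_event_error_hdr *err =
				(struct fanotify_event_error_hdr *)(md + 1);

			printf("error %u on inode %llu at offset %llu\n",
			       err->error,
			       (unsigned long long)err->inode,
			       (unsigned long long)err->offset);
		}
	}
	return 0;
}

(A FAN_FILE_ERROR watcher would use the same loop, just with a plain
FAN_MARK_ADD mark on the file of interest.)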

> Regarding semantics, I believe fanotify should solve the visibility
> problem for a subtree watcher not being able to see other branches
> notifications.  Do you think this would suffice?

It sounds like it.  I'll go brave reading the rest of the thread now.

--D

> 
> Thanks,
> 
> -- 
> Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09 17:35               ` Jan Kara
@ 2021-02-10  0:22                 ` Darrick J. Wong
  2021-02-10  7:46                 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2021-02-10  0:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Gabriel Krisman Bertazi, Theodore Ts'o, jack,
	viro, amir73il, dhowells, darrick.wong, khazhy, linux-fsdevel,
	kernel

On Tue, Feb 09, 2021 at 06:35:43PM +0100, Jan Kara wrote:
> On Tue 09-02-21 09:19:16, Dave Chinner wrote:
> > On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> > > "Theodore Ts'o" <tytso@mit.edu> writes:
> > > > On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> > > >> 
> > > >> Thanks for the explanation.  That makes sense to me.  For corruptions
> > > >> where it is impossible to map to a mountpoint, I thought they could be
> > > >> considered global filesystem errors, being exposed only to someone
> > > >> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
> > > >
> > > > At least for ext4, there are only 3 ext4_error_*() that we could map
> > > > to a subtree without having to make changes to the call points:
> > > >
> > > > % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> > > > 3
> > > > % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> > > > 79
> > > > % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> > > > 42
> > > >
> > > > So in practice, unless we want to make a lot of changes to ext4, most
> > > > of them will be global file system errors....
> > > >
> > > >> But, as you mentioned regarding the google use case, the entire idea of
> > > >> watching a subtree is a bit beyond the scope of my use-case, and was
> > > >> only added given the feedback on the previous proposal of this feature.
> > > >> While nice to have, I don't have the need to watch different mountpoints
> > > >> for errors, only the entire filesystem.
> > > >
> > > > I suspect that for most use cases, the most interesting thing is the
> > > > first error.  We already record this in the ext4 superblock, because
> > > > unfortunately, I can't guarantee that system administrators have
> > > > correctly configured their system logs, so when handling upstream bug
> > > > reports, I can just ask them to run dumpe2fs -h on the file system:
> > > >
> > > > FS Error count:           2
> > > > First error time:         Tue Feb  2 16:27:42 2021
> > > > First error function:     ext4_lookup
> > > > First error line #:       1704
> > > > First error inode #:      12
> > > > First error err:          EFSCORRUPTED
> > > > Last error time:          Tue Feb  2 16:27:59 2021
> > > > Last error function:      ext4_lookup
> > > > Last error line #:        1704
> > > > Last error inode #:       12
> > > > Last error err:           EFSCORRUPTED
> > > >
> > > > So it's not just the Google case.  I'd argue for most system
> > > > administrator, one of the most useful things is when the file system
> > > > was first found to be corrupted, so they can try correlating file
> > > > system corruptions, with, say, reports of I/O errors, or OOM kills,
> > > > etc.  This can also be useful for correlating the start of file system
> > > > problems with problems at the application layer --- say, MongoDB,
> > > > MySQL, etc.
> > > >
> > > > The reason why a notification system useful is because if you are
> > > > using database some kind of high-availability replication system, and
> > > > if there are problems detected in the file system of the primary MySQL
> > > > server, you'd want to have the system fail over to the secondary MySQL
> > > > server.  Sure, you *could* do this by polling the superblock, but
> > > > that's not the most efficient way to do things.
> > > 
> > > Hi Ted,
> > > 
> > > I think this closes a full circle back to my original proposal.  It
> > > doesn't have the complexities of objects other than superblock
> > > notifications, doesn't require allocations.  I sent an RFC for that a
> > > while ago [1] which resulted in this discussion and the current
> > > implementation.
> > 
> > Yup, we're back to "Design for Google/ext4 requirements only", and
> > ignoring that other filesystems and users also have non-trivial
> > requirements for userspace error notifications.
> > 
> > > For the sake of a having a proposal and a way to move forward, I'm not
> > > sure what would be the next step here.  I could revive the previous
> > > implementation, addressing some issues like avoiding the superblock
> > > name, the way we refer to blocks and using CAP_SYS_ADMIN.  I think that
> > > implementation solves the usecase you explained with more simplicity.
> > > But I'm not sure Darrick and Dave (all in cc) will be convinced by this
> > > approach of global pipe where we send messages for the entire
> > > filesystem, as Dave described it in the previous implementation.
> > 
> > Nope, not convinced at all. As a generic interface, it cannot be
> > designed explicitly for the needs of a single filesystem, especially
> > when there are other filesystems needing to implement similar
> > functionality.
> > 
> > As Amir pointed out earlier in the thread, XFS already has
> > extensive per-object verification and error reporting facilities
> > that we would like to hook up to such a generic error reporting
> > mechanism. These use generic mechanisms within XFS, and we've
> > largely standardised the code internally to implement this (e.g. the
> > xfs_failaddr as a mechanism of efficiently encoding the exact check
> > that failed out of the hundreds of verification checks we do).
> > 
> > If we've already got largely standardised, efficient mechanisms for
> > doing all of this in a filesystem, then why would we want to throw
> > that all away when implementing a generic userspace notification
> > channel? We know exactly what we need to present with userspace, so
> > even if other filesystems don't need exactly the same detail of
> > information, they still need to supply a subset of that same
> > information to userspace.
> > 
> > The ext4-based proposals keep mentioning dumping text strings and
> > encoded structures that are ext4 error specific, instead of starting
> > from a generic notification structure that defines the object in the
> > filesystem and location within the object that the notification is
> > for. e.g the {bdev, object, offset, length} object ID tuple I
> > mention here:
> > 
> > https://lore.kernel.org/linux-ext4/20201210220914.GG4170059@dread.disaster.area/
> > 
> > For XFS, we want to be able to hook up the verifier error reports
> > to a notification. We want to be able to hook all our corruption
> > reports to a notification. We want to be able to hook all our
> > writeback errors to a notification. We want to be able to hook all
> > our ENOSPC and EDQUOT errors to a notification. And that's just the
> > obvious stuff that notifications are useful for.
> 
> I agree with you here but I'd like to get the usecases spelled out to be
> able to better evaluate the information we need to pass. I can imagine for
> ENOSPC errors this can be stuff like thin provisioning sending red alert to
> sysadmin - this would be fs-wide event. I have somewhat hard time coming up
> with a case where notification of ENOSPC / EDQUOT for a particular file /
> dir would be useful.

How about a container where you've set a project quota and the container
hits that quota?

> I can see a usecase where an application wishes to monitor all its files /
> dirs for any type of fatal error (ENOSPC, EDQUOT, EIO). Here scoping
> makes a lot of sense from application POV. It may be somewhat tricky to
> reliably provide the notification though. If we, say, spot an inconsistency in
> block allocation structure during page writeback (be it btree in XFS case
> or bitmap in ext4 case), we report the error there in the code for that
> structure but that is not necessarily aware of the inode so we need to make
> sure to generate another notification in upper layers where we can associate
> the error with the inode as well. Even worse if we spot some error e.g. during
> journal commit, we (at least in ext4 case) don't have enough information to
> trace back affected inodes anymore. So how do we handle such cases? Do we
> actively queue error notifications for all inodes? Or do we lazily wait for
> some operation associated with a particular inode to fail to queue
> notification? I can see pros and cons for both...

FWIW all the error notification consumers that I have in mind (based on
internal feedback anyway) stratify into two cases: (1) the app doesn't
care what the error is--upon any error it simply signals the cluster
manager that the node failed and offlines the whole node so that
recovery can begin.  (2) is where offlining the whole node is too
expensive or impactful so they want a detailed error message sent to the
appropriate xfs utility so that it can fix the problem.  The actual
message sent to the app doesn't have to be terribly specific aside from
which file(s) are impacted.

> 
> What use cases did you have in mind?
> 
> > If you want an idea of all the different types of metadata objects
> > we need to have different notifications for, look at the GETFSMAP
> > ioctl man page. It lists all the different types of objects we are
> > likely to emit notifications for from XFS (e.g. free space
> > btree corruption at record index X to Y) because, well, that's the
> > sort of information we're already dumping to the kernel log....
> > 
> > Hence from a design perspective, we need to separate the contents of
> > the notification from the mechanism used to configure, filter and
> > emit notifications to userspace.  That is, it doesn't matter if we
> > add a magic new syscall or use fanotify to configure watches and
> > transfer messages to userspace, the contents of the message are going
> > to be exactly the same, and the API that the filesystem
> > implementations are going to call to emit a notification to
> > userspace is exactly the same.
> > 
> > So a generic message structure looks something like this:
> > 
> > <notification type>		(msg structure type)
> > <notification location>		(encoded file/line info)
> > <object type>			(inode, dir, btree, bmap, block, etc)
> > <object ID>			{bdev, object}
> > <range>				{offset, length} (range in object)
> > <notification version>		(notification data version)
> > <notification data>		(filesystem specific data)
> 
> There's a caveat though that 'object type' is necessarily filesystem
> specific, and with new filesystems wanting to support this we'll likely need
> to add more object types. So it is questionable how a "generic error parser"
> would be able to use this type of information and whether this doesn't need
> to be in the fs-specific blob.

Agreed, so I split the two cases from each other completely in the reply
I sent upstream a little while ago.

> Also versioning fs specific blobs with 'notification version' tends to get
> somewhat cumbersome if you need to update the scheme, thus bump the
> version, which breaks all existing parsers (and I won't even speak about
> the percentage of parsers that won't bother with checking the version and
> just blindly try to parse whatever they get assuming incorrect things ;).
> We've been there more than once... But this is more of a side remark - once
> other problems are settled I believe we can come up with reasonably
> extensible scheme for blob passing pretty easily.

A pity we can't just hurl a blob of json with whatever attributes
attached that we like, but maybe I shouldn't go there.  :)

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-08 22:19             ` Dave Chinner
  2021-02-09  1:08               ` Theodore Ts'o
  2021-02-09 17:35               ` Jan Kara
@ 2021-02-10  0:49               ` Darrick J. Wong
  2 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2021-02-10  0:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Gabriel Krisman Bertazi, Theodore Ts'o, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

(LOL, vger is taking super-long to get these emails to me!)

On Tue, Feb 09, 2021 at 09:19:16AM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> > "Theodore Ts'o" <tytso@mit.edu> writes:
> > 
> > > On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> > >> 
> > >> Thanks for the explanation.  That makes sense to me.  For corruptions
> > >> where it is impossible to map to a mountpoint, I thought they could be
> > >> considered global filesystem errors, being exposed only to someone
> > >> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
> > >
> > > At least for ext4, there are only 3 ext4_error_*() that we could map
> > > to a subtree without having to make changes to the call points:
> > >
> > > % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> > > 3
> > > % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> > > 79
> > > % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> > > 42
> > >
> > > So in practice, unless we want to make a lot of changes to ext4, most
> > > of them will be global file system errors....
> > >
> > >> But, as you mentioned regarding the google use case, the entire idea of
> > >> watching a subtree is a bit beyond the scope of my use-case, and was
> > >> only added given the feedback on the previous proposal of this feature.
> > >> While nice to have, I don't have the need to watch different mountpoints
> > >> for errors, only the entire filesystem.
> > >
> > > I suspect that for most use cases, the most interesting thing is the
> > > first error.  We already record this in the ext4 superblock, because
> > > unfortunately, I can't guarantee that system administrators have
> > > correctly configured their system logs, so when handling upstream bug
> > > reports, I can just ask them to run dumpe2fs -h on the file system:
> > >
> > > FS Error count:           2
> > > First error time:         Tue Feb  2 16:27:42 2021
> > > First error function:     ext4_lookup
> > > First error line #:       1704
> > > First error inode #:      12
> > > First error err:          EFSCORRUPTED
> > > Last error time:          Tue Feb  2 16:27:59 2021
> > > Last error function:      ext4_lookup
> > > Last error line #:        1704
> > > Last error inode #:       12
> > > Last error err:           EFSCORRUPTED
> > >
> > > So it's not just the Google case.  I'd argue for most system
> > > administrator, one of the most useful things is when the file system
> > > was first found to be corrupted, so they can try correlating file
> > > system corruptions, with, say, reports of I/O errors, or OOM kills,
> > > etc.  This can also be useful for correlating the start of file system
> > > problems with problems at the application layer --- say, MongoDB,
> > > MySQL, etc.
> > >
> > > The reason why a notification system useful is because if you are
> > > using database some kind of high-availability replication system, and
> > > if there are problems detected in the file system of the primary MySQL
> > > server, you'd want to have the system fail over to the secondary MySQL
> > > server.  Sure, you *could* do this by polling the superblock, but
> > > that's not the most efficient way to do things.
> > 
> > Hi Ted,
> > 
> > I think this closes a full circle back to my original proposal.  It
> > doesn't have the complexities of objects other than superblock
> > notifications, doesn't require allocations.  I sent an RFC for that a
> > while ago [1] which resulted in this discussion and the current
> > implementation.
> 
> Yup, we're back to "Design for Google/ext4 requirements only", and
> ignoring that other filesystems and users also have non-trivial
> requirements for userspace error notifications.
> 
> > For the sake of a having a proposal and a way to move forward, I'm not
> > sure what would be the next step here.  I could revive the previous
> > implementation, addressing some issues like avoiding the superblock
> > name, the way we refer to blocks and using CAP_SYS_ADMIN.  I think that
> > implementation solves the usecase you explained with more simplicity.
> > But I'm not sure Darrick and Dave (all in cc) will be convinced by this
> > approach of global pipe where we send messages for the entire
> > filesystem, as Dave described it in the previous implementation.

(Well, yeah. :))

> Nope, not convinced at all. As a generic interface, it cannot be
> designed explicitly for the needs of a single filesystem, especially
> when there are other filesystems needing to implement similar
> functionality.
> 
> As Amir pointed out earlier in the thread, XFS already has
> extensive per-object verification and error reporting facilities
> that we would like to hook up to such a generic error reporting
> mechanism. These use generic mechanisms within XFS, and we've
> largely standardised the code internally to implement this (e.g. the
> xfs_failaddr as a mechanism of efficiently encoding the exact check
> that failed out of the hundreds of verification checks we do).

<nod>

> If we've already got largely standardised, efficient mechanisms for
> doing all of this in a filesystem, then why would we want to throw
> that all away when implementing a generic userspace notification
> channel? We know exactly what we need to present with userspace, so
> even if other filesystems don't need exactly the same detail of
> information, they still need to supply a subset of that same
> information to userspace.
> 
> The ext4-based proposals keep mentioning dumping text strings and
> encoded structures that are ext4 error specific, instead of starting
> from a generic notification structure that defines the object in the
> filesystem and location within the object that the notification is
> for. e.g the {bdev, object, offset, length} object ID tuple I
> mention here:

TBH I don't find the text strings and whatnot all that useful for
automated recovery.  Freeform text is easy enough to grok from
/var/log/kernel, but as for a program I'd rather have an event structure
that more or less prompts me for what to do. :)

> https://lore.kernel.org/linux-ext4/20201210220914.GG4170059@dread.disaster.area/
> 
> For XFS, we want to be able to hook up the verifier error reports
> to a notification. We want to be able to hook all our corruption
> reports to a notification. We want to be able to hook all our
> writeback errors to a notification. We want to be able to hook all
> our ENOSPC and EDQUOT errors to a notification. And that's just the
> obvious stuff that notifications are useful for.
> 
> If you want an idea of all the different types of metadata objects
> we need to have different notifications for, look at the GETFSMAP
> ioctl man page. It lists all the different types of objects we are
> likely to emit notifications for from XFS (e.g. free space
> btree corruption at record index X to Y) because, well, that's the
> sort of information we're already dumping to the kernel log....
>
> Hence from a design perspective, we need to separate the contents of
> the notification from the mechanism used to configure, filter and
> emit notifications to userspace.  That is, it doesn't matter if we
> add a magic new syscall or use fanotify to configure watches and
> transfer messages to userspace, the contents of the message are going
> to be exactly the same, and the API that the filesystem

(Yep)

> implementations are going to call to emit a notification to
> userspace is exactly the same.
> 
> So a generic message structure looks something like this:
> 
> <notification type>		(msg structure type)
> <notification location>		(encoded file/line info)
> <object type>			(inode, dir, btree, bmap, block, etc)
> <object ID>			{bdev, object}
> <range>				{offset, length} (range in object)
> <notification version>		(notification data version)
> <notification data>		(filesystem specific data)

/me notes that the first six fields are specific enough that an xfs_scrub
daemon would be able to figure out what repair calls to make, so you
could very well go with this structure instead of the things I rambled
about elsewhere in this thread.
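
To make that concrete, a back-of-the-envelope sketch of the translation
(the decoded fs_err_notification struct and the OBJ_* codes are
hypothetical stand-ins for the generic fields quoted above; struct
xfs_scrub_metadata and XFS_IOC_SCRUB_METADATA are the real scrub ioctl
from the xfsprogs headers, though online repair is still a future feature,
so the REPAIR flag may simply be rejected today):

#include <stdint.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* struct xfs_scrub_metadata, XFS_IOC_SCRUB_METADATA */

/* hypothetical decoded form of the generic notification */
struct fs_err_notification {
	uint32_t object_type;	/* inode, AG free space btree, ... */
	uint64_t object_id;	/* inode number or AG number */
	uint32_t object_gen;	/* inode generation, if applicable */
};

enum { OBJ_INODE = 1, OBJ_AG_FREESPACE = 2 };	/* hypothetical codes */

static int kick_scrub(int fsfd, const struct fs_err_notification *n)
{
	struct xfs_scrub_metadata sm = {
		.sm_flags = XFS_SCRUB_IFLAG_REPAIR,
	};

	switch (n->object_type) {
	case OBJ_INODE:
		sm.sm_type = XFS_SCRUB_TYPE_INODE;
		sm.sm_ino = n->object_id;
		sm.sm_gen = n->object_gen;
		break;
	case OBJ_AG_FREESPACE:
		sm.sm_type = XFS_SCRUB_TYPE_BNOBT;
		sm.sm_agno = n->object_id;
		break;
	default:
		return -1;	/* punt to the fs-specific blob or offline repair */
	}

	/* fsfd is any open fd on the filesystem, e.g. the mount point */
	return ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm);
}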

Though I do kinda wonder about things like btrfs where you can have
raid1 metadata; would you want to be able to know that the chunk tree on
/dev/sda1 is bad?  Or is it enough to know that a chunk tree is bad
since the remedy (at least in terms of poking the kernel) is the same?

(I don't know since I don't grok btrfs)

> The first six fields are generic and always needed and defined by
> the kernel/user ABI (i.e. fixed forever in time). The notification
> data is the custom message information from the filesystem, defined
> by the filesystem, and is not covered by kernel ABI requirements
> (similar to current dmesg output). That message could be a string,
> an encoded structure, etc, but it's opaque to the notification layer
> and just gets dumped to userspace in the notification. Generic tools
> can parse the generic fields to give basic notification information,
> debug/diagnostic tools can turn that fs specific information into
> something useful for admins and support engineers.
> 
> IOWs, there is nothing that needs to be ext4 or XFS specific in the
> notification infrastructure, just enough generic information for
> generic tools to be useful, and a custom field for filesystem
> specific diagnostic information to be included in the notification.

> At this point, we just don't care about where in the filesystem the
> notification is generated - the notification infrastructure assigns
> the errors according to the scope the filesystem maps the object
> type to. e.g. if fanotify is the userspace ABI, then global metadata
> corruptions go to FA_MARK_FILESYSTEM watchers but not FA_MARK_MOUNT.
> The individual operations that get an -EFSCORRUPTED error emits a
> notification to FA_MARK_INODE watchers on that inode. And so on.

<nod>

> If a new syscall is added, then it also needs to be able to scope
> error notifications like this because we would really like to have
> per-directory and per-inode notifications supported if at all
> possible. But that is a separate discussion to the message contents
> and API filesystems will use to create notifications on demand...

Yeah.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09 17:57                   ` Theodore Ts'o
@ 2021-02-10  0:52                     ` Darrick J. Wong
  2021-02-10  2:21                       ` Theodore Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2021-02-10  0:52 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Gabriel Krisman Bertazi, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 12:57:08PM -0500, Theodore Ts'o wrote:
> On Tue, Feb 09, 2021 at 07:55:01PM +1100, Dave Chinner wrote:
> > 
> > That means we have to work together to find common ground and a
> > solution that works for everyone.  What I've suggested allows all
> > filesystems to supply the same information for the same events.  It
> > also allows filesystems to include their own private diagnostic
> > information appended to the generic message, thereby fulfulling both
> > the goals of both David Howells' original patchset and Gabriel's
> > derived ext4 specific patchset.
> 
> So the simple common ground would be a plain text message, which is
> what *I* had suggested.  But I saw you requesting some complex object
> based system which XFS has.
> 
> I think if we want to keep something that is common, it's going to
> have to be kept simple.  Do you not agree?

I definitely don't want to implement string parsing for xfs_scrub.

The kernel already has enough information to fill out the struct
xfs_scrub_metadata structure for userspace in case it decides to repair.

(HA, maybe that should be the notification format for xfs metadata :P)

--D

>      	   		    	    - Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-10  0:52                     ` Darrick J. Wong
@ 2021-02-10  2:21                       ` Theodore Ts'o
  2021-02-10  2:32                         ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2021-02-10  2:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Gabriel Krisman Bertazi, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 04:52:07PM -0800, Darrick J. Wong wrote:
> 
> I definitely don't want to implement string parsing for xfs_scrub.
> 
> The kernel already has enough information to fill out the struct
> xfs_scrub_metadata structure for userspace in case it decides to repair.

I suspect anything which is specific enough for xfs_scrub is going to
be *significantly* file system dependent.  And in the past, attempts
to do some kind of binary tag-length-value scheme have generally gotten
vetoed.  If it's just "there's something wrong with inode number 42",
that's one thing.  But if it's, "position 37 of the 3rd btree interior
node is out of order", how in the world is that supposed to be
generic?

Taking a step back, it seems that there are a number of different use
cases that people have been discussing on this thread.  One is
"there's something wrong with the file system", so the file system
should be taken off-line for repair and/or fail over the primary
server to the secondary/backup server.

Yet another use case might be "the file system is getting full, so
please expand the LVM and resize the file system".  (This is a bit
more complex since different system administrators and different
systems may want different trigger points, from 75%, 80%, 95%, or
"we've already delivered ENOSPC to the application so there are
user-visible failures", but presumably this can be configured, with
perhaps some fixed number of configured trigger points per file
system.)

Both of these are fairly generic, and don't require exposing a file
system's detailed, complex object hierarchies to userspace.  But if
the goal is pushing out detailed metadata information sufficient for
on-line file system repair, that seems to be both a massive case of
scope creep, and very much file system specific, and not something
that could be easily made generic.

If the XFS community believes it can be done, perhaps you or Dave
could come up with a specific proposal, as opposed to claiming that
Gabriel's design is inadequate because it doesn't meet XFS's use
case?  And perhaps this is a sufficiently different set of
requirements that it doesn't make sense to try to have a single design
that covers all possible uses of a "file system notification system".

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-10  2:21                       ` Theodore Ts'o
@ 2021-02-10  2:32                         ` Darrick J. Wong
  0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2021-02-10  2:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Gabriel Krisman Bertazi, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 09:21:40PM -0500, Theodore Ts'o wrote:
> On Tue, Feb 09, 2021 at 04:52:07PM -0800, Darrick J. Wong wrote:
> > 
> > I definitely don't want to implement string parsing for xfs_scrub.
> > 
> > The kernel already has enough information to fill out the struct
> > xfs_scrub_metadata structure for userspace in case it decides to repair.
> 
> I suspect anything which is specific enough for xfs_scrub is going to
> be *significantly* file system dependent.  And in the past, attempts
> to do some kind of binary tag-length-value scheme have generally gotten
> vetoed.  If it's just "there's something wrong with inode number 42",
> that's one thing.  But if it's, "position 37 of the 3rd btree interior
> node is out of order", how in the world is that supposed to be
> generic?
> 
> Taking a step back, it seems that there are a number of different use
> cases that people have been discussing on this thread.  One is
> "there's something wrong with the file system", so the file system
> should be taken off-line for repair and/or fail over the primary
> server to the secondary/backup server.
> 
> Yet another use case might be "the file system is getting full, so
> please expand the LVM and resize the file system".  (This is a bit
> more complex since different system administrators and different
> systems may want different trigger points, from 75%, 80%, 95%, or
> "we've already delivered ENOSPC to the application so there are
> user-visible failures", but presumably this can be configured, with
> perhaps some fixed number of configured trigger points per file
> system.)
> 
> Both of these are fairly generic, and don't require exposing a file
> system's detailed, complex object hierarchies to userspace.  But if
> the goal is pushing out detailed metadata information sufficient for
> on-line file system repair, that seems to be both a massive case of
> scope creep, and very much file system specific, and not something
> that could be easily made generic.
> 
> If the XFS community believes it can be done, perhaps you or Dave
> could come up with a specific proposal, as opposed to claiming that
> Gabriel's design is inadequate because it doesn't meet XFS's use
> case?  And perhaps this is a sufficiently different set of
> requirements that it doesn't make sense to try to have a single design
> that covers all possible uses of a "file system notification system".

I did in [1], but vger is being slow and hasn't delivered it to me yet.
At least they try to hit lore first now, I guess?

--D

[1] https://lore.kernel.org/linux-fsdevel/20210210000932.GH7190@magnolia/T/#m268c5a244cb7bd749b9c029e1d7c3f00d194b181

> Regards,
> 
> 					- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-10  0:09 ` Darrick J. Wong
@ 2021-02-10  7:23   ` Amir Goldstein
  2021-02-10 23:29   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 28+ messages in thread
From: Amir Goldstein @ 2021-02-10  7:23 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Gabriel Krisman Bertazi, Theodore Tso, Jan Kara, Al Viro,
	David Howells, Dave Chinner, Darrick J. Wong, Khazhismel Kumykov,
	linux-fsdevel, kernel

On Wed, Feb 10, 2021 at 2:09 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi Gabriel,
>
> Sorry that it has taken me nearly three weeks to respond to this; I've
> been buried in XFS refactoring for 5.12. :(
>
> I forget if I've ever really laid out the user stories (or whatever)
> that I envision for XFS.  I think they're pretty close to what you've
> outlined for ext4, downthread discussions notwithstanding. :/
>
> The first usecase is, I think, the common one where a program has an
> open file descriptor and wants to know when some kind of error happens
> to that open file or its metadata so that it can start application-level
> recovery processes.
>
> I think the information exported by struct fanotify_event_error_hdr
> corresponds almost exactly to this use case since it only identifies an
> error, an inode, and an offset, which are (more or less) general
> concepts inside the vfs, and probably could be accomplished without help
> from individual filesystems.
>
> I would add a u32 context code so that programs can distinguish between
> errors in file data, xattrs, inode metadata, or a general filesystem
> error, but otherwise it looks fine to me.  I don't think we need to (or
> should) report function names and line numbers here.
>
> (Someone would have to make fanotify work for non-admin processes
> though, which if that fails, makes this part less generally useful.)
>

This is trivial.  Lifting the CAP_SYS_ADMIN limitation is only a matter of
auditing the impact of an fanotify watch and the information it exposes.
I'd already posted simple patches to allow a non-admin fanotify watch with
some restrictions (e.g. mark on a file/directory, no permission events) [1].
It's even simpler to start by using this method to allow non-admin access
to a new event type (FAN_FILE_ERROR) whose impact and exposed information
are well known.

[1]  https://lore.kernel.org/linux-fsdevel/20210124184204.899729-3-amir73il@gmail.com/

> -----
>
> The second usecase is very filesystem specific and administrative in
> nature, and I bet this would be where our paths diverge.  But that's
> probably ok, since exposing the details of an event requires a client
> that's tightly integrated with the fs implementation details, which
> pretty much means a program that we ship in {x,e2,*}fsprogs.
>
> Over the last couple of years, XFS has grown ioctl interfaces for
> reporting corruption errors to administrative programs and for xfs_scrub
> to initiate checks of metadata structures.  Someday we'll be able to
> perform repairs online too.
>
> The metadata corruption reporting interfaces are split into three
> categories corresponding to principal fs objects.  In other words, we
> can report on the state of file metadata, allocation group metadata, or
> full filesystem metadata.  So far, each of these three groups has
> sufficiently few types of metadata that we can summarize what's wrong
> with a u32 bitmap.
>
> For XFS, the following would suffice for a monitoring daemon that could
> become part of xfsprogs:
>
> struct notify_xfs_corruption_error {
>         __kernel_fsid_t fsid;
>
>         __u32 group; /* FS, AG, or INODE */
>         __u32 sick; /* bitset of XFS_{AG,FSOP,BS}_GEOM_SICK_* */
>         union {
>                 struct {
>                         __u64 inode;
>                         __u32 gen;
>                 };
>                 __u32 agno;
>         };
> };
>
> (A real implementation would include a flags field and not have a union
> in the middle of it, but this is the bare minimum of what I think I
> want for xfs_scrubd having implemented zero of this.)
>
> Upon receipt of this structure, the monitoring daemon can translate
> those three fields into calls into the [future] online repair ioctls and
> fix the filesystem.  Or it can shut down the fs and kill everything
> running on it, I dunno. :)
>
> There would only be one monitoring daemon running for the whole xfs
> filesystem, so you could require CAP_SYS_ADMIN and FAN_MARK_MOUNT to
> prove that the daemon can get to the actual filesystem root directory.
> IOWs, the visibility semantics can be "only the sysadmin and only in the
> init namespace" initially.

That is not what FAN_MARK_MOUNT means. FAN_MARK_MOUNT
subscribes to events that happen only in a specific mount, so
any events such as those listed above are not adequate.

As you know, if you have CAP_SYS_ADMIN and access to any directory
(not only to root), you can pretty much stroll anywhere in the filesystem
using open_by_handle_at(), so the existing fanotify restrictions of
CAP_SYS_ADMIN in init userns are exactly what is needed for the fs
monitor and it should use FAN_MARK_FILESYSTEM.

>
> This structure /could/ include an instruction pointer for more advanced
> reporting, but it's not a hard requirement to have such a thing.  As far
> as xfs is concerned, something decided the fs was bad, and the only
> thing to do now is to recover.  I don't think it matters critically
> whether the notices are presented via fanotify or watch_queue.
>
> The tricky problem here is (I think?) how to multiplex different
> filesystem types wanting to send corruption reports to userspace.  I
> suppose you could define the fs metadata error format as:
>
>         [struct fanotify_event_metadata]
>         [optional struct fanotify_event_info_fid]
>         [u32 magic code corresponding to statvfs.f_type?]

Hmm. I can't say I like this.
We are talking about a live notification, right?
In that case, fsid should be enough to figure out the filesystem
that currently occupies this fsid.

fanotify_event_info_fid info is basically worthless without
knowing where a mounted filesystem can be found, because
the objects are encoded as file handles and open_by_handle_at()
needs a mount point to resolve those handles to object paths.

>         [fs-specific blob of data here]
>
> And then you'd use fanotify_event_metadata.event_len to figure out the
> length of the fs-specific blob.  That way XFS could export the short
> structure I outlined above, and ext4 can emit instruction pointer
> addresses or strings or whatever else you and Ted settle on.
>
> If that sounds like "Well you go plumb in the fanotify bits with just
> enough information for dispatching and then we'll go write our own xfs
> specific thing"... yep. :)
>
> To be clear: I'm advocating for cleaving these two usescases completely
> apart, and not mixing them at all like what you defined below, because I
> now think these are totally separate use cases.
>

It's worth mentioning that the two use cases also correspond to the two
different writeback error states (file and fs) that can currently be "polled"
with fsync() and syncfs().
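
To make the "polling" concrete: since the errseq_t work, fsync() reports a
writeback error roughly once per open file description, and since v5.8
(IIRC) syncfs() does the same per superblock, so a minimal poller looks
roughly like this:

#define _GNU_SOURCE	/* for syncfs() */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd = argc > 1 ? open(argv[1], O_RDONLY) : -1;

	if (fd < 0)
		return 1;

	/* file scope: did writeback of this file fail since we opened it? */
	if (fsync(fd) < 0)
		fprintf(stderr, "file writeback error: %s\n", strerror(errno));

	/* fs scope: did any writeback on this superblock fail? */
	if (syncfs(fd) < 0)
		fprintf(stderr, "fs-wide writeback error: %s\n", strerror(errno));

	close(fd);
	return 0;
}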

> I don't want to get too far into the implementation details, but FWIW
> XFS maintains its health state tracking in the incore data structures,
> so it's no problem for us to defer the actual fsnotify calls to a
> workqueue if we're in an atomic context.
>
> Ok, now I'll go through this and respond point by point.
>
> On Wed, Jan 20, 2021 at 05:13:15PM -0300, Gabriel Krisman Bertazi wrote:
> >
> > My apologies for the long email.
> >
> > Please let me know your thoughts.
> >
> > 1 Summary
> > =========
> >
> >   I'm looking for a filesystem-agnostic mechanism to report filesystem
> >   errors to a monitoring tool in userspace.  I experimented first with
> >   the watch_queue API but decided to move to fanotify for a few reasons.
> >
> >
> > 2 Background
> > ============
> >
> >   I submitted a first set of patches, based on David Howells' original
> >   superblock notifications patchset, that I expanded into error
> >   reporting and had an example implementation for ext4.  Upstream review
> >   has revealed a few design problems:
> >
> >   - Including the "function:line" tuple in the notification allows the
> >     uniquely identification of the error origin but it also ties the
> >     decodification of the error to the source code, i.e. <function:line>
> >     is expected to change between releases.
> >
> >   - Useful debug data (inode number, block group) have formats specific
> >     to the filesystems, and my design wouldn't be expansible to
> >     filesystems other than ext4.
>
> Yes, hence proposing one set of generic notifications for most user
> programs, and a second interface for fs-specific daemons that we can
> ship in ${fs}progs.  My opinions have shifted a bit since the last
> posting.
>
> >   - The implementation allowed an error string description (equivalent
> >     to what would be thrown in dmesg) that is too short, as it needs to
> >     fit in a single notification.
> >
> >   - How the user sees the filesystem.  The original patch points to a
> >     mountpoint but uses the term superblock.  This is to differentiate
> >     from another mechanism in development to watch mounting operations.
> >
> >   - Visibility of objects.  A bind mount of a subtree shouldn't receive
> >     notifications of objects outside of that bind mount.
> >
> >
> > 2.1 Other constraints of the watch_queue API
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   watch_queue is a fairly new framework, which has one upstream user in
> >   the keyring subsystem.  watch_queue is designed to submit very short
> >   (max of 128 bytes) atomic notifications to userspace in a fast manner
> >   through a ring buffer.  There is no current mechanism to link
> >   notifications that require more than one slot and such mechanism
> >   wouldn't be trivial to implement, since buffer overruns could
> >   overwrite the beginning/end of a multi part notification.  In
>
> <nod> This second round of iteration in my head showed me that the two
> event notification use cases are divergent enough that the tagged
> notification scheme that I think I triggered last time isn't necessary
> at all.
>
> >   addition, watch_queue requires an out-of-band overflow notification
> >   mechanism, which would need to be implemented aside from the system
> >   call, in a separate API.
>
> Yikes.
>
> > 2.2 fanotify vs watch_queue
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   watch_queue is designed for efficiency and fast submission of a large
> >   number of notifications.  It doesn't require memory allocation in the
> >   submission path, but with a flexibility cost, such as limited
> >   notification size.  fanotify is more flexible, allows for larger
> >   notifications and better filtering, but requires allocations on the
> >   submission path.
> >
> >   On the other hand, fanotify already has support for the visibility
> >   semantics we are looking for. fsnotify allows an inode to notify only
> >   its subtree, mountpoints, or the entire filesystem, depending on the
> >   watcher flags, while an equivalent support would need to be written
> >   from scratch for watch_queue.  fanotify also has in-band overflow
> >   handling, already implemented.  Finally, since fanotify supports much
> >   larger notifications, there is no need to link separate notifications,
> >   preventing buffer overruns from erasing parts of a notification chain.
> >
> >   fanotify is based on fsnotify, and the infrastructure for the
> >   visibility semantics is mostly implemented by fsnotify itself.  It
> >   would be possible to make error notifications a new mechanism on top
> >   of fsnotify, without modifying fanotify, but that would require
> >   exposing a very similar interface to userspace, new system calls, and
> >   that doesn't seem justifiable since we can extend fanotify syscalls
> >   without ABI breakage.
>
> <nod> AFAICT it sounds like fanotify is a good fit for that first
> usecase I outlined.  It'd probably work for both.
>
> > 3 Proposal
> > ==========
> >
> >   The error notification framework is based on fanotify instead of
> >   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
> >   exposed through the fanotify_mark(2) API.
> >
> >   fanotify (fsnotify-based) has the infrastructure in-place to link
> >   notifications happening at filesystem objects to mountpoints and to
> >   filesystems, and already exposes an interface with well defined
> >   semantics of how those are exposed to watchers in different
> >   mountpoints or different subtrees.
> >
> >   A new message format is exposed, if the user passed
> >   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
> >   don't have FID/DFID records.
> >
> >   A FAN_REPORT_DETAILED_ERROR record has the same struct
> >   fanotify_event_metadata header, but it is followed by one or more
> >   additional information record as follows:
> >
> >   struct fanotify_event_error_hdr {
> >       struct fanotify_event_info_header hdr;
> >       __u32 error;
> >         __u64 inode;
> >         __u64 offset;
> >   }
> >
> >   error is a VFS generic error number that can notify generic conditions
> >   like EFSCORRUPT. If hdr.len is larger than sizeof(struct
> >   fanotify_event_error_hdr), this structure is followed by an optional
> >   filesystem specific record that further specifies the error,
> >   originating object and debug data. This record has a generic header:
> >
> >   struct fanotify_event_fs_error_hdr {
> >       struct fanotify_event_error_hdr hdr;
> >         __kernel_fsid_t fsid;
> >         __u32 fs_error;
> >   }
> >
> >   fs_error is a filesystem specific error record, potentially more
> >   detailed than fanotify_event_error.hdr.error . Each type of filesystem
>
> Er... is fs_error supposed to be a type code that tells the program that
> the next byte is the start of a fanotify_event_ext4_inode_error
> structure?
>
> >   error has its own record type, that is used to report different
> >   information useful for each type of error.  For instance, an ext4
> >   lookup error, caused by an invalid inode type read from disk, produces
> >   the following record:
> >
> >   struct fanotify_event_ext4_inode_error {
> >       struct fanotify_event_fs_error_hdr hdr;
> >         __u64 block;
> >         __u32 inode_type;
> >   }
> >
> >   The separation between VFS and filesystem-specific error messages has
> >   the benefit of always providing some information that an error has
> >   occurred, regardless of filesystem-specific support, while allowing
>
> Ok, so I think we've outlined similar-ish proposals.
>
> >   capable filesystems to expose specific internal data to further
> >   document the issue.  This scheme leaves to the filesystem to expose
> >   more meaningful information as needed.  For instance, multi-disk
> >   filesystems can single out the problematic disk, network filesystems
> >   can expose a network error while accessing the server.
> >
> >
> > 3.1 Visibility semantics
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   Error reporting follows the same semantics of fanotify events.
> >   Therefore, a watcher can request to watch the entire filesystem, a
> >   single mountpoint or a subtree.
> >
> >
> > 3.2 security implications
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   fanotify requires CAP_SYS_ADMIN.  My understanding is this requirement
> >   suffices for most use cases but, according to fanotify documentation,
> >   it could be relaxed without issues for the existing fanotify API.  For
> >   instance, watching a subtree could be allowed for a user who owns
> >   that subtree.
> >
> >
> > 3.3 error location
> > ~~~~~~~~~~~~~~~~~~
> >
> >   While exposing the exact line of code that triggered the notification
> >   ties that notification to a specific kernel version, it is an
> >   important information for those who completely control their
> >   environment and kernel builds, such as cloud providers.  Therefore,
> >   this proposal includes a mechanism to optionally include in the
> >   notification the line of code where the error occurred
> >
> >   A watcher who passes the flag FAN_REPORT_ERROR_LOCATION to
> >   fanotify_init(2) receives an extra record for FAN_ERROR events:
> >
> >   struct fanotify_event_fs_error_location {
> >       struct fanotify_event_info_header hdr;
> >         u32 line;
> >         char function[];
> >   }
> >
> >   This record identifies the place where the error occurred.  function is
> >   a VLA whose size extends to the end of the region delimited by hdr.len.
> >   This VLA text-encodes the function name where the error occurred.
> >
> > What do you think about this interface?  Would it be acceptable as part
> > of fanotify, or should it be a new fsnotify mode?
>
> I would say separate FAN_FILE_ERROR and FAN_FS_ERROR mode bits.
>

<nod> not only that.
With fanotify users can:
1. Subscribe to FAN_FILE_ERROR for a specific file (non-admin)
2. Subscribe to FAN_FS_ERROR for an entire fs (admin)
3. Subscribe to FAN_FILE_ERROR for an entire fs (admin)
AND
4. Subscribe to FAN_FILE_ERROR for an entire fs (admin),
    except for a specific file using an ignore mark on that file [*]

[*] Currently inode marks ihold() the inode, but I have a patch
     to make this optional which is needed in this case

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-09 17:35               ` Jan Kara
  2021-02-10  0:22                 ` Darrick J. Wong
@ 2021-02-10  7:46                 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2021-02-10  7:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Gabriel Krisman Bertazi, Theodore Ts'o, jack, viro, amir73il,
	dhowells, darrick.wong, khazhy, linux-fsdevel, kernel

On Tue, Feb 09, 2021 at 06:35:43PM +0100, Jan Kara wrote:
> On Tue 09-02-21 09:19:16, Dave Chinner wrote:
> > On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> > > "Theodore Ts'o" <tytso@mit.edu> writes:
> > For XFS, we want to be able to hook up the verifier error reports
> > to a notification. We want to be able to hook all our corruption
> > reports to a notification. We want to be able to hook all our
> > writeback errors to a notification. We want to be able to hook all
> > our ENOSPC and EDQUOT errors to a notification. And that's just the
> > obvious stuff that notifications are useful for.
> 
> I agree with you here but I'd like to get the usecases spelled out to be
> able to better evaluate the information we need to pass. I can imagine for
> ENOSPC errors this can be stuff like thin provisioning sending red alert to
> sysadmin - this would be fs-wide event. I have somewhat hard time coming up
> with a case where notification of ENOSPC / EDQUOT for a particular file /
> dir would be useful.

An example is containers that the admins configure with project
quotas as directory quotas so that individual containers have their
own independent space accounting and enforcement by the host.  Apps
inside the container want to monitor for their own ENOSPC events
(triggered by project quota EDQUOT) instead of the filesystem wide
ENOSPC.

> I can see a usecase where an application wishes to monitor all its files /
> dirs for any type of fatal error (ENOSPC, EDQUOT, EIO).

*nod*

We also have cluster level management tools wanting to know about
failure events inside data stores that they hand out to containers
and/or guests. That's where things like corruption reports come in -
being able to flag errors at the management interface that something
went wrong with the filesystem used by container X, with some level
of detail of what actually got damaged (e.g. file X at offset Y for
length Z is bad).

> Here scoping
> makes a lot of sense from application POV. It may be somewhat tricky to
> reliably provide the notification though. If we, say, spot an inconsistency in
> block allocation structure during page writeback (be it btree in XFS case
> or bitmap in ext4 case), we report the error there in the code for that
> structure but that is not necessarily aware of the inode so we need to make
> sure to generate another notification in upper layers where we can associate
> the error with the inode as well.

Yes, that's what we already do in XFS. The initial corruption
detection site generates the corruption warning, and then if higher
layers can't back out because the fs is in an unrecoverable state,
then shutdown and more error messages are generated. There are
multiple levels of warnings/error messages in filesystems; I thought
that was pretty clear to everyone, so I'm really very surprised that
nobody is thinking that notifications have different scopes, levels
and meanings, just like the messages we send to syslog do....

Indeed, once the filesystem is in a global shutdown or error state,
we don't emit further corruption errors, so we wouldn't emit further
error notifications, either.

Essentially, we're not talking about anything new here - this is
already how we use the syslog for corruption and shutdown reporting.
I'm not sure why using a "notification" instead of a "printk()"
seems to make people think this is an unsolvable problem, because we
have already solved it....

> Even worse if we spot some error e.g. during
> journal commit, we (at least in ext4 case) don't have enough information to
> trace back affected inodes anymore.

Failure in the journal is a fatal error, and we shut down. That
generates the shutdown notification, and we don't emit anything else
once the shutdown is complete. Further analysis is up to the admin,
not the notification subsystem.

> So how do we handle such cases? Do we
> actively queue error notifications for all inodes? Or do we lazily wait for
> some operation associated with a particular inode to fail to queue
> notification? I can see pros and cons for both...

I'd say that you're vastly over complicating the problem. There is
no point in generating a notification storm from the filesystem once
a fatal error has already been tripped over and the filesystem shut
down.  We don't flood the syslog like this, and we shouldn't flood
the system with unnecessary notifications, either.

This implies that "fatal error" notifications should probably be
broadcast over all "error" watches on that filesystem, regardless of their
scope, because the filesystem is basically saying "everything has
failed". And then no further error notifications are generated,
because everything has already been told "it's broken".

But, really, that's a scoping discussion, not a use case....

> What usecases you had in mind?

Data loss events being reported to userspace so desktop
notifications can be raised. Or management interface notifications
can be raised. Or repair utilities can determine if the problem can
be fixed automatically.

I mean, that's the whole sticking point with DAX+reflink - being
able to reverse map the physical storage to the user data so that
when the storage gets torched by an MCE we can do the right thing.
And part of that "right thing" is notifying the apps and admins that
their data just went up in a cloud of high energy particles...

Then there's stuff that is indicative of imminent failure.
Notification of transient errors during metadata operations, the
number of retries before success, when we end up permanently
retrying writes because the storage is actually toast so unmount
will eventually fail, etc.

When there is a filesystem health status change. Notification that a
filesystem's capacity has changed (e.g. grow/shrink). Notification
that a filesystem has been frozen. That allocation groups are
running low on space, that we are out of inode space, the reserve
block pool has been depleted, etc.

IOWs, storage management and monitoring is a common case I keep
hearing about. I hear more vague requirements from higher level
application stacks (cloudy stuff) that they need stuff like
per-container space management and notifications. But the one thing
that nobody wants to do is scrape and/or parse text messages.

Another class of use case is applications being able to monitor
their files for writeback errors, with such notifications containing
the inode, offset and length where the failure occurred so
that the actual data loss can be dealt with (e.g. by rewriting the
data) before the application has removed it from its write buffers.
Right now we have no way to tell the user application where the
writeback error occurred, just that EIO happened -somewhere- at
-some time in the past- when they next do something with the data...

> > If you want an idea of all the different types of metadata objects
> > we need to have different notifications for, look at the GETFSMAP
> > ioctl man page. It lists all the different types of objects we are
> > likely to emit notifications for from XFS (e.g. free space
> > btree corruption at record index X to Y) because, well, that's the
> > sort of information we're already dumping to the kernel log....
> > 
> > Hence from a design perspective, we need to separate the contents of
> > the notification from the mechanism used to configure, filter and
> > emit notifications to userspace.  That is, it doesn't matter if we
> > add a magic new syscall or use fanotify to configure watches and
> > transfer messages to userspace, the contents of the message is going
> > to be the exactly the same, and the API that the filesystem
> > implementations are going to call to emit a notification to
> > userspace is exactly the same.
> > 
> > So a generic message structure looks something like this:
> > 
> > <notification type>		(msg structure type)
> > <notification location>		(encoded file/line info)
> > <object type>			(inode, dir, btree, bmap, block, etc)
> > <object ID>			{bdev, object}
> > <range>				{offset, length} (range in object)
> > <notification version>		(notification data version)
> > <notification data>		(filesystem specific data)
> 
> There's a caveat though that 'object type' is necessarily filesystem
> specific and with new filesystems wanting to support this we'll likely need
> to add more object types. So it is questionable how a "generic error parser"
> would be able to use this type of information and whether this doesn't need
> to be in the fs-specific blob.

Well, there are only so many generic types. If we start with the
basic ones such as "regular file", "directory", "user data extent"
and "internal metadata", we cover most bases. That's the reason I
said "filesystem specific diagnostic data can follow the generic
message". This allows the filesystem to say "Fatal internal metadata
error" to userspace and then in its custom field say "journal write
IO error at block XYZ".

Monitoring tools don't need to know it was a journal error, the
context they react to is "fatal error". Whoever (or whatever) is
tasked with responding to that error can then look at the diagnostic
information supplied with the notification. i.e.:

Severity: fatal
Scope: global
Type: Internal metadata
Object: Journal
Location: <bdev>
Range: <extent>
Error: ENODEV
Diagnostic data: <Write error ENODEV in journal at block XYZ>

A data loss event would indicate that a data extent went bad,
identify it as belonging to inode X at offset Y, length Z. i.e.

Severity: Data Corruption
Scope: inode
Type: User Data
Object: <Extent>
Location: <inode>
Range: <logical offset, length>
Diagnostic data: "writeback failed at inode X, offset Y, len Z due to ENOSPC from bdev"

If the data corruption happens in the inode metadata (e.g. the block
map), the event would be a little different:

Severity: Data Corruption
Scope: inode
Type: Internal Metadata
Object: <Extent>
Location: <inode>
Range: <logical offset, length>
Diagnostic data: "BMBT block at block XYZ failed checksum, cannot read extent records"

So they tell userspace the same thing, but the actual details of the
cause of the data loss over that range of the file are quite
different.

Perhaps a space usage event:

Severity: Information
Scope: Global
Type: Capacity
Total space: X
Available space: Y

Or a directory quota warning:

Severity: Warning
Scope: Project Quota
Type: Low Capacity
Object: <project id>
Total space: X
Available space: Y

IOWs, the notification message header is nothing but a
classification scheme that the notification scoping subsystem uses
for filtering and distribution. If we just stick to the major
objects a filesystem exposes to users (regular files, directories,
extended attributes, quota, capacity and internal metadata) and
important events (corruption, errors and emergency actions) then we
cover most of what all filesystems are going to need to tell
userspace.
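
For concreteness, a minimal sketch of how such a classification header could
be laid out as a C struct.  All of the names below are made up purely for
illustration and simply mirror the fields used in the example events above;
the fs-specific diagnostic data trails the fixed header:

  struct fs_event_hdr {
          __u32 severity;     /* information, warning, corruption, fatal */
          __u32 scope;        /* global, inode, project quota, ... */
          __u32 object_type;  /* file, dir, user data, internal metadata */
          __u32 error;        /* errno-style code, e.g. ENODEV, ENOSPC */
          __u64 object_id;    /* inode number, project id, bdev, ... */
          __u64 offset;       /* start of the affected range in the object */
          __u64 length;       /* length of the affected range */
          __u32 version;      /* layout version of the fs-specific payload */
          __u32 data_len;     /* bytes of fs-specific diagnostic data */
          __u8  data[];       /* e.g. "journal write error at block XYZ" */
  };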

> Also versioning fs specific blobs with 'notification version' tends to get
> somewhat cumbersome if you need to update the scheme, thus bump the
> version, which breaks all existing parsers (and I won't even speak about
> the percentage of parsers that won't bother with checking the version and
> just blindly try to parse whatever they get assuming incorrect things ;).
> We've been there more than once... But this is more of a side remark - once
> other problems are settled I believe we can come up with reasonably
> extensible scheme for blob passing pretty easily.

Yup, I just threw it in there because we need to ensure that the
message protocol format is both extensible and revocable. We will
make mistakes, but we can also ensure we don't have to live with
those mistakes forever.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Filesystem error notifications proposal
  2021-02-10  0:09 ` Darrick J. Wong
  2021-02-10  7:23   ` Amir Goldstein
@ 2021-02-10 23:29   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 28+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-02-10 23:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: tytso, jack, viro, amir73il, dhowells, david, darrick.wong,
	khazhy, linux-fsdevel, kernel

"Darrick J. Wong" <djwong@kernel.org> writes:

> Hi Gabriel,
>
> Sorry that it has taken me nearly three weeks to respond to this; I've
> been buried in XFS refactoring for 5.12. :(

Hi Darrick,

Thanks for the description of both use cases. It helped a lot to clarify my
thoughts.

> This structure /could/ include an instruction pointer for more advanced
> reporting, but it's not a hard requirement to have such a thing.  As far
> as xfs is concerned, something decided the fs was bad, and the only
> thing to do now is to recover.  I don't think it matters critically
> whether the notices are presented via fanotify or watch_queue.
>
> The tricky problem here is (I think?) how to multiplex different
> filesystem types wanting to send corruption reports to userspace.  I
> suppose you could define the fs metadata error format as:
>
> 	[struct fanotify_event_metadata]
> 	[optional struct fanotify_event_info_fid]
> 	[u32 magic code corresponding to statvfs.f_type?]
> 	[fs-specific blob of data]

This seems similar to what I proposed, minus the first use case I was
integrating.

> And then you'd use the fanotify_event_metadata.event_len to figure out the
> length of the fs-specific blob.  That way XFS could export the short
> structure I outlined above, and ext4 can emit instruction pointer
> addresses or strings or whatever else you and Ted settle on.
>
> If that sounds like "Well you go plumb in the fanotify bits with just
> enough information for dispatching and then we'll go write our own xfs
> specific thing"... yep. :)
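
A rough illustration of that dispatch step, assuming the u32 magic above is
the statfs f_type value.  XFS_SUPER_MAGIC and EXT4_SUPER_MAGIC come from
<linux/magic.h>; the fs-aware decoders are hypothetical placeholders:

  #include <stddef.h>
  #include <linux/magic.h>

  static void dispatch_error_blob(unsigned int f_type, const void *blob,
                                  size_t len)
  {
          (void)blob; (void)len;   /* decoders omitted in this sketch */

          switch (f_type) {
          case XFS_SUPER_MAGIC:
                  /* hand the blob to an XFS-aware decoder */
                  break;
          case EXT4_SUPER_MAGIC:
                  /* hand the blob to an ext4-aware decoder */
                  break;
          default:
                  /* unknown fs: only the generic header is usable */
                  break;
          }
  }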
>
> To be clear: I'm advocating for cleaving these two use cases completely
> apart, and not mixing them at all like what you defined below, because I
> now think these are totally separate use cases.

Yes, this makes a lot of sense.


>>   On the other hand, fanotify already has support for the visibility
>>   semantics we are looking for. fsnotify allows an inode to notify only
>>   its subtree, mountpoints, or the entire filesystem, depending on the
>>   watcher flags, while an equivalent support would need to be written
>>   from scratch for watch_queue.  fanotify also has in-band overflow
>>   handling, already implemented.  Finally, since fanotify supports much
>>   larger notifications, there is no need to link separate notifications,
>>   preventing buffer overruns from erasing parts of a notification chain.
>> 
>>   fanotify is based on fsnotify, and the infrastructure for the
>>   visibility semantics is mostly implemented by fsnotify itself.  It
>>   would be possible to make error notifications a new mechanism on top
>>   of fsnotify, without modifying fanotify, but that would require
>>   exposing a very similar interface to userspace, new system calls, and
>>   that doesn't seem justifiable since we can extend fanotify syscalls
>>   without ABI breakage.
>
> <nod> AFAICT it sounds like fanotify is a good fit for that first
> usecase I outlined.  It'd probably work for both.

The main advantage of the fanotify/fsnotify API, in my opinion, is the
ability to hook a watch to a specific file or to the entire filesystem.
This makes it especially useful for the first use case you explained,
even if some changes are needed to drop the CAP_SYS_ADMIN requirement.

But for the second case, which interests me a bit more, can you clarify
if the same infrastructure for watching specific files is indeed
necessary?  In your use case description, there would be a single daemon
watching the entire filesystem, which would use an equivalent (fanotify
or not) of FAN_MARK_FILESYSTEM. It seems there isn't a need to watch
specific files for the second use case, as long as the events identify
the object that triggered the error. In other words, you always want to
watch the entire filesystem.

If we are indeed treating this second use case separately from the
first, it seems to fit a much simpler model of a filesystem-wide "pipe"
where different parts of the filesystem can dump different errors, no?

watch_queue already provides filtering of specific types of events.  Maybe
we can extend its filtering capabilities?

>> 3 Proposal
>> ==========
>> 
>>   The error notification framework is based on fanotify instead of
>>   watch_queue.  It is exposed through a new set of marks FAN_ERROR_*,
>>   exposed through the fanotify_mark(2) API.
>> 
>>   fanotify (fsnotify-based) has the infrastructure in-place to link
>>   notifications happening at filesystem objects to mountpoints and to
>>   filesystems, and already exposes an interface with well defined
>>   semantics of how those are exposed to watchers in different
>>   mountpoints or different subtrees.
>> 
>>   A new message format is exposed, if the user passed
>>   FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag.  FAN_ERROR messages
>>   don't have FID/DFID records.
>> 
>>   A FAN_REPORT_DETAILED_ERROR record has the same struct
>>   fanotify_event_metadata header, but it is followed by one or more
>>   additional information record as follows:
>> 
>>   struct fanotify_event_error_hdr {
>>   	struct fanotify_event_info_header hdr;
>>   	__u32 error;
>>         __u64 inode;
>>         __u64 offset;
>>   }
>> 
>>   error is a VFS generic error number that can notify generic conditions
>>   like EFSCORRUPT. If hdr.len is larger than sizeof(struct
>>   fanotify_event_error_hdr), this structure is followed by an optional
>>   filesystem specific record that further specifies the error,
>>   originating object and debug data. This record has a generic header:
>> 
>>   struct fanotify_event_fs_error_hdr {
>>   	struct fanotify_event_error_hdr hdr;
>>         __kernel_fsid_t fsid;
>>         __u32 fs_error;
>>   }
>> 
>>   fs_error is a filesystem specific error record, potentially more
>>   detailed than fanotify_event_error.hdr.error.  Each type of filesystem
>
> Er... is fs_error supposed to be a type code that tells the program that
> the next byte is the start of a fanotify_event_ext4_inode_error
> structure?

Yes.  Sorry for the mistake.  fs_error would identify a filesystem-specific
blob (record) that would follow in the same notification.
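
To make that concrete, a rough userspace sketch of how a watcher might walk
these records.  struct fanotify_event_metadata and the FAN_EVENT_OK /
FAN_EVENT_NEXT helpers are existing fanotify API; the error record layout is
only the one proposed above, repeated here so the sketch is self-contained:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/fanotify.h>
  #include <linux/types.h>

  /* Proposed layout from this RFC, not present in any released header. */
  struct fanotify_event_error_hdr {
          struct fanotify_event_info_header hdr;
          __u32 error;
          __u64 inode;
          __u64 offset;
  };

  static void drain_errors(int fanotify_fd)
  {
          char buf[8192];
          ssize_t len = read(fanotify_fd, buf, sizeof(buf));
          struct fanotify_event_metadata *ev = (void *)buf;

          for (; FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
                  struct fanotify_event_error_hdr *err;

                  if (ev->event_len <= ev->metadata_len)
                          continue;    /* no additional info records */

                  err = (void *)((char *)ev + ev->metadata_len);
                  fprintf(stderr, "error %u on inode %llu, offset %llu\n",
                          err->error,
                          (unsigned long long)err->inode,
                          (unsigned long long)err->offset);

                  /* If hdr.len is larger than this struct, a filesystem
                   * specific record (fanotify_event_fs_error_hdr) follows;
                   * dispatch on its fsid/fs_error to an fs-aware decoder. */
          }
  }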

Thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-02-10 23:30 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20 20:13 [RFC] Filesystem error notifications proposal Gabriel Krisman Bertazi
2021-01-21  4:01 ` Viacheslav Dubeyko
2021-01-21 11:44 ` Jan Kara
2021-01-21 13:27   ` Amir Goldstein
2021-01-21 18:56   ` Gabriel Krisman Bertazi
2021-01-21 22:44 ` Theodore Ts'o
2021-01-22  0:44   ` Gabriel Krisman Bertazi
2021-01-22  7:36     ` Amir Goldstein
2021-02-02 20:51       ` Gabriel Krisman Bertazi
2021-01-28 22:28     ` Theodore Ts'o
2021-02-02 20:26       ` Gabriel Krisman Bertazi
2021-02-02 22:34         ` Theodore Ts'o
2021-02-08 18:49           ` Gabriel Krisman Bertazi
2021-02-08 22:19             ` Dave Chinner
2021-02-09  1:08               ` Theodore Ts'o
2021-02-09  5:12                 ` Khazhismel Kumykov
2021-02-09  8:55                 ` Dave Chinner
2021-02-09 17:57                   ` Theodore Ts'o
2021-02-10  0:52                     ` Darrick J. Wong
2021-02-10  2:21                       ` Theodore Ts'o
2021-02-10  2:32                         ` Darrick J. Wong
2021-02-09 17:35               ` Jan Kara
2021-02-10  0:22                 ` Darrick J. Wong
2021-02-10  7:46                 ` Dave Chinner
2021-02-10  0:49               ` Darrick J. Wong
2021-02-10  0:09 ` Darrick J. Wong
2021-02-10  7:23   ` Amir Goldstein
2021-02-10 23:29   ` Gabriel Krisman Bertazi
