From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: tytso@mit.edu, jack@suse.com, viro@zeniv.linux.org.uk,
amir73il@gmail.com, dhowells@redhat.com, david@fromorbit.com,
darrick.wong@oracle.com, khazhy@google.com
Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com
Subject: [RFC] Filesystem error notifications proposal
Date: Wed, 20 Jan 2021 17:13:15 -0300 [thread overview]
Message-ID: <87lfcne59g.fsf@collabora.com> (raw)
My apologies for the long email.
Please let me know your thoughts.
1 Summary
=========
I'm looking for a filesystem-agnostic mechanism to report filesystem
errors to a monitoring tool in userspace. I experimented first with
the watch_queue API but decided to move to fanotify for a few reasons.
2 Background
============
I submitted a first set of patches, based on David Howells' original
superblock notifications patchset, that I expanded into error
reporting and had an example implementation for ext4. Upstream review
has revealed a few design problems:
- Including the "function:line" tuple in the notification allows the
uniquely identification of the error origin but it also ties the
decodification of the error to the source code, i.e. <function:line>
is expected to change between releases.
- Useful debug data (inode number, block group) have formats specific
to the filesystems, and my design wouldn't be expansible to
filesystems other than ext4.
- The implementation allowed an error string description (equivalent
to what would be thrown in dmesg) that is too short, as it needs to
fit in a single notification.
- How the user sees the filesystem. The original patch points to a
mountpoint but uses the term superblock. This is to differentiate
from another mechanism in development to watch mounting operations.
- Visibility of objects. A bind mount of a subtree shouldn't receive
notifications of objects outside of that bind mount.
2.1 Other constraints of the watch_queue API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
watch_queue is a fairly new framework, which has one upstream user in
the keyring subsystem. watch_queue is designed to submit very short
(max of 128 bytes) atomic notifications to userspace in a fast manner
through a ring buffer. There is no current mechanism to link
notifications that require more than one slot and such mechanism
wouldn't be trivial to implement, since buffer overruns could
overwrite the beginning/end of a multi part notification. In
addition, watch_queue requires an out-of-band overflow notification
mechanism, which would need to be implemented aside from the system
call, in a separate API.
2.2 fanotify vs watch_queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~
watch_queue is designed for efficiency and fast submission of a large
number of notifications. It doesn't require memory allocation in the
submission path, but with a flexibility cost, such as limited
notification size. fanotify is more flexible, allows for larger
notifications and better filtering, but requires allocations on the
submission path.
On the other hand, fanotify already has support for the visibility
semantics we are looking for. fsnotify allows an inode to notify only
its subtree, mountpoints, or the entire filesystem, depending on the
watcher flags, while an equivalent support would need to be written
from scratch for watch_queue. fanotify also has in-band overflow
handling, already implemented. Finally, since fanotify supports much
larger notifications, there is no need to link separate notifications,
preventing buffer overruns from erasing parts of a notification chain.
fanotify is based on fsnotify, and the infrastructure for the
visibility semantics is mostly implemented by fsnotify itself. It
would be possible to make error notifications a new mechanism on top
of fsnotify, without modifying fanotify, but that would require
exposing a very similar interface to userspace, new system calls, and
that doesn't seem justifiable since we can extend fanotify syscalls
without ABI breakage.
3 Proposal
==========
The error notification framework is based on fanotify instead of
watch_queue. It is exposed through a new set of marks FAN_ERROR_*,
exposed through the fanotify_mark(2) API.
fanotify (fsnotify-based) has the infrastructure in-place to link
notifications happening at filesystem objects to mountpoints and to
filesystems, and already exposes an interface with well defined
semantics of how those are exposed to watchers in different
mountpoints or different subtrees.
A new message format is exposed, if the user passed
FAN_REPORT_DETAILED_ERROR fanotify_init(2) flag. FAN_ERROR messages
don't have FID/DFID records.
A FAN_REPORT_DETAILED_ERROR record has the same struct
fanotify_event_metadata header, but it is followed by one or more
additional information record as follows:
struct fanotify_event_error_hdr {
struct fanotify_event_info_header hdr;
__u32 error;
__u64 inode;
__u64 offset;
}
error is a VFS generic error number that can notify generic conditions
like EFSCORRUPT. If hdr.len is larger than sizeof(struct
fanotify_event_error_hdr), this structure is followed by an optional
filesystem specific record that further specifies the error,
originating object and debug data. This record has a generic header:
struct fanotify_event_fs_error_hdr {
struct fanotify_event_error_hdr hdr;
__kernel_fsid_t fsid;
__u32 fs_error;
}
fs_error is a filesystem specific error record, potentially more
detailed than fanotify_event_error.hdr.error . Each type of filesystem
error has its own record type, that is used to report different
information useful for each type of error. For instance, an ext4
lookup error, caused by an invalid inode type read from disk, produces
the following record:
struct fanotify_event_ext4_inode_error {
struct fanotify_event_fs_error_hdr hdr;
__u64 block;
__u32 inode_type;
}
The separation between VFS and filesystem-specific error messages has
the benefit of always providing some information that an error has
occurred, regardless of filesystem-specific support, while allowing
capable filesystems to expose specific internal data to further
document the issue. This scheme leaves to the filesystem to expose
more meaningful information as needed. For instance, multi-disk
filesystems can single out the problematic disk, network filesystems
can expose a network error while accessing the server.
3.1 Visibility semantics
~~~~~~~~~~~~~~~~~~~~~~~~
Error reporting follows the same semantics of fanotify events.
Therefore, a watcher can request to watch the entire filesystem, a
single mountpoint or a subtree.
3.2 security implications
~~~~~~~~~~~~~~~~~~~~~~~~~
fanotify requires CAP_SYS_ADMIN. My understanding is this requirement
suffices for most use cases but, according to fanotify documentation,
it could be relaxed without issues for the existing fanotify API. For
instance, watching a subtree could be a allowed for a user who owns
that subtree.
3.3 error location
~~~~~~~~~~~~~~~~~~
While exposing the exact line of code that triggered the notification
ties that notification to a specific kernel version, it is an
important information for those who completely control their
environment and kernel builds, such as cloud providers. Therefore,
this proposal includes a mechanism to optionally include in the
notification the line of code where the error occurred
A watcher who passes the flag FAN_REPORT_ERROR_LOCATION to
fanotify_init(2) receives an extra record for FAN_ERROR events:
struct fanotify_event_fs_error_location {
struct fanotify_event_info_header hdr;
u32 line;
char function[];
}
This record identifies the place where the error occured. function is
a VLA whose size extend to the end of the region delimited by hdr.len.
This VLA text-encodes the function name where the error occurred.
What do you think about his interface? Would it be acceptable as part
of fanotify, or should it be a new fsnotify mode?
Regarding semantics, I believe fanotify should solve the visibility
problem for a subtree watcher not being able to see other branches
notifications. Do you think this would suffice?
Thanks,
--
Gabriel Krisman Bertazi
next reply other threads:[~2021-01-20 20:14 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-20 20:13 Gabriel Krisman Bertazi [this message]
2021-01-21 4:01 ` [RFC] Filesystem error notifications proposal Viacheslav Dubeyko
2021-01-21 11:44 ` Jan Kara
2021-01-21 13:27 ` Amir Goldstein
2021-01-21 18:56 ` Gabriel Krisman Bertazi
2021-01-21 22:44 ` Theodore Ts'o
2021-01-22 0:44 ` Gabriel Krisman Bertazi
2021-01-22 7:36 ` Amir Goldstein
2021-02-02 20:51 ` Gabriel Krisman Bertazi
2021-01-28 22:28 ` Theodore Ts'o
2021-02-02 20:26 ` Gabriel Krisman Bertazi
2021-02-02 22:34 ` Theodore Ts'o
2021-02-08 18:49 ` Gabriel Krisman Bertazi
2021-02-08 22:19 ` Dave Chinner
2021-02-09 1:08 ` Theodore Ts'o
2021-02-09 5:12 ` Khazhismel Kumykov
2021-02-09 8:55 ` Dave Chinner
2021-02-09 17:57 ` Theodore Ts'o
2021-02-10 0:52 ` Darrick J. Wong
2021-02-10 2:21 ` Theodore Ts'o
2021-02-10 2:32 ` Darrick J. Wong
2021-02-09 17:35 ` Jan Kara
2021-02-10 0:22 ` Darrick J. Wong
2021-02-10 7:46 ` Dave Chinner
2021-02-10 0:49 ` Darrick J. Wong
2021-02-10 0:09 ` Darrick J. Wong
2021-02-10 7:23 ` Amir Goldstein
2021-02-10 23:29 ` Gabriel Krisman Bertazi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87lfcne59g.fsf@collabora.com \
--to=krisman@collabora.com \
--cc=amir73il@gmail.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=dhowells@redhat.com \
--cc=jack@suse.com \
--cc=kernel@collabora.com \
--cc=khazhy@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).