* Upcoming: Notifications, FS notifications and fsinfo()
@ 2020-03-30 13:58 David Howells
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
                   ` (6 more replies)
  0 siblings, 7 replies; 99+ messages in thread
From: David Howells @ 2020-03-30 13:58 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel

Hi Linus,

I have three sets of patches I'd like to push your way, if you (and Al) are
willing to consider them.

 (1) General notification queue plus key/keyring notifications.

     This adds the core of the notification queue built on pipes, and adds
     the ability to watch for changes to keys.

 (2) Mount and superblock notifications.

     This builds on (1) to provide notifications of mount topology changes
     and implements a framework for superblock events (configuration
     changes, I/O errors, quota/space overruns and network status changes).

 (3) Filesystem information retrieval.

     This provides an extensible way to retrieve informational attributes
     about mount objects and filesystems.  This includes providing
     information intended to make recovering from a notification queue
     overrun much easier.

We need (1) for Gnome to efficiently watch for changes in kerberos
keyrings.  Debarshi Ray has patches ready to go for gnome-online-accounts
so that it can make use of the facility.

Sets (2) and (3) can make libmount more efficient.  Karel Zak is working on
making use of this to avoid reading /proc/mountinfo.

We need something to make systemd's watching of the mount topology more
efficient, and (2) and (3) can help with this by making it faster to narrow
down what changed.  I think Karel has this in his sights, but hasn't yet
managed to work on it.

Set (2) should be able to make it easier to watch for mount options inside
a container, and set (3) should make it easier to examine the mounts inside
another mount namespace inside a container in a way that can't be done with
/proc/mounts.  This is requested by Christian Brauner.

Jeff Layton has a tentative addition to (3) to expose error state to
userspace, and Andres Freund would like this for Postgres.

Set (3) further allows the information returned by calls such as statx()
and ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits
are/aren't supported.

Further, for (3), I also allow filesystem-specific overrides/extensions to
fsinfo() and have a use for it in AFS to expose information about server
preference for a particular volume (something that is necessary for
implementing the toolset).  I've provided example code that does similar
for NFS and some that exposes superblock info from Ext4.  At Vault, Steve
expressed an interest in this for CIFS and Ted Ts'o expressed a possible
interest for Ext4.

Notes:

 (*) These patches will conflict with apparently upcoming refactoring of
     the security core, but the fixup doesn't look too bad:

	https://lore.kernel.org/linux-next/20200330130636.0846e394@canb.auug.org.au/T/#u

 (*) Miklós Szeredi would much prefer to implement fsinfo() as a magic
     filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear
     as directories under there, each with a set of attribute files
     corresponding to the attributes that fsinfo() would otherwise provide.
     To examine something by filename, you'd have to open it O_PATH and
     then read the individual attribute files in the corresponding per-fd
     directory.  A readfile() system call has been mooted to elide the
     {open,read,close} sequence to make it more efficient.

 (*) James Bottomley would like to deprecate fsopen(), fspick(), fsconfig()
     and fsmount() in favour of a more generic configfs with dedicated
     open, set-config and action syscalls, with an additional get-config
     syscall that would be used instead of fsinfo() - though, as I
     understand it, you'd have to create a config (fspick-equivalent)
     before you could use get-config.

 (*) I don't think Al has particularly looked at fsinfo() or the fs
     notifications patches yet.

 (*) I'm not sure what *your* opinion of fsinfo() is yet.  If you don't
     dislike it too, um, fragrantly, would you be willing to entertain part
     of it for now and prefer the rest to stew a bit longer?  I can drop
     some of the pieces.

Anyway, I'm going to formulate a pull request for each of them.

Thanks,
David

^ permalink raw reply	[flat|nested] 99+ messages in thread
* [GIT PULL] General notification queue and key notifications
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
@ 2020-03-30 14:31 ` David Howells
  2020-03-31  6:51   ` Stephen Rothwell
  2020-06-02 15:55   ` David Howells
  2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-03-30 14:31 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.  A subsequent pull request will add
mount and superblock event sources.

LSM hooks are included:

 (1) A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop.  This source will allow the desktop to avoid the need to poll.

Here's a pull request for usage by gnome-online-accounts:

	https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all cases, fd indicates the queue and the number after is a
     tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.

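As an illustration of design decision (5) above, here is a minimal
userspace sketch of walking a buffer of typed, length-prefixed records such
as read() might return.  It is only a sketch under stated assumptions:
struct demo_record_hdr, its field layout and demo_walk_records() are
hypothetical names made up for this example and do not reproduce the
definitions in include/uapi/linux/watch_queue.h.  The sketch assumes that
each record begins with a fixed header carrying a 24-bit type, an 8-bit
subtype, the total record length and the 8-bit tag supplied when the watch
was set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header for this sketch only; NOT the kernel's uapi layout. */
struct demo_record_hdr {
	uint32_t type_subtype;	/* bits 0-23: type, bits 24-31: subtype */
	uint8_t  len;		/* total record length in bytes, header included */
	uint8_t  tag;		/* tag given when the watch was set (0-255) */
	uint16_t flags;		/* type-specific, unused here */
};

typedef void (*demo_record_fn)(uint32_t type, uint8_t subtype, uint8_t tag,
			       const void *payload, size_t payload_len,
			       void *ctx);

/*
 * Walk every record in buf[0..size) and hand each one to fn (if non-NULL).
 * Returns the number of records seen, or -1 if a record is malformed or
 * truncated.
 */
static int demo_walk_records(const unsigned char *buf, size_t size,
			     demo_record_fn fn, void *ctx)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL || size == 0);

	while (off < size) {
		struct demo_record_hdr hdr;

		if (size - off < sizeof(hdr))
			return -1;
		/* Copy the header out to avoid unaligned access. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > size - off)
			return -1;

		if (fn)
			fn(hdr.type_subtype & 0xffffff,
			   (uint8_t)(hdr.type_subtype >> 24), hdr.tag,
			   buf + off + sizeof(hdr), hdr.len - sizeof(hdr), ctx);

		/* The length field is what lets records vary in size. */
		off += hdr.len;
		count++;
	}
	return count;
}
```

A caller would pass in the bytes obtained from one read() on the pipe and
switch on the type (and on the tag, to tell watches apart) in the callback.
Because every record carries its own length, a reader can step over record
types it does not understand.
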

TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David
---
The following changes since commit f8788d86ab28f61f7b46eb6be375f8a726783636:

  Linux 5.6-rc3 (2020-02-23 16:17:42 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200330

for you to fetch changes up to 694435dbde3d1da79aafaf4cd680802f9eb229b7:

  smack: Implement the watch_key and post_notification hooks (2020-03-19 17:31:09 +0000)

----------------------------------------------------------------
Notifications over pipes

----------------------------------------------------------------
David Howells (11):
      uapi: General notification queue definitions
      security: Add hooks to rule on setting a watch
      security: Add a hook for the point of notification insertion
      pipe: Add O_NOTIFICATION_PIPE
      pipe: Add general notification queue support
      watch_queue: Add a key/keyring notification facility
      Add sample notification program
      pipe: Allow buffers to be marked read-whole-or-error for notifications
      pipe: Add notification lossage handling
      selinux: Implement the watch_key security hook
      smack: Implement the watch_key and post_notification hooks

 Documentation/security/keys/core.rst               |  57 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/watch_queue.rst                      | 339 +++++++++++
 fs/pipe.c                                          | 242 +++++---
 fs/splice.c                                        |  12 +-
 include/linux/key.h                                |   3 +
 include/linux/lsm_audit.h                          |   1 +
 include/linux/lsm_hooks.h                          |  38 ++
 include/linux/pipe_fs_i.h                          |  27 +-
 include/linux/security.h                           |  31 +
 include/linux/watch_queue.h                        | 127 ++++
 include/uapi/linux/keyctl.h                        |   2 +
 include/uapi/linux/watch_queue.h                   | 104 ++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/watch_queue.c                               | 659 +++++++++++++++++++++
 samples/Kconfig                                    |   6 +
 samples/Makefile                                   |   1 +
 samples/watch_queue/Makefile                       |   7 +
 samples/watch_queue/watch_test.c                   | 186 ++++++
 security/keys/Kconfig                              |   9 +
 security/keys/compat.c                             |   3 +
 security/keys/gc.c                                 |   5 +
 security/keys/internal.h                           |  30 +-
 security/keys/key.c                                |  38 +-
 security/keys/keyctl.c                             |  99 +++-
 security/keys/keyring.c                            |  20 +-
 security/keys/request_key.c                        |   4 +-
 security/security.c                                |  23 +
 security/selinux/hooks.c                           |  14 +
 security/smack/smack_lsm.c                         |  83 ++-
 31 files changed, 2079 insertions(+), 105 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 kernel/watch_queue.c
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

* Re: [GIT PULL] General notification queue and key notifications
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
@ 2020-03-31  6:51   ` Stephen Rothwell
  2020-06-02 15:55   ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Stephen Rothwell @ 2020-03-31  6:51 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Hi David,

On Mon, 30 Mar 2020 15:31:04 +0100 David Howells <dhowells@redhat.com> wrote:
>
>       pipe: Add general notification queue support

This commit has a (reasonably simple) conflict against commit

  6551d5c56eb0 ("pipe: make sure to wake up everybody when the last reader/writer closes")

from Linus' tree.

Also a semantic conflict against commit

  52b31bc9aabc ("io_uring: add splice(2) support")

from the block tree needing this fix up (white space damaged)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fb8fe0bd5e18..8cdd3870cd4e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2470,7 +2470,7 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static bool io_splice_punt(struct file *file)
 {
-	if (get_pipe_info(file))
+	if (get_pipe_info(file, true))
 		return false;
 	if (!io_file_supports_async(file))
 		return true;

>       security: Add hooks to rule on setting a watch
>       security: Add a hook for the point of notification insertion

And these have a conflict in include/linux/lsm_hooks.h against commit

  98e828a0650f ("security: Refactor declaration of LSM hooks")

from the bpf-next tree (will be in the net-next tree pull).  That
requires taking the net-next version of include/linux/lsm_hooks.h and
then applying the following patch:

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 9cd4455528e5..4f8d63fd1327 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -252,6 +252,16 @@ LSM_HOOK(int, 0, inode_notifysecctx, struct inode *inode, void *ctx, u32 ctxlen)
 LSM_HOOK(int, 0, inode_setsecctx, struct dentry *dentry, void *ctx, u32 ctxlen)
 LSM_HOOK(int, 0, inode_getsecctx, struct inode *inode, void **ctx, u32 *ctxlen)
 
+#ifdef CONFIG_KEY_NOTIFICATIONS
+LSM_HOOK(int, 0, watch_key, struct key *key)
+#endif
+#ifdef CONFIG_DEVICE_NOTIFICATIONS
+LSM_HOOK(int, 0, watch_devices, void)
+#endif
+#ifdef CONFIG_WATCH_QUEUE
+LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
+	 const struct cred *cred, struct watch_notification *n)
+#endif
 #ifdef CONFIG_SECURITY_NETWORK
 LSM_HOOK(int, 0, unix_stream_connect, struct sock *sock, struct sock *other,

-- 
Cheers,
Stephen Rothwell

* [GIT PULL] General notification queue and key notifications
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
  2020-03-31  6:51   ` Stephen Rothwell
@ 2020-06-02 15:55   ` David Howells
  2020-06-03  2:15     ` Ian Kent
                       ` (4 more replies)
  1 sibling, 5 replies; 99+ messages in thread
From: David Howells @ 2020-06-02 15:55 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Date: Tue, 02 Jun 2020 16:51:44 +0100

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.

Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:

	https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

Without this, g-o-a has to constantly poll a keyring-based kerberos cache
to find out if kinit has changed anything.

[[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
   Ian Kent have been working on making libmount use them, preparatory to
   working on systemd:

	https://github.com/karelzak/util-linux/commits/topic/fsinfo
	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public

   Development has stalled briefly due to other commitments, so I'm not
   sure I can ask you to pull those parts of the series for now.  Christian
   Brauner would like to use them in lxc, but hasn't started. ]]


LSM hooks are included:

 (1) A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop.  This facility will allow the desktop to avoid the need to
poll.


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all cases, fd indicates the queue and the number after is a
     tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.

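The behaviour described in design decision (8) can be modelled in a few
lines of plain C.  This is a sketch of the idea only, not the kernel
implementation: struct demo_queue, demo_post(), demo_read() and
DEMO_MSG_LOSS are hypothetical names, the queue is a small fixed-size ring
of ints, there is no locking, and consecutive losses before the reader
catches up are merged into a single loss report.  The producer never waits;
when the ring is full it drops the message and remembers where in the
stream the loss happened, and the reader fabricates a loss message when it
reaches that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_SLOTS 4
#define DEMO_MSG_LOSS    (-1)	/* hypothetical marker for a fabricated loss message */

struct demo_queue {
	int    slots[DEMO_QUEUE_SLOTS];
	size_t head;		/* count of messages ever queued */
	size_t tail;		/* count of messages ever read */
	bool   lost;		/* a message was dropped and not yet reported */
	size_t loss_at;		/* stream position at which the drop happened */
};

/* Producer side: never waits.  Returns false if the message was dropped. */
static bool demo_post(struct demo_queue *q, int msg)
{
	assert(q != NULL);

	if (q->head - q->tail == DEMO_QUEUE_SLOTS) {
		if (!q->lost) {
			q->lost = true;
			q->loss_at = q->head;
		}
		return false;
	}
	q->slots[q->head % DEMO_QUEUE_SLOTS] = msg;
	q->head++;
	return true;
}

/*
 * Consumer side: returns true and fills *msg if there is something to
 * report.  The loss message is produced when the reader reaches the point
 * in the stream where the drop occurred.
 */
static bool demo_read(struct demo_queue *q, int *msg)
{
	assert(q != NULL && msg != NULL);

	if (q->lost && q->tail == q->loss_at) {
		q->lost = false;
		*msg = DEMO_MSG_LOSS;
		return true;
	}
	if (q->tail != q->head) {
		*msg = q->slots[q->tail % DEMO_QUEUE_SLOTS];
		q->tail++;
		return true;
	}
	return false;
}
```

With four slots, posting five messages drops the fifth; a reader then sees
the first four, followed by one loss message, followed by anything queued
after space became available again.  A reader that receives the loss
message knows its view may be stale and can resynchronise by other means.
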

Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David
---
The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce:

  Linux 5.7-rc6 (2020-05-17 16:48:37 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601

for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd:

  smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100)

----------------------------------------------------------------
Notifications over pipes + Keyring notifications

----------------------------------------------------------------
David Howells (12):
      uapi: General notification queue definitions
      security: Add a hook for the point of notification insertion
      pipe: Add O_NOTIFICATION_PIPE
      pipe: Add general notification queue support
      security: Add hooks to rule on setting a watch
      watch_queue: Add a key/keyring notification facility
      Add sample notification program
      pipe: Allow buffers to be marked read-whole-or-error for notifications
      pipe: Add notification lossage handling
      keys: Make the KEY_NEED_* perms an enum rather than a mask
      selinux: Implement the watch_key security hook
      smack: Implement the watch_key and post_notification hooks

 Documentation/security/keys/core.rst               |  57 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/watch_queue.rst                      | 339 +++++++++++
 fs/pipe.c                                          | 242 +++++---
 fs/splice.c                                        |  12 +-
 include/linux/key.h                                |  33 +-
 include/linux/lsm_audit.h                          |   1 +
 include/linux/lsm_hook_defs.h                      |   9 +
 include/linux/lsm_hooks.h                          |  14 +
 include/linux/pipe_fs_i.h                          |  27 +-
 include/linux/security.h                           |  30 +-
 include/linux/watch_queue.h                        | 127 ++++
 include/uapi/linux/keyctl.h                        |   2 +
 include/uapi/linux/watch_queue.h                   | 104 ++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/watch_queue.c                               | 659 +++++++++++++++++++++
 samples/Kconfig                                    |   6 +
 samples/Makefile                                   |   1 +
 samples/watch_queue/Makefile                       |   7 +
 samples/watch_queue/watch_test.c                   | 186 ++++++
 security/keys/Kconfig                              |   9 +
 security/keys/compat.c                             |   3 +
 security/keys/gc.c                                 |   5 +
 security/keys/internal.h                           |  38 +-
 security/keys/key.c                                |  38 +-
 security/keys/keyctl.c                             | 115 +++-
 security/keys/keyring.c                            |  20 +-
 security/keys/permission.c                         |  31 +-
 security/keys/process_keys.c                       |  46 +-
 security/keys/request_key.c                        |   4 +-
 security/security.c                                |  22 +-
 security/selinux/hooks.c                           |  51 +-
 security/smack/smack_lsm.c                         | 112 +++-
 34 files changed, 2185 insertions(+), 179 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 kernel/watch_queue.c
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
@ 2020-06-03  2:15     ` Ian Kent
  2020-06-08  0:49       ` Ian Kent
  2020-06-10  9:56     ` Christian Brauner
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 99+ messages in thread
From: Ian Kent @ 2020-06-03  2:15 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: viro, dray, kzak, mszeredi, swhiteho, jlayton, andres,
	christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel,
	linux-kernel

On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
>
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
>    Ian Kent have been working on making libmount use them, preparatory to
>    working on systemd:
>
> 	https://github.com/karelzak/util-linux/commits/topic/fsinfo
> 	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
>
>    Development has stalled briefly due to other commitments, so I'm not
>    sure I can ask you to pull those parts of the series for now.  Christian
>    Brauner would like to use them in lxc, but hasn't started. ]]

Linus,

Just so you're aware of what has been done and where we are at, here's
a summary.

Karel has done quite a bit of work on libmount (at this stage it's
getting hold of the mount information, aka. fsinfo()) and most of
what I have done is included in that too (which you can see in Karel's
repo above). You can see a couple of bug fixes and a little bit of
new code present in my repo which hasn't been sent over to Karel
yet.

This infrastructure is essential before notifications work is started
which is where we will see the most improvement.

It turns out that while systemd uses libmount it has its own
notifications handling sub-system as it deals with several event
types, not just mount information, in the same area. So, unfortunately,
changes will need to be made there as well as in libmount, more so
than the trivial changes to use fsinfo() via libmount.

That's where we are at the moment and I will get back to it once
I've dealt with a few things I postponed to work on libmount.

If you would like a more detailed account of what we have found I
can provide that too.

Is there anything else you would like from me or Karel?

Ian

* Re: [GIT PULL] General notification queue and key notifications 2020-06-03 2:15 ` Ian Kent @ 2020-06-08 0:49 ` Ian Kent 0 siblings, 0 replies; 99+ messages in thread From: Ian Kent @ 2020-06-08 0:49 UTC (permalink / raw) To: David Howells, torvalds Cc: viro, dray, kzak, mszeredi, swhiteho, jlayton, andres, christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel On Wed, 2020-06-03 at 10:15 +0800, Ian Kent wrote: > On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > > [[ With regard to the mount/sb notifications and fsinfo(), Karel > > Zak > > and > > Ian Kent have been working on making libmount use them, > > preparatory to > > working on systemd: > > > > https://github.com/karelzak/util-linux/commits/topic/fsinfo > > > > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public > > > > Development has stalled briefly due to other commitments, so I'm > > not > > sure I can ask you to pull those parts of the series for > > now. Christian > > Brauner would like to use them in lxc, but hasn't started. > > ]] > > Linus, > > Just so your aware of what has been done and where we are at here's > a summary. > > Karel has done quite a bit of work on libmount (at this stage it's > getting hold of the mount information, aka. fsinfo()) and most of > what I have done is included in that too which you can see in Karel's > repo above). You can see a couple of bug fixes and a little bit of > new code present in my repo which hasn't been sent over to Karel > yet. > > This infrastructure is essential before notifications work is started > which is where we will see the most improvement. > > It turns out that while systemd uses libmount it has it's own > notifications handling sub-system as it deals with several event > types, not just mount information, in the same area. So, > unfortunately, > changes will need to be made there as well as in libmount, more so > than the trivial changes to use fsinfo() via libmount. 
> > That's where we are at the moment and I will get back to it once > I've dealt with a few things I postponed to work on libmount. > > If you would like a more detailed account of what we have found I > can provide that too. > > Is there anything else you would like from me or Karel? I think there's a bit more I should say about this. One reason work hasn't progressed further on this is I spent quite a bit of time looking at the affects of using fsinfo(). My testing was done by using a large autofs direct mount map of 20000 entries which means that at autofs startup 20000 autofs mounts must be done and at autofs shutdown those 20000 mounts must be umounted. Not very scientific but something to use to get a feel for the affect of our changes. Initially just using fsinfo() to load all the mount entries was done to see how that would perform. This was done in a way that required no modifications to library user code but didn't get much improvement. Next loading all the mount ids (alone) for mount entry traversal was done and the various fields retrieved on-demand (implemented by Karel). Loading the entire mount table and then traversing the entries means the mount table is always possibly out of date. And loading the ids and getting the fields on-demand might have made that problem worse. But loading only the mount ids and using an on-demand method to get needed fields worked surprisingly well. The main issue is a mount going away while getting the fields. Testing showed that simply checking the field is valid and ignoring the entry if it isn't is enough to handle that case. Also the mount going away after the needed fields have been retrieved must be handled by callers of libmount as mounts can just as easily go away after reading the proc based tables. The case of the underlying mount information changing needs to be considered too. 
We will need to do better on that in the future but it too is a problem with the proc table handling and hasn't seen problems logged against libmount for it AFAIK. So, all in all, this approach worked pretty well as libmount users do use the getter access methods to retrieve the mount entry fields (which is required for the on-demand method to work). Certainly systemd always uses them (and it looks like udisks2 does too). Unfortunately using the libmount on-demand implementation requires library user code be modified (only a little in the systemd case) to use the implementation. Testing showed that we get between 10-15% reduction in overhead and CPU usage remained high. I think processing large numbers of mounts is simply a lot of work and there are particular cases that will remain that require the use of the load and traverse method. For example matching all mounts with a given prefix string (one of the systemd use cases). It's hard to get information about this but I can say that running perf during the autofs start and stop shows the bulk of the counter hits on the fsinfo() table construction code so that has to be where the overhead is. The unavoidable conclusion is that the load and traverse method that's been imposed on us for so long (even before libmount) for mount handling is what we need to get away from. After all, this is essentially where the problem comes from in the first place. And fsinfo() is designed to not need to use this method for getting mount information for that reason. There's also the notifications side of things which is the next area to work on. Looking at systemd I see that monitoring the proc mount table leads to a load, traverse, and process of the entire table for every single notification. It's clear that's because of the (what I'll call) anonymous notifications that we have now. 
The notifications in David's series carry event specific information, for example the mount id for mount notifications and the libmount fsinfo() implementation is written to use the mount id (lowest overhead lookup option), so there has to be significant improvement for this case. But systemd has its own notifications handling code so there will need to be non-trivial changes there as well as changes in libmount. Bottom line is we have a bit of a challenge with this because we are trying to change coding practices developed over many years that, necessarily, use a load/traverse method and it's going to take quite a while to change these coding practices. My question is, is there something specific, besides what we are doing, that you'd like to see done now in order to get the series merged? Ian ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-02 15:55 ` David Howells 2020-06-03 2:15 ` Ian Kent @ 2020-06-10 9:56 ` Christian Brauner 2020-06-10 11:12 ` Karel Zak ` (2 subsequent siblings) 4 siblings, 0 replies; 99+ messages in thread From: Christian Brauner @ 2020-06-10 9:56 UTC (permalink / raw) To: David Howells Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote: > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > Hi Linus, > > Can you pull this, please? It adds a general notification queue concept > and adds an event source for keys/keyrings, such as linking and unlinking > keys and changing their attributes. > > Thanks to Debarshi Ray, we do have a pull request to use this to fix a > problem with gnome-online-accounts - as mentioned last time: > > https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 > > Without this, g-o-a has to constantly poll a keyring-based kerberos cache > to find out if kinit has changed anything. > > [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and The mount/sb notification and fsinfo() stuff is something we'd like to use. (And then later extend to allow for supervised mounts where a container manager can supervise the mounts of an unprivileged container.) I'm not sure if the mount notifications are already part of this pr. Christian > Ian Kent have been working on making libmount use them, preparatory to > working on systemd: > > https://github.com/karelzak/util-linux/commits/topic/fsinfo > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public > > Development has stalled briefly due to other commitments, so I'm not > sure I can ask you to pull those parts of the series for now. Christian > Brauner would like to use them in lxc, but hasn't started. 
> ]] > > > LSM hooks are included: > > (1) A set of hooks are provided that allow an LSM to rule on whether or > not a watch may be set. Each of these hooks takes a different > "watched object" parameter, so they're not really shareable. The LSM > should use current's credentials. [Wanted by SELinux & Smack] > > (2) A hook is provided to allow an LSM to rule on whether or not a > particular message may be posted to a particular queue. This is given > the credentials from the event generator (which may be the system) and > the watch setter. [Wanted by Smack] > > I've provided SELinux and Smack with implementations of some of these hooks. > > > WHY > => > Key/keyring notifications are desirable because if you have your kerberos > tickets in a file/directory, your Gnome desktop will monitor that using > something like fanotify and tell you if your credentials cache changes. > > However, we also have the ability to cache your kerberos tickets in the > session, user or persistent keyring so that it isn't left around on disk > across a reboot or logout. Keyrings, however, cannot currently be > monitored asynchronously, so the desktop has to poll for it - not so good > on a laptop. This facility will allow the desktop to avoid the need to > poll. > > > DESIGN DECISIONS > ======== > > (1) The notification queue is built on top of a standard pipe. Messages > are effectively spliced in. The pipe is opened with a special flag: > > pipe2(fds, O_NOTIFICATION_PIPE); > > The special flag has the same value as O_EXCL (which doesn't seem like > it will ever be applicable in this context)[?]. It is given up front > to make it a lot easier to prohibit splice and co. from accessing the > pipe. > > [?] Should this be done some other way? I'd rather not use up a new > O_* flag if I can avoid it - should I add a pipe3() system call > instead? 
> > The pipe is then configured:: > > ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); > ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); > > Messages are then read out of the pipe using read(). > > (2) It should be possible to allow write() to insert data into the > notification pipes too, but this is currently disabled as the kernel > has to be able to insert messages into the pipe *without* holding > pipe->mutex and the code to make this work needs careful auditing. > > (3) sendfile(), splice() and vmsplice() are disabled on notification pipes > because of the pipe->mutex issue and also because they sometimes want > to revert what they just did - but one or more notification messages > might've been interleaved in the ring. > > (4) The kernel inserts messages with the wait queue spinlock held. This > means that pipe_read() and pipe_write() have to take the spinlock to > update the queue pointers. > > (5) Records in the buffer are binary, typed and have a length so that they > can be of varying size. > > This allows multiple heterogeneous sources to share a common buffer; > there are 16 million types available, of which I've used just a few, > so there is scope for others to be used. Tags may be specified when a > watchpoint is created to help distinguish the sources. > > (6) Records are filterable as types have up to 256 subtypes that can be > individually filtered. Other filtration is also available. > > (7) Notification pipes don't interfere with each other; each may be bound > to a different set of watches. Any particular notification will be > copied to all the queues that are currently watching for it - and only > those that are watching for it. > > (8) When recording a notification, the kernel will not sleep, but will > rather mark a queue as having lost a message if there's insufficient > space. read() will fabricate a loss notification message at an > appropriate point later. 
> > (9) The notification pipe is created and then watchpoints are attached to > it, using one of: > > keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); > watch_mount(AT_FDCWD, "/", 0, fd, 0x02); > watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); > > where in both cases, fd indicates the queue and the number after is a > tag between 0 and 255. > > (10) Watches are removed if either the notification pipe is destroyed or > the watched object is destroyed. In the latter case, a message will > be generated indicating the enforced watch removal. > > > Things I want to avoid: > > (1) Introducing features that make the core VFS dependent on the network > stack or networking namespaces (ie. usage of netlink). > > (2) Dumping all this stuff into dmesg and having a daemon that sits there > parsing the output and distributing it as this then puts the > responsibility for security into userspace and makes handling > namespaces tricky. Further, dmesg might not exist or might be > inaccessible inside a container. > > (3) Letting users see events they shouldn't be able to see. > > > TESTING AND MANPAGES > ========== > > (*) The keyutils tree has a pipe-watch branch that has keyctl commands for > making use of notifications. Proposed manual pages can also be found > on this branch, though a couple of them really need to go to the main > manpages repository instead. > > If the kernel supports the watching of keys, then running "make test" > on that branch will cause the testing infrastructure to spawn a > monitoring process on the side that monitors a notifications pipe for > all the key/keyring changes induced by the tests and they'll all be > checked off to make sure they happened. > > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch > > (*) A test program is provided (samples/watch_queue/watch_test) that can > be used to monitor for keyrings, mount and superblock events. > Information on the notifications is simply logged to stdout. 
> > Thanks, > David > --- > The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce: > > Linux 5.7-rc6 (2020-05-17 16:48:37 -0700) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601 > > for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd: > > smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100) > > ---------------------------------------------------------------- > Notifications over pipes + Keyring notifications > > ---------------------------------------------------------------- > David Howells (12): > uapi: General notification queue definitions > security: Add a hook for the point of notification insertion > pipe: Add O_NOTIFICATION_PIPE > pipe: Add general notification queue support > security: Add hooks to rule on setting a watch > watch_queue: Add a key/keyring notification facility > Add sample notification program > pipe: Allow buffers to be marked read-whole-or-error for notifications > pipe: Add notification lossage handling > keys: Make the KEY_NEED_* perms an enum rather than a mask > selinux: Implement the watch_key security hook > smack: Implement the watch_key and post_notification hooks > > Documentation/security/keys/core.rst | 57 ++ > Documentation/userspace-api/ioctl/ioctl-number.rst | 1 + > Documentation/watch_queue.rst | 339 +++++++++++ > fs/pipe.c | 242 +++++--- > fs/splice.c | 12 +- > include/linux/key.h | 33 +- > include/linux/lsm_audit.h | 1 + > include/linux/lsm_hook_defs.h | 9 + > include/linux/lsm_hooks.h | 14 + > include/linux/pipe_fs_i.h | 27 +- > include/linux/security.h | 30 +- > include/linux/watch_queue.h | 127 ++++ > include/uapi/linux/keyctl.h | 2 + > include/uapi/linux/watch_queue.h | 104 ++++ > init/Kconfig | 12 + > kernel/Makefile | 1 + > kernel/watch_queue.c | 659 +++++++++++++++++++++ > samples/Kconfig | 6 + > samples/Makefile | 1 + > 
samples/watch_queue/Makefile | 7 + > samples/watch_queue/watch_test.c | 186 ++++++ > security/keys/Kconfig | 9 + > security/keys/compat.c | 3 + > security/keys/gc.c | 5 + > security/keys/internal.h | 38 +- > security/keys/key.c | 38 +- > security/keys/keyctl.c | 115 +++- > security/keys/keyring.c | 20 +- > security/keys/permission.c | 31 +- > security/keys/process_keys.c | 46 +- > security/keys/request_key.c | 4 +- > security/security.c | 22 +- > security/selinux/hooks.c | 51 +- > security/smack/smack_lsm.c | 112 +++- > 34 files changed, 2185 insertions(+), 179 deletions(-) > create mode 100644 Documentation/watch_queue.rst > create mode 100644 include/linux/watch_queue.h > create mode 100644 include/uapi/linux/watch_queue.h > create mode 100644 kernel/watch_queue.c > create mode 100644 samples/watch_queue/Makefile > create mode 100644 samples/watch_queue/watch_test.c > ^ permalink raw reply [flat|nested] 99+ messages in thread
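Design points (5) and (6) quoted above describe the records as binary, typed, carrying a length, and tagged per watch. A minimal sketch of walking a buffer of such records might look like the following. The header layout used here (a 24-bit type and 8-bit subtype in one 32-bit word, an 8-bit length and 8-bit tag in a second word) and all of the demo_* names are assumptions made up for illustration; the authoritative layout is whatever include/uapi/linux/watch_queue.h in the series defines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HDR_LEN 8u   /* two 32-bit words (hypothetical layout) */
#define DEMO_MAX_LEN 255u /* length is kept in 8 bits in this sketch */

struct demo_record {
	uint32_t type;          /* 24-bit record type */
	uint8_t subtype;        /* 8-bit subtype, the filterable part */
	uint8_t tag;            /* tag given when the watch was set */
	size_t len;             /* whole record, header included */
	const uint8_t *payload; /* points just past the header */
};

/* Append one record to buf; returns bytes written, or 0 if it does not fit. */
size_t demo_pack(uint8_t *buf, size_t cap, uint32_t type, uint8_t subtype,
		 uint8_t tag, const void *payload, size_t payload_len)
{
	size_t len = DEMO_HDR_LEN + payload_len;
	uint32_t w0, w1;

	if (type > 0xffffffu || len > DEMO_MAX_LEN || len > cap)
		return 0;
	w0 = type | ((uint32_t)subtype << 24);
	w1 = (uint32_t)len | ((uint32_t)tag << 8);
	memcpy(buf, &w0, sizeof(w0));
	memcpy(buf + 4, &w1, sizeof(w1));
	if (payload_len)
		memcpy(buf + DEMO_HDR_LEN, payload, payload_len);
	return len;
}

/* Decode the record at buf; returns 0 on success, -1 if truncated/malformed. */
int demo_parse(const uint8_t *buf, size_t avail, struct demo_record *rec)
{
	uint32_t w0, w1;

	if (avail < DEMO_HDR_LEN)
		return -1;
	memcpy(&w0, buf, sizeof(w0));
	memcpy(&w1, buf + 4, sizeof(w1));
	rec->type = w0 & 0xffffffu;
	rec->subtype = (uint8_t)(w0 >> 24);
	rec->len = w1 & 0xffu;
	rec->tag = (uint8_t)((w1 >> 8) & 0xffu);
	rec->payload = buf + DEMO_HDR_LEN;
	if (rec->len < DEMO_HDR_LEN || rec->len > avail)
		return -1;
	return 0;
}

/*
 * Walk back-to-back records and count those carrying the given tag.
 * Returns -1 if a record is truncated or malformed.
 */
int demo_count_tag(const uint8_t *buf, size_t len, uint8_t tag)
{
	struct demo_record rec;
	int count = 0;

	while (len > 0) {
		if (demo_parse(buf, len, &rec) != 0)
			return -1;
		if (rec.tag == tag)
			count++;
		buf += rec.len; /* the length field is what allows varying sizes */
		len -= rec.len;
	}
	return count;
}
```

A reader loop would call something like demo_parse() on whatever read() returned and dispatch on the type and tag; because each record states its own length, unknown types can simply be skipped.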
* Re: [GIT PULL] General notification queue and key notifications 2020-06-02 15:55 ` David Howells 2020-06-03 2:15 ` Ian Kent 2020-06-10 9:56 ` Christian Brauner @ 2020-06-10 11:12 ` Karel Zak 2020-06-12 21:32 ` Linus Torvalds ` (3 more replies) 2020-06-13 18:00 ` pr-tracker-bot 2020-06-17 1:15 ` Williams, Dan J 4 siblings, 4 replies; 99+ messages in thread From: Karel Zak @ 2020-06-10 11:12 UTC (permalink / raw) To: torvalds Cc: David Howells, viro, dray, mszeredi, swhiteho, jlayton, raven, andres, christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel Hi Linus, On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote: > Can you pull this, please? It adds a general notification queue concept I'm trying to use David's notification stuff in userspace, and I guess feedback is welcome :-) The notification stuff looks pretty promising, but I do not understand why we need to use pipe for this purpose, see typical userspace use-case: int pipefd[2], fd; if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1) err(EXIT_FAILURE, "pipe2 failed"); fd = pipefd[0]; All the next operations are done with "fd". It's nowhere used as a pipe, and nothing uses pipefd[1]. The first impression from this code is "oh, this is strange; why?". Is it because we need to create a new file descriptor from nothing? Why O_NOTIFICATION_PIPE is better than introduce a new syscall notifyfd()? (We have signalfd(), no O_SIGNAL_PIPE, etc.) Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-10 11:12 ` Karel Zak @ 2020-06-12 21:32 ` Linus Torvalds 2020-06-12 22:01 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 0 replies; 99+ messages in thread From: Linus Torvalds @ 2020-06-12 21:32 UTC (permalink / raw) To: Karel Zak Cc: David Howells, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List [ Finally getting around to this since my normal pull queue is now empty ] On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote: > > The notification stuff looks pretty promising, but I do not understand > why we need to use pipe for this purpose The original intent was never to use the "pipe()" system call itself, only use pipes as the actual transport mechanism (because I do not for a second believe in the crazy "use sockets" model that a lot of other people seem to blindly believe in). But using "pipe()" also allows for non-kernel notification queues (ie where the events come from a user space process). Then you'd not use O_NOTIFICATION_PIPE, but O_DIRECT (for a packetized pipe). > Is it because we need to create a new file descriptor from nothing? > Why O_NOTIFICATION_PIPE is better than introduce a new syscall > notifyfd()? We could eventually introduce a new system call. But I most definitely did *NOT* want to see anything like that for any first gen stuff. Especially since it wasn't clear who was going to use it, and whether early trials would literally be done with that user-space emulation model of using a perfectly regular pipe (just with packetization). I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst it will be a useful marker. I think the only real reason for it was to avoid any clashes with splice(), which has more complex use of the pipe buffers. 
I'm so far just reading this thread and the arguments for users, and I haven't yet looked at all the actual details in the pull request - but last time I had objections to things it wasn't the code, it was the lack of any use. Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
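The user-space emulation mentioned above, a regular pipe opened with O_DIRECT so that it is packetized, can be sketched roughly as follows. This only illustrates packet-mode pipe semantics as documented in pipe(7); packet_pipe_demo() and the message contents are hypothetical, and nothing here touches O_NOTIFICATION_PIPE or the kernel-side notification code.

```c
#define _GNU_SOURCE /* for pipe2() and O_DIRECT */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write two messages into a packet-mode pipe and read them back.
 * Returns 0 if each read() handed back exactly one message, -1 otherwise.
 */
int packet_pipe_demo(void)
{
	static const char first[] = "event-A";
	static const char second[] = "event-B-longer";
	char buf[64];
	int fds[2];
	ssize_t n;
	int ret = -1;

	if (pipe2(fds, O_DIRECT) == -1)
		return -1;

	/* Each write() of at most PIPE_BUF bytes becomes one packet. */
	if (write(fds[1], first, sizeof(first)) != (ssize_t)sizeof(first) ||
	    write(fds[1], second, sizeof(second)) != (ssize_t)sizeof(second))
		goto out;

	/*
	 * buf could hold both messages, but in packet mode a read()
	 * returns one packet at a time, so the boundaries survive.
	 */
	n = read(fds[0], buf, sizeof(buf));
	if (n != (ssize_t)sizeof(first) || memcmp(buf, first, sizeof(first)) != 0)
		goto out;

	n = read(fds[0], buf, sizeof(buf));
	if (n != (ssize_t)sizeof(second) || memcmp(buf, second, sizeof(second)) != 0)
		goto out;

	ret = 0;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

With a plain pipe() the first read() could have returned both messages concatenated; the packet boundary is what lets a user-space event source stand in for a notification queue during early trials.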
* Re: [GIT PULL] General notification queue and key notifications 2020-06-10 11:12 ` Karel Zak 2020-06-12 21:32 ` Linus Torvalds @ 2020-06-12 22:01 ` Linus Torvalds 2020-06-13 13:04 ` David Howells 2020-06-13 13:24 ` David Howells 3 siblings, 0 replies; 99+ messages in thread From: Linus Torvalds @ 2020-06-12 22:01 UTC (permalink / raw) To: Karel Zak Cc: David Howells, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List [ Actually going through the code now ] On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote: > > All the next operations are done with "fd". It's nowhere used as a > pipe, and nothing uses pipefd[1]. As an aside, that isn't necessarily true. In some of the examples, pipefd[1] is used for configuration (sizing and adding filters), although I think right now that's not really enforced, and other examples seem to have pipefd[0] do that too. DavidH: should that perhaps be a hard rule, so that you can pass a pipefd[0] to readers, while knowing that they can't then change the kinds of notifications they see. In the "pipe: Add general notification queue support" commit message, the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then in the commit message for "watch_queue: Add a key/keyring notification facility" it uses pipefd[1]. And that latter example does make sense: using the write-side pipefd[1] for configuration, while the read-side pipefd[0] is the side that sees the results. That is also how it would work if you have a user-mode pipe with the notification source controlling the writing side - the reading side can obviously not add filters or change the semantics of the watches. So that allows a trusted side to add and create filters, while some untrusted entity can then see the results. 
This isn't going to hold up me merging the code, but it would be good to clarify and make that something that gets enforced if we decide it's worth it. It does seem conceptually like a good idea, and potentially actually useful to clearly separate the domain of "you can add watches and filters" from "you can see the notifications". Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-10 11:12 ` Karel Zak 2020-06-12 21:32 ` Linus Torvalds 2020-06-12 22:01 ` Linus Torvalds @ 2020-06-13 13:04 ` David Howells 2020-06-13 16:47 ` Linus Torvalds 2020-06-13 19:22 ` Miklos Szeredi 2020-06-13 13:24 ` David Howells 3 siblings, 2 replies; 99+ messages in thread From: David Howells @ 2020-06-13 13:04 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List Linus Torvalds <torvalds@linux-foundation.org> wrote: > I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst > it will be a useful marker. I think the only real reason for it was to > avoid any clashes with splice(), which has more complex use of the > pipe buffers. The main reason is to prevent splice because the iov_iter rewind for splice gets quite tricky if the kernel can randomly insert packets into the pipe buffer in between what splice is inserting. > I'm so far just reading this thread and the arguments for users, and I > haven't yet looked at all the actual details in the pull request - but > last time I had objections to things it wasn't the code, it was the > lack of any use. Would you be willing at this point to consider pulling the mount notifications and fsinfo() which helps support that? I could whip up pull reqs for those two pieces - or do you want to see more concrete patches that use it? David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-13 13:04 ` David Howells @ 2020-06-13 16:47 ` Linus Torvalds 2020-06-13 17:03 ` Linus Torvalds 2020-06-13 19:22 ` Miklos Szeredi 1 sibling, 1 reply; 99+ messages in thread From: Linus Torvalds @ 2020-06-13 16:47 UTC (permalink / raw) To: David Howells Cc: Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List On Sat, Jun 13, 2020 at 6:05 AM David Howells <dhowells@redhat.com> wrote: > > Would you be willing at this point to consider pulling the mount notifications > and fsinfo() which helps support that? I could whip up pull reqs for those > two pieces - or do you want to see more concrete patches that use it? I'd want to see more concrete use cases, but I'd also like to see that this keyring thing gets used and doesn't find any show-stoppers when it does. If we have multiple uses, and one of them notices some problem that requires any ABI changes, but the other one has already started using it, we'll have more problems. Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-13 16:47 ` Linus Torvalds @ 2020-06-13 17:03 ` Linus Torvalds 0 siblings, 0 replies; 99+ messages in thread From: Linus Torvalds @ 2020-06-13 17:03 UTC (permalink / raw) To: David Howells Cc: Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List On Sat, Jun 13, 2020 at 9:47 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > If we have multiple uses, and one of them notices some problem that > requires any ABI changes, but the other one has already started using > it, we'll have more problems. Ok, it's merged in my tree, although I was somewhat unhappy about the incomprehensible calling conventions of "get_pipe_info()". The random second argument just makes no sense when you read the code, it would have probably been better as a helper function or #define to clarify the whole "for_splice" thing. But let's see how it works and what actually happens. Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-13 13:04 ` David Howells 2020-06-13 16:47 ` Linus Torvalds @ 2020-06-13 19:22 ` Miklos Szeredi 1 sibling, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-06-13 19:22 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Karel Zak, Al Viro, dray, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List On Sat, Jun 13, 2020 at 3:05 PM David Howells <dhowells@redhat.com> wrote: > > I'm so far just reading this thread and the arguments for users, and I > > haven't yet looked at all the actual details in the pull request - but > > last time I had objections to things it wasn't the code, it was the > > lack of any use. > > Would you be willing at this point to consider pulling the mount notifications > and fsinfo() which helps support that? I could whip up pull reqs for those > two pieces - or do you want to see more concrete patches that use it? Well, I had some questions and comments for the mount notifications last time around[1] and didn't yet get a reply. And the fsinfo stuff is simply immature, please lets not merge it just yet. When we have some uses (most notably systemd) running on top of the current fsinfo interface, we can sit down and discuss how the API can be cleaned up. BTW I had a similar experience with the fsconfig() merge, which was pushed with some unpolished bits and where my comments were also largely ignored. So, before asking to pull, please at least *answer* reviews. You don't have to agree, but at least consider and think about the comments. Thanks, Miklos [1] https://lore.kernel.org/linux-fsdevel/CAJfpegspWA6oUtdcYvYF?ij=Bnq03b8VMbU9RNMKc+zzjbag@mail.gmail.com/ ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-10 11:12 ` Karel Zak ` (2 preceding siblings ...) 2020-06-13 13:04 ` David Howells @ 2020-06-13 13:24 ` David Howells 3 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-06-13 13:24 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel, Linux Kernel Mailing List Linus Torvalds <torvalds@linux-foundation.org> wrote: > > All the next operations are done with "fd". It's nowhere used as a > > pipe, and nothing uses pipefd[1]. > > As an aside, that isn't necessarily true. > > In some of the examples, pipefd[1] is used for configuration (sizing > and adding filters), although I think right now that's not really > enforced, and other examples seem to have pipefd[0] do that too. The configuration can happen on either end of the pipe. I just need to be able to find the pipe object. > DavidH: should that perhaps be a hard rule, so that you can pass a > pipefd[0] to readers, while knowing that they can't then change the > kinds of notifications they see. You can argue that the other way: that it should be a hard rule that you can pass pipefd[1] to writers, whilst knowing that they can't then change the kind of notifications that the kernel can insert into the pipe. My feeling is that it's more likely that you would keep the read end yourself and give the write end away - if at all. Most likely, IMO, would be that you attach notification sources and never use the write end directly. There is some argument for making it so that the notification sources belong to the read end only and that they keep the write side alive internally - meaning that you can just close the write end. 
All the notification sources just then disappear when the read end is closed - but dup() might make this kind of tricky as there is only one pipe object and it's shared between both ends. The existence of O_RDWR FIFOs might also make this tricky. > In the "pipe: Add general notification queue support" commit message, > the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then > in the commit message for "watch_queue: Add a key/keyring notification > facility" it uses pipefd[1]. > > And that latter example does make sense: using the write-side > pipefd[1] for configuration, while the read-side pipefd[0] is the side > that sees the results. That is also how it would work if you have a > user-mode pipe with the notification source controlling the writing > side - the reading side can obviously not add filters or change the > semantics of the watches. > > So that allows a trusted side to add and create filters, while some > untrusted entity can then see the results. As stated above, I think you should be looking at this the other way round - you're more likely to keep the read end for yourself. If you attach multiple sources to a pipe, everything they produce comes out mixed together from the read end of the pipe. You might even pass the write end to multiple userspace-side event generators, but I'm not sure it would make sense to pass the read end around unless you have sufficient flow that you need multiple consumers to keep up with it. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-02 15:55 ` David Howells ` (2 preceding siblings ...) 2020-06-10 11:12 ` Karel Zak @ 2020-06-13 18:00 ` pr-tracker-bot 2020-06-17 1:15 ` Williams, Dan J 4 siblings, 0 replies; 99+ messages in thread From: pr-tracker-bot @ 2020-06-13 18:00 UTC (permalink / raw) To: David Howells Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel The pull request you sent on Tue, 02 Jun 2020 16:55:04 +0100: > git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/6c3297841472b4e53e22e53826eea9e483d993e5 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-02 15:55 ` David Howells ` (3 preceding siblings ...) 2020-06-13 18:00 ` pr-tracker-bot @ 2020-06-17 1:15 ` Williams, Dan J 2020-06-23 23:38 ` Dan Williams 2020-06-24 0:55 ` David Howells 4 siblings, 2 replies; 99+ messages in thread From: Williams, Dan J @ 2020-06-17 1:15 UTC (permalink / raw) To: torvalds, dhowells Cc: raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres, keyrings, christian.brauner Hi David, On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > Hi Linus, > > Can you pull this, please? It adds a general notification queue > concept > and adds an event source for keys/keyrings, such as linking and > unlinking > keys and changing their attributes. [..] This commit: > keys: Make the KEY_NEED_* perms an enum rather than a mask ...upstream as: 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask ...triggers a regression in the libnvdimm unit test that exercises the encrypted keys used to store nvdimm passphrases. It results in the below warning. 
--- WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140 Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E) CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:key_task_permission+0xd3/0x140 Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06 RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297 RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01 RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0 R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840 R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540 FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0 Call Trace: lookup_user_key+0xeb/0x6b0 ? vsscanf+0x3df/0x840 ? key_validate+0x50/0x50 ? 
key_default_cmp+0x20/0x20 nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm] nvdimm_security_store+0x67d/0xb20 [libnvdimm] security_store+0x67/0x1a0 [libnvdimm] kernfs_fop_write+0xcf/0x1c0 vfs_write+0xde/0x1d0 ksys_write+0x68/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7f863c624547 Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547 RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005 RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0 R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0 R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0 irq event stamp: 36976 hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90 hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466 softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0 ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-17 1:15 ` Williams, Dan J @ 2020-06-23 23:38 ` Dan Williams 2020-06-24 0:55 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: Dan Williams @ 2020-06-23 23:38 UTC (permalink / raw) To: torvalds, dhowells Cc: raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres, keyrings, christian.brauner On Tue, Jun 16, 2020 at 6:15 PM Williams, Dan J <dan.j.williams@intel.com> wrote: > > Hi David, > > On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote: > > Date: Tue, 02 Jun 2020 16:51:44 +0100 > > > > Hi Linus, > > > > Can you pull this, please? It adds a general notification queue > > concept > > and adds an event source for keys/keyrings, such as linking and > > unlinking > > keys and changing their attributes. > [..] > > This commit: > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > ...upstream as: > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > ...triggers a regression in the libnvdimm unit test that exercises the > encrypted keys used to store nvdimm passphrases. It results in the > below warning. This regression is still present in tip of tree. David, have you had a chance to take a look? 
> > --- > > WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140 > Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p > ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E) > CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 > RIP: 0010:key_task_permission+0xd3/0x140 > Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06 > > RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297 > RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c > RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01 > RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0 > R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840 > R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540 > FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0 > Call Trace: > lookup_user_key+0xeb/0x6b0 > ? vsscanf+0x3df/0x840 > ? key_validate+0x50/0x50 > ? 
key_default_cmp+0x20/0x20 > nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm] > nvdimm_security_store+0x67d/0xb20 [libnvdimm] > security_store+0x67/0x1a0 [libnvdimm] > kernfs_fop_write+0xcf/0x1c0 > vfs_write+0xde/0x1d0 > ksys_write+0x68/0xe0 > do_syscall_64+0x5c/0xa0 > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > RIP: 0033:0x7f863c624547 > Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547 > RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005 > RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0 > R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0 > R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0 > irq event stamp: 36976 > hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90 > hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c > softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466 > softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0 > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-17 1:15 ` Williams, Dan J 2020-06-23 23:38 ` Dan Williams @ 2020-06-24 0:55 ` David Howells 2020-06-24 1:03 ` Dan Williams 2020-06-24 1:17 ` David Howells 1 sibling, 2 replies; 99+ messages in thread From: David Howells @ 2020-06-24 0:55 UTC (permalink / raw) To: Dan Williams Cc: dhowells, torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres, keyrings, christian.brauner Dan Williams <dan.j.williams@intel.com> wrote: > > This commit: > > > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > ...upstream as: > > > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > ...triggers a regression in the libnvdimm unit test that exercises the > > encrypted keys used to store nvdimm passphrases. It results in the > > below warning. > > This regression is still present in tip of tree. David, have you had a > chance to take a look? nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants the key for so that the appropriate security checks can take place in SELinux and Smack. Note that I have a patch in the works that changes this still further. Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work for you? David ^ permalink raw reply [flat|nested] 99+ messages in thread
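For readers following the regression above, here is a minimal user-space sketch of why an enum-typed permission request can start warning where a mask did not. It assumes the warning fires because the caller passed 0, which under a mask meant "no particular permission" but under an enum lands on an unspecified value. The `model_*` names and the bit values are hypothetical stand-ins, not the kernel's definitions.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for an enum-typed permission request. */
enum model_need_perm {
	MODEL_NEED_UNSPECIFIED = 0,	/* what a literal 0 now means */
	MODEL_NEED_VIEW,
	MODEL_NEED_READ,
	MODEL_NEED_WRITE,
	MODEL_NEED_SEARCH,
};

/* Hypothetical permission bits granted on a key (still a mask). */
#define MODEL_GRANT_VIEW	0x01u
#define MODEL_GRANT_READ	0x02u
#define MODEL_GRANT_WRITE	0x04u
#define MODEL_GRANT_SEARCH	0x08u

/*
 * Return 0 if 'granted' satisfies 'need', -EACCES otherwise.
 * An unspecified or unknown request is rejected loudly rather than
 * being treated as "nothing needed", which is what a mask of 0 implied.
 */
static int model_task_permission(unsigned int granted,
				 enum model_need_perm need)
{
	unsigned int want;

	switch (need) {
	case MODEL_NEED_VIEW:	want = MODEL_GRANT_VIEW;	break;
	case MODEL_NEED_READ:	want = MODEL_GRANT_READ;	break;
	case MODEL_NEED_WRITE:	want = MODEL_GRANT_WRITE;	break;
	case MODEL_NEED_SEARCH:	want = MODEL_GRANT_SEARCH;	break;
	default:
		fprintf(stderr, "warning: permission request %d not specified\n",
			(int)need);
		return -EACCES;
	}

	return (granted & want) == want ? 0 : -EACCES;
}
```

In this model, passing `MODEL_NEED_SEARCH` instead of 0 is the analogue of the suggested change to the third argument of lookup_user_key(): the caller states what it wants the key for, and the check can proceed.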
* Re: [GIT PULL] General notification queue and key notifications 2020-06-24 0:55 ` David Howells @ 2020-06-24 1:03 ` Dan Williams 2020-06-24 1:17 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: Dan Williams @ 2020-06-24 1:03 UTC (permalink / raw) To: David Howells Cc: torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres, keyrings, christian.brauner On Tue, Jun 23, 2020 at 5:55 PM David Howells <dhowells@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> wrote: > > > > This commit: > > > > > > > keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > > > ...upstream as: > > > > > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask > > > > > > ...triggers a regression in the libnvdimm unit test that exercises the > > > encrypted keys used to store nvdimm passphrases. It results in the > > > below warning. > > > > This regression is still present in tip of tree. David, have you had a > > chance to take a look? > > nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants > the key for so that the appropriate security checks can take place in SELinux > and Smack. Note that I have a patch in the works that changes this still > further. > > Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work > for you? It does, thanks. Shall I wait for your further reworks to fix this for v5.8, or is that v5.9 material? ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] General notification queue and key notifications 2020-06-24 0:55 ` David Howells 2020-06-24 1:03 ` Dan Williams @ 2020-06-24 1:17 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: David Howells @ 2020-06-24 1:17 UTC (permalink / raw) To: Dan Williams Cc: dhowells, torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres, keyrings, christian.brauner Dan Williams <dan.j.williams@intel.com> wrote: > Shall I wait for your further reworks to fix this for v5.8, or is that > v5.9 material? It could do with stewing in linux-next for a while, so 5.9 probably. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* [GIT PULL] Mount and superblock notifications 2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells 2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells @ 2020-03-30 14:36 ` David Howells 2020-04-04 21:13 ` Linus Torvalds 2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells ` (4 subsequent siblings) 6 siblings, 1 reply; 99+ messages in thread From: David Howells @ 2020-03-30 14:36 UTC (permalink / raw) To: torvalds Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel Hi Linus, If you could consider pulling this - or would you prefer it to go through Al? It adds a couple of VFS-related event sources for the general notification mechanism: (1) Mount topology events, such as mounting, unmounting, mount expiry, mount reconfiguration. (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O errors (not complete yet). WHY === (1) Mount notifications. This one is wanted to avoid repeated trawling of /proc/mounts or similar to work out changes to the mount object attributes and mount topology. I'm told that the proc file holding the namespace_sem is a point of contention, especially as the process of generating the text descriptions of the mounts/superblocks can be quite involved. Whilst you can use poll() on /proc/mounts, it doesn't give you any clues as to what changed. The notification generated here directly indicates the mounts involved in any particular event and gives an idea of what the change was. This is combined with a new fsinfo() system call that allows, amongst other things, the ability to retrieve in one go an { id, change_counter } tuple from all the children of a specified mount, allowing buffer overruns to be dealt with quickly. 
This can be used by systemd to improve efficiency: https://lore.kernel.org/linux-fsdevel/20200227151421.3u74ijhqt6ekbiss@ws.net.home/ And it's not just Red Hat that's potentially interested in this: https://lore.kernel.org/linux-fsdevel/293c9bd3-f530-d75e-c353-ddeabac27cf6@6wind.com/ Also, this can be used to improve management of containers by allowing watches to be set in foreign mount namespaces, such as are in a container. (2) Superblock notifications. This one is provided to allow systemd or the desktop to more easily detect events such as I/O errors and EDQUOT/ENOSPC. This would be of interest to Postgres: https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/ But could also be used to indicate to systemd when a superblock has had its configuration changed. Thanks, David --- The following changes since commit 694435dbde3d1da79aafaf4cd680802f9eb229b7: smack: Implement the watch_key and post_notification hooks (2020-03-19 17:31:09 +0000) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-fs-20200330 for you to fetch changes up to 8dbf1aa122da5bbb4ede0f363a8a18dfc723be33: watch_queue: sample: Display superblock notifications (2020-03-19 17:31:09 +0000) ---------------------------------------------------------------- Filesystem notifications ---------------------------------------------------------------- David Howells (6): watch_queue: Add security hooks to rule on setting mount and sb watches watch_queue: Implement mount topology and attribute change notifications watch_queue: sample: Display mount tree change notifications watch_queue: Introduce a non-repeating system-unique superblock ID watch_queue: Add superblock notifications watch_queue: sample: Display superblock notifications Documentation/watch_queue.rst | 24 ++- arch/alpha/kernel/syscalls/syscall.tbl | 2 + arch/arm/tools/syscall.tbl | 2 + arch/arm64/include/asm/unistd.h | 2 +- 
arch/arm64/include/asm/unistd32.h | 4 + arch/ia64/kernel/syscalls/syscall.tbl | 2 + arch/m68k/kernel/syscalls/syscall.tbl | 2 + arch/microblaze/kernel/syscalls/syscall.tbl | 2 + arch/mips/kernel/syscalls/syscall_n32.tbl | 2 + arch/mips/kernel/syscalls/syscall_n64.tbl | 2 + arch/mips/kernel/syscalls/syscall_o32.tbl | 2 + arch/parisc/kernel/syscalls/syscall.tbl | 2 + arch/powerpc/kernel/syscalls/syscall.tbl | 2 + arch/s390/kernel/syscalls/syscall.tbl | 2 + arch/sh/kernel/syscalls/syscall.tbl | 2 + arch/sparc/kernel/syscalls/syscall.tbl | 2 + arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + arch/xtensa/kernel/syscalls/syscall.tbl | 2 + fs/Kconfig | 21 +++ fs/Makefile | 1 + fs/internal.h | 1 + fs/mount.h | 21 +++ fs/mount_notify.c | 228 ++++++++++++++++++++++++++++ fs/namespace.c | 22 +++ fs/super.c | 205 +++++++++++++++++++++++++ include/linux/dcache.h | 1 + include/linux/fs.h | 62 ++++++++ include/linux/lsm_hooks.h | 24 +++ include/linux/security.h | 16 ++ include/linux/syscalls.h | 4 + include/uapi/asm-generic/unistd.h | 6 +- include/uapi/linux/watch_queue.h | 65 +++++++- kernel/sys_ni.c | 6 + samples/watch_queue/watch_test.c | 81 +++++++++- security/security.c | 14 ++ 36 files changed, 835 insertions(+), 5 deletions(-) create mode 100644 fs/mount_notify.c ^ permalink raw reply [flat|nested] 99+ messages in thread
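The overrun recovery mentioned in the mount-notification rationale above, comparing { id, change_counter } tuples once the queue has dropped events, might look roughly like this in userspace. This is a sketch only: `struct mount_snapshot`, `struct mount_delta` and `diff_snapshots()` are hypothetical names, and how the tuples are fetched (the proposed fsinfo() call) is left out. Both arrays are assumed to be sorted by mount ID.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace copy of one { id, change_counter } tuple. */
struct mount_snapshot {
	uint64_t id;
	uint64_t change_counter;
};

enum mount_delta_kind { MOUNT_ADDED, MOUNT_REMOVED, MOUNT_CHANGED };

struct mount_delta {
	uint64_t id;
	enum mount_delta_kind what;
};

/*
 * Merge-walk two snapshots sorted by id and record what differs.
 * At most max_out deltas are stored in out[]; the return value is the
 * total number of deltas found, which may exceed max_out.
 */
static size_t diff_snapshots(const struct mount_snapshot *old, size_t n_old,
			     const struct mount_snapshot *cur, size_t n_cur,
			     struct mount_delta *out, size_t max_out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < n_old || j < n_cur) {
		struct mount_delta d;
		int have = 1;

		if (j == n_cur || (i < n_old && old[i].id < cur[j].id)) {
			/* Present before, gone now. */
			d.id = old[i++].id;
			d.what = MOUNT_REMOVED;
		} else if (i == n_old || cur[j].id < old[i].id) {
			/* Not seen before. */
			d.id = cur[j++].id;
			d.what = MOUNT_ADDED;
		} else {
			/* Same mount: only interesting if the counter moved. */
			d.id = cur[j].id;
			d.what = MOUNT_CHANGED;
			have = old[i].change_counter != cur[j].change_counter;
			i++;
			j++;
		}

		if (have) {
			if (n < max_out)
				out[n] = d;
			n++;
		}
	}
	return n;
}
```

The point of the counters is that a watcher which has lost events only needs to re-examine the mounts this walk reports, rather than re-reading and re-parsing the whole mount table.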
* Re: [GIT PULL] Mount and superblock notifications 2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells @ 2020-04-04 21:13 ` Linus Torvalds 2020-04-05 22:52 ` Andres Freund 0 siblings, 1 reply; 99+ messages in thread From: Linus Torvalds @ 2020-04-04 21:13 UTC (permalink / raw) To: David Howells Cc: Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, jlayton, Ian Kent, andres, Christian Brauner, keyrings, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 30, 2020 at 7:37 AM David Howells <dhowells@redhat.com> wrote:
>
> If you could consider pulling this - or would you prefer it to go through
> Al? It adds a couple of VFS-related event sources for the general
> notification mechanism:

My issue with these remains the same as it was last time, so I'll just quote what I said back then:

 "So I no longer hate the implementation, but I do want to see the actual user space users come out of the woodwork and try this out for their use cases.

  I'd hate to see a new event queue interface that people then can't really use due to it not fulfilling their needs, or can't use for some other reason."

I want to see somebody step up enough to say "yes, I actually use this, and have the patches for the user space side, and it helps my load by 3000%, and here are the numbers, and the event overflow case isn't an issue because Y"

Or whatever. It doesn't have to be performance, but the separate discussion I've seen has been about that being the reason for it. I just don't want it to be a _hypothetical_ reason. I want it to be a tested reason where people said "yeah, this is easy to use and actually fixes the problems".

Because if what happens is that when the events overflow, and maybe people fall back on the old model (or whatever) then that probably just means that you do better up until a point where you start doing _worse_ than we used to.
Or people find out that they needed more information anyway, and the event model doesn't work when you restart your special server because you've lost the original state. Or any other number of "cool feature, but I can't really use it". IOW, I really want to know that yes, the design is what people will then use and it actually fixes real-world issues. And it needs to be interesting and pressing enough that those people actually at least do a working prototype on top of a patch-set that hasn't made it into the kernel yet. Now, I realize that other projects won't _upstream_ their support before the kernel has the infrastructure, so I'm not looking for _that_ kind of "yeah, look, project XYZ already does this and Red Hat ships it". No, I'm looking for those outside developers who say more than "this is a pet peeve of mine with the existing interface". I want to see some actual use - even if it's just in a development environment - that shows that it's (a) sufficient and (b) actually fixes problems. Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [GIT PULL] Mount and superblock notifications 2020-04-04 21:13 ` Linus Torvalds @ 2020-04-05 22:52 ` Andres Freund 0 siblings, 0 replies; 99+ messages in thread From: Andres Freund @ 2020-04-05 22:52 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, jlayton, Ian Kent, Christian Brauner, keyrings, linux-fsdevel, Linux Kernel Mailing List Hi, On 2020-04-04 14:13:03 -0700, Linus Torvalds wrote: > And it needs to be interesting and pressing enough that those people > actually at least do a working prototype on top of a patch-set that > hasn't made it into the kernel yet. > > Now, I realize that other projects won't _upstream_ their support > before the kernel has the infrastructure, so I'm not looking for > _that_ kind of "yeah, look, project XYZ already does this and Red Hat > ships it". No, I'm looking for those outside developers who say more > than "this is a pet peeve of mine with the existing interface". I want > to see some actual use - even if it's just in a development > environment - that shows that it's (a) sufficient and (b) actually > fixes problems. FWIW, postgres remains interested in using the per-superblock events. On 2020-03-30 15:36:54 +0100, David Howells wrote: > (2) Superblock notifications. > > This one is provided to allow systemd or the desktop to more easily > detect events such as I/O errors and EDQUOT/ENOSPC. This would be of > interest to Postgres: > > https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/ > > But could also be used to indicate to systemd when a superblock has > had its configuration changed. What prevents me from coming up with a prototype is that the error handling pieces aren't complete, as far as I can tell: On 2020-03-30 15:36:54 +0100, David Howells wrote: > (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O > errors (not complete yet). 
There's afaict no notify_sb_error() callers, making it hard for me to actually test anything. The important issue for us is I/O errors, but EDQUOT/ENOSPC could also be useful (but is not urgent). Greetings, Andres Freund ^ permalink raw reply [flat|nested] 99+ messages in thread
* [GIT PULL] fsinfo: Filesystem information query 2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells 2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells 2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells @ 2020-03-30 14:43 ` David Howells 2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi ` (3 subsequent siblings) 6 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-03-30 14:43 UTC (permalink / raw) To: torvalds Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel

Hi Linus,

If you could consider pulling this - or would you prefer it to go through Al? It adds a system call, fsinfo(), that allows information about the VFS, mount topology, superblock and files to be retrieved.

This is based on top of the mount and superblock notifications patchset and allows event counters implemented in the latter to be retrieved to allow overruns to be efficiently managed.

I've excluded the Ext4 and NFS sample code from this pull request that I've previously included in my posted patchsets.

=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents mounting and unmounting. Lots of data is encoded and mangled into text whilst the lock is held, including superblock option strings and mount point paths. This causes performance problems when there are a lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole file isn't necessarily atomic with respect to mount-type operations. If a read isn't satisfied in one go, then it may return to userspace briefly and then continue reading some way into the file. But changes can occur in the interval that may then go unseen.
 (3) Determining what has changed means parsing and comparing consecutive outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through /proc/mounts and searching by path or mount ID - but we might have an fd we want to query.

 (5) Mount topology is not explicit. One must derive it manually by comparing entries.

 (6) Whilst you can poll() it for events, it only tells you that something changed in the namespace, not what or whether you can even see the change.

To fix the notification issues, the preceding notifications patchset added mount watch notifications whereby you can watch for notifications in a specific mount subtree. The notification messages include the ID(s) of the affected mounts.

To support notifications, however, we need to be able to handle overruns in the notification queue. I added a number of event counters to struct super_block and struct mount to allow you to pin down the changes, but there needs to be a way to retrieve them. Exposing them through /proc would require adding yet another /proc/mounts-type file. We could add per-mount directories full of attributes in sysfs, but that has issues also (see below).

Adding an extensible system call interface for retrieving filesystem information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are actually 3-state { Set, Unset, Not supported }. It could be useful to provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*]. Filesystems don't all have the same capabilities, and even different instances may have different capabilities, particularly with network filesystems where the set may be server-dependent.
     Capabilities might even vary at file granularity - though possibly such information should be conveyed through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information. I need something for AFS so that I can do pioctl()-emulation, thereby allowing me to implement certain of the AFS command line utilities that query the state of a particular file. This could also have application for other filesystems, such as NFS, CIFS and ext4.

 [*] In a lot of cases these are probably fixed and can be memcpy'd from static data.

There's a further consideration: I want to make it possible to have fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager such that the manager can supervise a mount attempted inside the container. The manager would be given an fd pointing to the fs_context struct and would then need some way to query it (fsinfo()) and modify it (fsconfig()). This could also be used to arbitrate user-requested mounts when containers are not in play.

================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque blob). The type is fixed for a particular attribute.

 (4) For structure types, the length is also a version. New fields can be tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a version mismatch by truncating or zero-padding the data as necessary. None of this is seen by the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer or result buffer.

 (8) "Meta" attributes can describe other attributes.

========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a particular path point to be queried as a set of attributes.
Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a sequence-of-sequences of values and all the values of a particular attribute must be of the same type. Values can be up to INT_MAX size, subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries within a single superblock, depending on the specific dentry that you're looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible, so integer attribute ID rather than string and the core does all the buffer allocation and expansion and all the extensibility support work rather than leaving that to the filesystems. This means that userspace pointers are not exposed to the filesystem.

fsinfo() allows a variety of information to be retrieved about a filesystem and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.
     There is no requirement that a filesystem implement every attribute. A helper function is provided to scan a list of attributes and a filesystem can have multiple such lists.

 (2) Version length-dependent structure attributes can be made larger and have additional information tacked on the end, provided it keeps the layout of the existing fields. If an older process asks for a shorter structure, it will only be given the bits it asks for. If a newer process asks for a longer structure on an older kernel, the extra space will be set to 0. In all cases, the size of the data actually available is returned.

     In essence, the size of a structure is that structure's version: a smaller size is an earlier version and a later version includes everything that the earlier version did.

 (3) New single-bit capability flags can be added. This is a structure-typed attribute and, as such, (2) applies. Any bits you wanted but the kernel doesn't support are automatically set to 0.

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
		     &server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.
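The size-as-version rule described above (truncate for an older caller, zero-pad for a newer one, always report the full size available) can be modelled in a few lines of plain C. This is an illustrative sketch, not the code in fs/fsinfo.c; `copy_versioned()` and the two `limits_v*` structs are made-up names, and an ordinary memcpy() stands in for the copy to userspace.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute struct as an older program knows it. */
struct limits_v1 {
	uint64_t max_file_size;
};

/* The same attribute after a field was tacked onto the end. */
struct limits_v2 {
	uint64_t max_file_size;
	uint64_t max_links;
};

/*
 * Copy a value of avail_len bytes into a caller buffer of buf_len bytes.
 * A shorter buffer gets a truncated copy; a longer buffer has its tail
 * cleared to zero.  The return value is always the size actually
 * available, so the caller can tell which case it hit.
 */
static size_t copy_versioned(void *buf, size_t buf_len,
			     const void *value, size_t avail_len)
{
	size_t n = buf_len < avail_len ? buf_len : avail_len;

	memcpy(buf, value, n);
	if (buf_len > n)
		memset((char *)buf + n, 0, buf_len - n);
	return avail_len;
}
```

A caller built against `limits_v1` that receives a return value larger than `sizeof(struct limits_v1)` knows the provider has newer fields it did not ask for; a caller built against `limits_v2` that receives a smaller value knows the trailing fields were zero-filled rather than reported.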
In future, I want to make fsinfo() capable of querying a context created by fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.

Thanks,
David
---
The following changes since commit 8dbf1aa122da5bbb4ede0f363a8a18dfc723be33:

  watch_queue: sample: Display superblock notifications (2020-03-19 17:31:09 +0000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/fsinfo-20200330

for you to fetch changes up to 80f0ef902951fd3e7446767e7d37d543d936b72f:

  fsinfo: Add support for AFS (2020-03-19 17:31:34 +0000)

----------------------------------------------------------------
Filesystem information

----------------------------------------------------------------
David Howells (11):
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow mount topology and propagation info to be retrieved
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS

 Documentation/filesystems/fsinfo.rst        | 574 +++++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/Kconfig | 7 + fs/Makefile | 1 + fs/afs/internal.h | 1 + fs/afs/super.c | 218 +++++++- fs/d_path.c | 2 +- fs/fsinfo.c | 725 ++++++++++++++++++++++++ fs/internal.h | 14 + fs/mount.h | 3 + fs/mount_notify.c | 2 + fs/namespace.c | 389 ++++++++++++- include/linux/fs.h | 4 + include/linux/fsinfo.h | 111 ++++ include/linux/syscalls.h | 4 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/fsinfo.h | 326 +++++++++++ include/uapi/linux/mount.h | 10 +- kernel/sys_ni.c | 1 + samples/vfs/Makefile | 7 + samples/vfs/test-fsinfo.c | 818 ++++++++++++++++++++++++++++ samples/vfs/test-mntinfo.c | 279 ++++++++++ 39 files changed, 3511 insertions(+), 9 deletions(-) create mode 100644 Documentation/filesystems/fsinfo.rst create mode 100644 fs/fsinfo.c create mode 100644 include/linux/fsinfo.h create mode 100644 include/uapi/linux/fsinfo.h create mode 100644 samples/vfs/test-fsinfo.c create mode 100644 samples/vfs/test-mntinfo.c ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells ` (2 preceding siblings ...) 2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells @ 2020-03-30 20:28 ` Miklos Szeredi 2020-03-31 9:21 ` Karel Zak 2020-03-30 21:17 ` Christian Brauner ` (2 subsequent siblings) 6 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-03-30 20:28 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, keyrings, linux-fsdevel, linux-kernel On Mon, Mar 30, 2020 at 3:58 PM David Howells <dhowells@redhat.com> wrote: > > > Hi Linus, > > I have three sets of patches I'd like to push your way, if you (and Al) are > willing to consider them. The basic problem in my view, is that the performance requirement of a "get filesystem information" type of API just does not warrant a binary coded interface. I've said this a number of times, but it fell on deaf ears. Such binary ABIs (especially if not very carefully designed and reviewed) usually go through several revisions as the structure fails to account for future changes in the representation of those structure fields. There are too many examples of this to count. Then there's the problem of needing to update libc, utilities and language bindings on each revision or extension of the interface. All this could be solved with a string key/value representation of the same data, with minimal performance loss on encoding/parsing. The proposed fs interface[1] is one example of that, but I could also imagine a syscall based one too. Thanks, Miklos [1] https://lore.kernel.org/linux-fsdevel/20200309200238.GB28467@miu.piliscsaba.redhat.com/ ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Karel Zak @ 2020-03-31 9:21 UTC
To: Miklos Szeredi
Cc: David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, keyrings, linux-fsdevel, linux-kernel

On Mon, Mar 30, 2020 at 10:28:56PM +0200, Miklos Szeredi wrote:
> All this could be solved with a string key/value representation of the
> same data, with minimal performance loss on encoding/parsing. The
> proposed fs interface[1] is one example of that, but I could also
> imagine a syscall based one too.

Yes, key/value is a possible solution. The question is whether we really
need to add an extra /sys-like filesystem to get key/value ;-)

I can imagine key/value from an FD-based interface without
open/read/close for each attribute:

	fd = open("/mnt", O_PATH);
	fsinfo(fd, "propagation", buf, sizeof(buf));
	fsinfo(fd, "fstype", buf, sizeof(buf));
	close(fd);

Why do I need /mountfs/<id>/propagation and /mountfs/<id>/fstype to get
the same? It sounds like over-engineering without any extra bonus.

Anyway, if we have FD-based interfaces like fsopen(), fsmount(),
open_tree() and move_mount(), then it sounds strange that you cannot use
the FD to ask the kernel for the mount node attributes and instead need
to open and read other /sys-like files.

IMHO it would be nice if, after open("/mnt", O_PATH), I could do whatever
with the mount point (umount, move, reconfigure, query, etc.). Please,
try to keep it simple and consistent ;-)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells ` (3 preceding siblings ...) 2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi @ 2020-03-30 21:17 ` Christian Brauner 2020-03-31 5:11 ` Miklos Szeredi 2020-03-31 7:22 ` Lennart Poettering 2020-03-31 17:31 ` David Howells 2020-03-31 21:52 ` David Howells 6 siblings, 2 replies; 99+ messages in thread From: Christian Brauner @ 2020-03-30 21:17 UTC (permalink / raw) To: David Howells Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel, lennart, cyphar [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="maccentraleurope", Size: 4438 bytes --] [Cc Lennart and Aleksa, both of which maintain projects too that would make use of this] On Mon, Mar 30, 2020 at 02:58:22PM +0100, David Howells wrote: > > Hi Linus, > > I have three sets of patches I'd like to push your way, if you (and Al) are > willing to consider them. > > (1) General notification queue plus key/keyring notifications. > > This adds the core of the notification queue built on pipes, and adds > the ability to watch for changes to keys. > > (2) Mount and superblock notifications. > > This builds on (1) to provide notifications of mount topology changes > and implements a framework for superblock events (configuration > changes, I/O errors, quota/space overruns and network status changes). > > (3) Filesystem information retrieval. > > This provides an extensible way to retrieve informational attributes > about mount objects and filesystems. This includes providing > information intended to make recovering from a notification queue > overrun much easier. > > We need (1) for Gnome to efficiently watch for changes in kerberos > keyrings. 
Debarshi Ray has patches ready to go for gnome-online-accounts > so that it can make use of the facility. > > Sets (2) and (3) can make libmount more efficient. Karel Zak is working on > making use of this to avoid reading /proc/mountinfo. > > We need something to make systemd's watching of the mount topology more > efficient, and (2) and (3) can help with this by making it faster to narrow > down what changed. I think Karel has this in his sights, but hasn't yet > managed to work on it. > > Set (2) should be able to make it easier to watch for mount options inside > a container, and set (3) should make it easier to examine the mounts inside > another mount namespace inside a container in a way that can't be done with > /proc/mounts. This is requested by Christian Brauner. > > Jeff Layton has a tentative addition to (3) to expose error state to > userspace, and Andres Freund would like this for Postgres. > > Set (3) further allows the information returned by such as statx() and > ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits are/aren't > supported. > > Further, for (3), I also allow filesystem-specific overrides/extensions to > fsinfo() and have a use for it to AFS to expose information about server > preference for a particular volume (something that is necessary for > implementing the toolset). I've provided example code that does similar > for NFS and some that exposes superblock info from Ext4. At Vault, Steve > expressed an interest in this for CIFS and Ted Ts'o expressed a possible > interest for Ext4. 
> > Notes: > > (*) These patches will conflict with apparently upcoming refactoring of > the security core, but the fixup doesn't look too bad: > > https://lore.kernel.org/linux-next/20200330130636.0846e394@canb.auug.org.au/T/#u > > (*) Miklós Szeredi would much prefer to implement fsinfo() as a magic > filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear > as directories under there, each with a set of attribute files > corresponding to the attributes that fsinfo() would otherwise provide. > To examine something by filename, you'd have to open it O_PATH and > then read the individual attribute files in the corresponding per-fd > directory. A readfile() system call has been mooted to elide the > {open,read,close} sequence to make it more efficient. Fwiw, putting down my kernel hat and speaking as someone who maintains two container runtimes and various other low-level bits and pieces in userspace who'd make heavy use of this stuff I would prefer the fd-based fsinfo() approach especially in the light of across namespace operations, querying all properties of a mount atomically all-at-once, and safe delegation through fds. Another heavy user of this would be systemd (Cced Lennart who I've discussed this with) which would prefer the fd-based approach as well. I think pulling this into a filesystem and making userspace parse around in a filesystem tree to query mount information is the wrong approach and will get messy pretty quickly especially in the face of mount and user namespace interactions and various other pitfalls. fsinfo() fits quite nicely with the all-fd-based approach of the whole mount api. So yes, definitely preferred from my end. Christian ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-30 21:17 ` Christian Brauner @ 2020-03-31 5:11 ` Miklos Szeredi 2020-03-31 8:15 ` Christian Brauner ` (2 more replies) 2020-03-31 7:22 ` Lennart Poettering 1 sibling, 3 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-03-31 5:11 UTC (permalink / raw) To: Christian Brauner Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner <christian.brauner@ubuntu.com> wrote: > Fwiw, putting down my kernel hat and speaking as someone who maintains > two container runtimes and various other low-level bits and pieces in > userspace who'd make heavy use of this stuff I would prefer the fd-based > fsinfo() approach especially in the light of across namespace > operations, querying all properties of a mount atomically all-at-once, fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure, it's possible to check the various change counters before and after a batch of calls to check that the result is consistent. Still, that's not an atomic all-at-once query, if you'd really require that, than fsinfo(2) as it currently stands would be inadequate. > and safe delegation through fds. Another heavy user of this would be > systemd (Cced Lennart who I've discussed this with) which would prefer > the fd-based approach as well. I think pulling this into a filesystem > and making userspace parse around in a filesystem tree to query mount > information is the wrong approach and will get messy pretty quickly > especially in the face of mount and user namespace interactions and > various other pitfalls. Have you actually looked at my proposed patch? Do you have concrete issues or just vague bad feelings? Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Christian Brauner @ 2020-03-31 8:15 UTC
To: Miklos Szeredi
Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
>
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
>
> fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure,
> it's possible to check the various change counters before and after a
> batch of calls to check that the result is consistent. Still, that's
> not an atomic all-at-once query, if you'd really require that, than
> fsinfo(2) as it currently stands would be inadequate.

If at all, that's only true for batch requests.

>
> > and safe delegation through fds. Another heavy user of this would be
> > systemd (Cced Lennart who I've discussed this with) which would prefer
> > the fd-based approach as well. I think pulling this into a filesystem
> > and making userspace parse around in a filesystem tree to query mount
> > information is the wrong approach and will get messy pretty quickly
> > especially in the face of mount and user namespace interactions and
> > various other pitfalls.
>
> Have you actually looked at my proposed patch? Do you have concrete

Yes. So have others; Al actively disliked and nacked it and no-one got
excited about it.

> issues or just vague bad feelings?

We have had that discussion on-list, where I made my "vague bad feelings"
clear and you responded in the same dismissive style, so I don't see
the point in repeating this experience.

Again, I want to make it clear that here I'm stating my preference as a
user of this api, and as such I don't want to have to parse through a
filesystem to get complex information about filesystems. We've had
fruitful discussions [1] around how fsinfo() ties in with supervised
mounts and the rest of the mount api, and it's clear and simple, especially
in the face of namespaces, and implements a nice delegation model. So +1
from me.

Christian

[1]: https://youtu.be/LN2CUgp8deo?th40
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 8:15 ` Christian Brauner @ 2020-03-31 8:34 ` Miklos Szeredi 0 siblings, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-03-31 8:34 UTC (permalink / raw) To: Christian Brauner Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Tue, Mar 31, 2020 at 10:15 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote: > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains > > > two container runtimes and various other low-level bits and pieces in > > > userspace who'd make heavy use of this stuff I would prefer the fd-based > > > fsinfo() approach especially in the light of across namespace > > > operations, querying all properties of a mount atomically all-at-once, > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure, > > it's possible to check the various change counters before and after a > > batch of calls to check that the result is consistent. Still, that's > > not an atomic all-at-once query, if you'd really require that, than > > fsinfo(2) as it currently stands would be inadequate. > > It at all that's only true for batch requests. For example, there's no way to atomically query mount flags, parent, and list of children with a single fsinfo() call, you actually need three calls and they can reflect different states of the same mount. Not saying this is a problem, just that there's no list of requirements on what is needed and why. > > > and safe delegation through fds. Another heavy user of this would be > > > systemd (Cced Lennart who I've discussed this with) which would prefer > > > the fd-based approach as well. 
I think pulling this into a filesystem > > > and making userspace parse around in a filesystem tree to query mount > > > information is the wrong approach and will get messy pretty quickly > > > especially in the face of mount and user namespace interactions and > > > various other pitfalls. > > > > Have you actually looked at my proposed patch? Do you have concrete > > Yes. So have others, Al actively disliked and nacked it and no-one got > excited about it. Al, as far as I remember, nacked several things the patch was doing. I fixed those. > > issues or just vague bad feelings? > > We have had that discussion on-list where I made my "vague bad feelings" > clear where you responded with the same dismissive style so I don't see > the point in repeating this experience. > > Again, I want to make it clear that here I'm stating my preference as a > user of this api and as such I don't want to have to parse through a > filesystem to get complex information about filesystems. We've had > fruitful discussions [1] around how fsinfo() ties in with supervised > mounts and the rest of the mount api and its clear and simple especially > in the face of namespaces and implements a nice delegation model. So +1 > from me. And you keep ignoring the fact that my patch implements that exact same delegation model. That's why I'm asking if you have looked at it or not. The "I don't want to have to parse through a filesystem to get complex information about filesystems" is not a set of requirements that an API can be designed from. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
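An illustration of the change-counter check discussed in this message: the sketch below shows the general "read counter, read attributes, re-read counter, retry on mismatch" pattern. It is not the fsinfo() interface. The struct names, the `interfere` hook and the single in-memory mock mount are all assumptions made up for this example, and the mock is single-threaded, so it only demonstrates the control flow. As the message points out, the result is a consistent snapshot obtained by retrying, not an atomic query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one mount's state; not a kernel structure. */
struct mock_mount {
	uint64_t change_counter;	/* bumped on every modification */
	uint32_t flags;
	int parent_id;
	int nr_children;
};

struct mount_snapshot {
	uint32_t flags;
	int parent_id;
	int nr_children;
};

/* Optional hook used only to simulate a change landing between two reads. */
typedef void (*interfere_fn)(struct mock_mount *m, int attempt);

/*
 * Read three attributes as separate steps (standing in for three separate
 * queries) and accept the result only if the change counter did not move.
 * Returns false if no consistent snapshot was obtained in max_attempts tries.
 */
bool read_consistent(struct mock_mount *m, struct mount_snapshot *out,
		     int max_attempts, interfere_fn interfere)
{
	assert(m != NULL && out != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		uint64_t before = m->change_counter;

		out->flags = m->flags;
		if (interfere)
			interfere(m, attempt);
		out->parent_id = m->parent_id;
		out->nr_children = m->nr_children;

		if (m->change_counter == before)
			return true;	/* nothing changed while we were reading */
	}
	return false;
}

/* Example interference: reparent the mount during the first attempt only. */
void reparent_once(struct mock_mount *m, int attempt)
{
	if (attempt == 0) {
		m->parent_id = 99;
		m->change_counter++;
	}
}
```

With `reparent_once` as the hook, the first pass is discarded because the counter moved, and the second pass returns the post-change state.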
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Karel Zak @ 2020-03-31 8:34 UTC
To: Miklos Szeredi
Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
>
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
>
> fsinfo(2) doesn't meet the atomically all-at-once requirement.

I guess your /proc based idea has exactly the same problem...

I see two possible ways:

- after open("/mnt", O_PATH), create a copy-on-write object in the kernel
  to represent the mount node -- the kernel will be able to modify it, but
  userspace will get unchanged data from the FD until close()

- improve fsinfo() to provide a set (list) of the attributes in one call

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 8:34 ` Karel Zak @ 2020-03-31 8:56 ` Miklos Szeredi 2020-03-31 9:49 ` Karel Zak ` (2 more replies) 0 siblings, 3 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-03-31 8:56 UTC (permalink / raw) To: Karel Zak Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote: > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote: > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains > > > two container runtimes and various other low-level bits and pieces in > > > userspace who'd make heavy use of this stuff I would prefer the fd-based > > > fsinfo() approach especially in the light of across namespace > > > operations, querying all properties of a mount atomically all-at-once, > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement. > > I guess your /proc based idea have exactly the same problem... Yes, that's exactly what I wanted to demonstrate: there's no fundamental difference between the two API's in this respect. > I see two possible ways: > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to > represent mount node -- kernel will able to modify it, but userspace > will get unchanged data from the FD until to close() > > - improve fsinfo() to provide set (list) of the attributes by one call I think we are approaching this from the wrong end. Let's just ignore all of the proposed interfaces for now and only concentrate on what this will be used for. Start with a set of use cases by all interested parties. E.g. 
- systemd wants to keep track attached mounts in a namespace, as well as new detached mounts created by fsmount() - systemd need to keep information (such as parent, children, mount flags, fs options, etc) up to date on any change of topology or attributes. - util linux needs to display the topology and state of mounts in the system that corresponds to a consistent state that set of mounts - etc... From that we can derive a set of requirements for the API. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Karel Zak @ 2020-03-31 9:49 UTC
To: Miklos Szeredi
Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

On Tue, Mar 31, 2020 at 10:56:35AM +0200, Miklos Szeredi wrote:
> I think we are approaching this from the wrong end. Let's just
> ignore all of the proposed interfaces for now and only concentrate on
> what this will be used for.
>
> Start with a set of use cases by all interested parties. E.g.
>
> - systemd wants to keep track attached mounts in a namespace, as well
> as new detached mounts created by fsmount()
>
> - systemd need to keep information (such as parent, children, mount
> flags, fs options, etc) up to date on any change of topology or
> attributes.
>
> - util linux needs to display the topology and state of mounts in the
> system that corresponds to a consistent state that set of mounts

- like systemd, we also need mount/umount to be able to query one
  mountpoint rather than parse all of /proc/self/mountinfo

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 8:56 ` Miklos Szeredi 2020-03-31 9:49 ` Karel Zak @ 2020-03-31 12:25 ` Lennart Poettering 2020-03-31 15:10 ` Miklos Szeredi 2020-03-31 21:56 ` David Howells 2 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-03-31 12:25 UTC (permalink / raw) To: Miklos Szeredi Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Di, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote: > > > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote: > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner > > > <christian.brauner@ubuntu.com> wrote: > > > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains > > > > two container runtimes and various other low-level bits and pieces in > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based > > > > fsinfo() approach especially in the light of across namespace > > > > operations, querying all properties of a mount atomically all-at-once, > > > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement. > > > > I guess your /proc based idea have exactly the same problem... > > Yes, that's exactly what I wanted to demonstrate: there's no > fundamental difference between the two API's in this respect. > > > I see two possible ways: > > > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to > > represent mount node -- kernel will able to modify it, but userspace > > will get unchanged data from the FD until to close() > > > > - improve fsinfo() to provide set (list) of the attributes by one call > > I think we are approaching this from the wrong end. 
Let's just > ignore all of the proposed interfaces for now and only concentrate on > what this will be used for. > > Start with a set of use cases by all interested parties. E.g. > > - systemd wants to keep track attached mounts in a namespace, as well > as new detached mounts created by fsmount() > > - systemd need to keep information (such as parent, children, mount > flags, fs options, etc) up to date on any change of topology or > attributes. - We also have code that recursively remounts r/o or unmounts some directory tree (with filters), which is currently nasty to do since the relationships between dirs are not always clear from /proc/self/mountinfo alone, in particular not in an even remotely atomic fashion, or when stuff is overmounted. - We also have code that needs to check if /dev/ is plain tmpfs or devtmpfs. We cannot use statfs for that, since in both cases TMPFS_MAGIC is reported, hence we currently parse /proc/self/mountinfo for that to find the fstype string there, which is different for both cases. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
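For readers unfamiliar with the workaround described in this message, here is a minimal sketch of pulling the fstype string out of a /proc/self/mountinfo line, which is how a plain tmpfs on /dev can be told apart from devtmpfs today. It is not systemd's code. The function names and buffer sizes are assumptions chosen for the example, and `fstype_of()` assumes that a later matching line describes the mount currently on top.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the mount point (field 5) and the filesystem type (first field
 * after the " - " separator) from one mountinfo line.
 * Returns 0 on success, -1 if the line does not have the expected shape.
 */
int parse_mountinfo_line(const char *line, char *mnt, size_t mnt_sz,
			 char *fstype, size_t fs_sz)
{
	char point[256], type[64];
	const char *sep;

	assert(line != NULL && mnt != NULL && fstype != NULL);

	/* Skip mount ID, parent ID, major:minor and root; keep the mount point. */
	if (sscanf(line, "%*s %*s %*s %*s %255s", point) != 1)
		return -1;

	/* Optional fields vary in number, so locate the separator explicitly. */
	sep = strstr(line, " - ");
	if (!sep || sscanf(sep + 3, "%63s", type) != 1)
		return -1;

	snprintf(mnt, mnt_sz, "%s", point);
	snprintf(fstype, fs_sz, "%s", type);
	return 0;
}

/*
 * Look up the fstype of `mountpoint` in a mountinfo-format file.
 * Assumption: when several lines match, the last one wins.
 */
int fstype_of(const char *path, const char *mountpoint,
	      char *fstype, size_t fs_sz)
{
	char line[4096], mnt[256], type[64];
	int found = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (parse_mountinfo_line(line, mnt, sizeof(mnt),
					 type, sizeof(type)) == 0 &&
		    strcmp(mnt, mountpoint) == 0) {
			snprintf(fstype, fs_sz, "%s", type);
			found = 0;
		}
	}
	fclose(f);
	return found;
}
```

A caller would use something like `fstype_of("/proc/self/mountinfo", "/dev", buf, sizeof(buf))` and compare the result against "devtmpfs". This still reads the whole file to answer a question about a single mount, which is the cost the thread is trying to avoid.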
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 12:25 ` Lennart Poettering @ 2020-03-31 15:10 ` Miklos Szeredi 2020-03-31 15:24 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-03-31 15:10 UTC (permalink / raw) To: Lennart Poettering Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Di, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote: > > > > > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote: > > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner > > > > <christian.brauner@ubuntu.com> wrote: > > > > > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains > > > > > two container runtimes and various other low-level bits and pieces in > > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based > > > > > fsinfo() approach especially in the light of across namespace > > > > > operations, querying all properties of a mount atomically all-at-once, > > > > > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement. > > > > > > I guess your /proc based idea have exactly the same problem... > > > > Yes, that's exactly what I wanted to demonstrate: there's no > > fundamental difference between the two API's in this respect. > > > > > I see two possible ways: > > > > > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to > > > represent mount node -- kernel will able to modify it, but userspace > > > will get unchanged data from the FD until to close() > > > > > > - improve fsinfo() to provide set (list) of the attributes by one call > > > > I think we are approaching this from the wrong end. 
Let's just > > ignore all of the proposed interfaces for now and only concentrate on > > what this will be used for. > > > > Start with a set of use cases by all interested parties. E.g. > > > > - systemd wants to keep track attached mounts in a namespace, as well > > as new detached mounts created by fsmount() > > > > - systemd need to keep information (such as parent, children, mount > > flags, fs options, etc) up to date on any change of topology or > > attributes. > > - We also have code that recursively remounts r/o or unmounts some > directory tree (with filters), Recursive remount-ro is clear. What is not clear is whether you need to do this for hidden mounts (not possible from userspace without a way to disable mount following on path lookup). Would it make sense to add a kernel API for recursive setting of mount flags? What exactly is this unmount with filters? Can you give examples? > - We also have code that needs to check if /dev/ is plain tmpfs or > devtmpfs. We cannot use statfs for that, since in both cases > TMPFS_MAGIC is reported, hence we currently parse > /proc/self/mountinfo for that to find the fstype string there, which > is different for both cases. Okay. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 15:10 ` Miklos Szeredi @ 2020-03-31 15:24 ` Lennart Poettering 0 siblings, 0 replies; 99+ messages in thread From: Lennart Poettering @ 2020-03-31 15:24 UTC (permalink / raw) To: Miklos Szeredi Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Di, 31.03.20 17:10, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > On Di, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote: > > > > > > > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote: > > > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner > > > > > <christian.brauner@ubuntu.com> wrote: > > > > > > > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains > > > > > > two container runtimes and various other low-level bits and pieces in > > > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based > > > > > > fsinfo() approach especially in the light of across namespace > > > > > > operations, querying all properties of a mount atomically all-at-once, > > > > > > > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement. > > > > > > > > I guess your /proc based idea have exactly the same problem... > > > > > > Yes, that's exactly what I wanted to demonstrate: there's no > > > fundamental difference between the two API's in this respect. 
> > > > > > > I see two possible ways: > > > > > > > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to > > > > represent mount node -- kernel will able to modify it, but userspace > > > > will get unchanged data from the FD until to close() > > > > > > > > - improve fsinfo() to provide set (list) of the attributes by one call > > > > > > I think we are approaching this from the wrong end. Let's just > > > ignore all of the proposed interfaces for now and only concentrate on > > > what this will be used for. > > > > > > Start with a set of use cases by all interested parties. E.g. > > > > > > - systemd wants to keep track attached mounts in a namespace, as well > > > as new detached mounts created by fsmount() > > > > > > - systemd need to keep information (such as parent, children, mount > > > flags, fs options, etc) up to date on any change of topology or > > > attributes. > > > > - We also have code that recursively remounts r/o or unmounts some > > directory tree (with filters), > > Recursive remount-ro is clear. What is not clear is whether you need > to do this for hidden mounts (not possible from userspace without a > way to disable mount following on path lookup). Would it make sense > to add a kernel API for recursive setting of mount flags? I would be very happy about an explicit kernel API for recursively toggling the MS_RDONLY. But for many usecases in systemd we need the ability to filter some subdirs and leave them as is, so while helpful we'd have to keep the userspace code we currently have anyway. > What exactly is this unmount with filters? Can you give examples? Hmm, actually it's only the r/o remount that has filters, not the unmount. Sorry for the confusion. And the r/o remount with filters just means: "remount everything below X read-only except for X/Y and X/Z/A"... Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
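The filter rule quoted above ("remount everything below X read-only except for X/Y and X/Z/A") can be sketched as a pure path predicate. This is only an illustration of the selection logic, not systemd's implementation. The function names are invented for the example, and it assumes that excluding a directory also excludes every mount beneath it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `path` is `prefix` itself or lies below it on a component boundary. */
bool path_is_at_or_below(const char *path, const char *prefix)
{
	size_t n;

	assert(path != NULL && prefix != NULL);

	n = strlen(prefix);
	while (n > 1 && prefix[n - 1] == '/')
		n--;			/* ignore trailing slashes */

	if (strncmp(path, prefix, n) != 0)
		return false;
	if (n == 1 && prefix[0] == '/')
		return true;		/* every absolute path is below "/" */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Decide whether the mount at `mnt` should be remounted read-only when
 * processing the tree rooted at `top`, skipping anything at or below one
 * of the `keep` paths.
 */
bool should_remount_ro(const char *mnt, const char *top,
		       const char *const *keep, size_t nr_keep)
{
	if (!path_is_at_or_below(mnt, top))
		return false;
	for (size_t i = 0; i < nr_keep; i++)
		if (path_is_at_or_below(mnt, keep[i]))
			return false;
	return true;
}
```

A remount loop would walk the mount list (today from /proc/self/mountinfo) and apply this predicate to each mount point before changing its flags. The component-boundary check matters: excluding /X/Z/A must not also exclude /X/Z/AB.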
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 8:56 ` Miklos Szeredi 2020-03-31 9:49 ` Karel Zak 2020-03-31 12:25 ` Lennart Poettering @ 2020-03-31 21:56 ` David Howells 2 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-03-31 21:56 UTC (permalink / raw) To: Lennart Poettering Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai Lennart Poettering <mzxreary@0pointer.de> wrote: > - We also have code that needs to check if /dev/ is plain tmpfs or > devtmpfs. We cannot use statfs for that, since in both cases > TMPFS_MAGIC is reported, hence we currently parse > /proc/self/mountinfo for that to find the fstype string there, which > is different for both cases. btw, fsinfo(FSINFO_ATTR_IDS) gets you the name of the filesystem type in addition to the magic number. David ^ permalink raw reply [flat|nested] 99+ messages in thread
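The mountinfo route Lennart describes (statfs reports TMPFS_MAGIC for both tmpfs and devtmpfs, so the fstype string has to come from /proc/self/mountinfo) can be sketched roughly as below. This is a minimal illustration, not systemd's code: the names parse_mountinfo_line(), mount_has_fstype() and struct mount_entry are made up for this example, the buffer sizes are arbitrary, escaped characters in mount points (such as \040 for a space) are not decoded, and it assumes that the last line listed for a path is the mount currently on top.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the two mountinfo fields we care about. */
struct mount_entry {
	char mount_point[256];
	char fstype[64];
};

/*
 * Parse one line of /proc/self/mountinfo.
 * Returns 0 on success, -1 if the line does not look like mountinfo.
 */
int parse_mountinfo_line(const char *line, struct mount_entry *e)
{
	const char *sep;

	assert(line && e);

	/* Skip mount ID, parent ID, major:minor and root; field 5 is the
	 * mount point. */
	if (sscanf(line, "%*d %*d %*s %*s %255s", e->mount_point) != 1)
		return -1;

	/* The optional fields end at a lone "-"; the fstype follows it. */
	sep = strstr(line, " - ");
	if (!sep)
		return -1;
	if (sscanf(sep + 3, "%63s", e->fstype) != 1)
		return -1;
	return 0;
}

/*
 * Return 1 if the last mount listed on `path` has type `want`, 0 if it
 * has another type, -1 if `path` is not listed as a mount point.
 * Lines longer than the local buffer are not handled in this sketch.
 */
int mount_has_fstype(FILE *mountinfo, const char *path, const char *want)
{
	char line[4096];
	struct mount_entry e;
	int result = -1;

	while (fgets(line, sizeof(line), mountinfo)) {
		if (parse_mountinfo_line(line, &e) == 0 &&
		    strcmp(e.mount_point, path) == 0)
			result = strcmp(e.fstype, want) == 0;
	}
	return result;
}
```

A caller would open /proc/self/mountinfo with fopen() and ask mount_has_fstype(f, "/dev", "devtmpfs"). The whole table has to be read and parsed to answer a question about a single mount, which is the overhead under discussion in this thread.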
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 5:11 ` Miklos Szeredi 2020-03-31 8:15 ` Christian Brauner 2020-03-31 8:34 ` Karel Zak @ 2020-03-31 21:54 ` David Howells 2020-04-01 8:43 ` Karel Zak 2 siblings, 1 reply; 99+ messages in thread From: David Howells @ 2020-03-31 21:54 UTC (permalink / raw) To: Karel Zak Cc: dhowells, Miklos Szeredi, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai Karel Zak <kzak@redhat.com> wrote: > - improve fsinfo() to provide set (list) of the attributes by one call That would be my preferred way. I wouldn't want to let the user pin copies of state, and I wouldn't want to make open(O_PATH) do it automatically. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 21:54 ` David Howells @ 2020-04-01 8:43 ` Karel Zak 0 siblings, 0 replies; 99+ messages in thread From: Karel Zak @ 2020-04-01 8:43 UTC (permalink / raw) To: David Howells Cc: Miklos Szeredi, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Tue, Mar 31, 2020 at 10:54:23PM +0100, David Howells wrote: > Karel Zak <kzak@redhat.com> wrote: > > > - improve fsinfo() to provide set (list) of the attributes by one call > > That would be my preferred way. I wouldn't want to let the user pin copies of > state, and I wouldn't want to make open(O_PATH) do it automatically. You could create a COW object on the first fsinfo() call, ideally with some flags to control this behavior -- but you're right, this approach is complicated to implement and possibly dangerous. I guess returning a vector of attributes from one fsinfo() call will be good enough. Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-30 21:17 ` Christian Brauner 2020-03-31 5:11 ` Miklos Szeredi @ 2020-03-31 7:22 ` Lennart Poettering 1 sibling, 0 replies; 99+ messages in thread From: Lennart Poettering @ 2020-03-31 7:22 UTC (permalink / raw) To: Christian Brauner Cc: David Howells, torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel, cyphar On Mo, 30.03.20 23:17, Christian Brauner (christian.brauner@ubuntu.com) wrote: > Fwiw, putting down my kernel hat and speaking as someone who maintains > two container runtimes and various other low-level bits and pieces in > userspace who'd make heavy use of this stuff I would prefer the fd-based > fsinfo() approach especially in the light of across namespace > operations, querying all properties of a mount atomically all-at-once, > and safe delegation through fds. Another heavy user of this would be > systemd (Cced Lennart who I've discussed this with) which would prefer > the fd-based approach as well. I think pulling this into a filesystem > and making userspace parse around in a filesystem tree to query mount > information is the wrong approach and will get messy pretty quickly > especially in the face of mount and user namespace interactions and > various other pitfalls. fsinfo() fits quite nicely with the all-fd-based > approach of the whole mount api. So yes, definitely preferred from my > end. Christian is right. I think it's very important to have an API that allows to query the state of fs attributes in a consistent state, i.e. so that the attributes userspace is interested in can be queried in a single call, so they all describe the very same point in time. Distributing attributes onto multiple individual files just sucks, because it's then guaranteed that you never can read them in a way they actually fit together, some attributes you read will be older, others newer. 
It's a big design flaw of sysfs (which is structured like this) if you ask me. I don't really care if the kernel API for this is binary or textual. Slight preference for binary, but I don't care too much. I think it would be wise to bind such APIs to fds, simply because it always works. Doing path based stuff sucks, because you always need to mount stuff and have a path tree set up, which is less ideal in a world where namespacing is common, and namespaces are a shared concept (at least with your other threads, if not with other processes). As a maintainer of an init system I really dislike APIs that I can only use after a mount structure has been set up; too often we want to do stuff before that. Moreover, philosophically I find it questionable to use path based APIs to interface with the path object hierarchy itself. It feels "too recursive". Just keep this separate: build stuff on top of the fs that fits on top of the fs, but don't build fs APIs on top of fs APIs that stem from the same layer. Summary: atomic APIs rock, fd-based APIs rock. APIs built on individual files one can only read individually suck. APIs of the path layer exposed in the path layer suck. Hope this makes some sense? Lennart ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells ` (4 preceding siblings ...) 2020-03-30 21:17 ` Christian Brauner @ 2020-03-31 17:31 ` David Howells 2020-03-31 19:42 ` Miklos Szeredi ` (3 more replies) 2020-03-31 21:52 ` David Howells 6 siblings, 4 replies; 99+ messages in thread From: David Howells @ 2020-03-31 17:31 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel, linux-kernel Miklos Szeredi <miklos@szeredi.hu> wrote: > The basic problem in my view, is that the performance requirement of a > "get filesystem information" type of API just does not warrant a > binary coded interface. I've said this a number of times, but it fell > on deaf ears. It hasn't so fallen, but I don't necessarily agree with you. Let's pin some numbers on this. I'm using what I think is your latest patch, where you look up /proc/<pid>/fdinfo/<fd> to find a file that gives a summary of some information in "key: val" format, including a mount ID. You then have to look in a mounted mountfs magic filesystem for a directory corresponding to that mount ID that has a bunch of attribute files in it, most with a single attribute value. What I can do with it is, say, look up the mount ID of the object attached to a path - but that's about all because it doesn't implement anything like look-up-by-mount-ID or list-children.
Attached is a kernel patch, supplementary to the fsinfo patchset, that adds your implementation, fixed for coexistence with the mount notifications code, plus a sample program that creates N mounts and then sees how long it takes to query each of those mounts for its mnt_id by four different methods:

 (1) "f" - Use fsinfo, looking up each mount root directly by path.

 (2) "f2" - Use fsinfo, firstly using fsinfo() to look up the base mount by path, then use fsinfo() to get a list of all the children of that mount (which in fact gives me the mnt_id, but ignoring that), and then call fsinfo() by mount ID for each child to get its information, including its mnt_id.

 (3) "p" - Open the root of each mount with O_PATH and then open and read the procfile to retrieve information, then parse the received text to find the line with that key, then parse the line to get the number, allowing for the possibility that the line might acquire extra numbers.

 (4) "p2" - Open the root of the base mount with O_PATH, then read the appropriate file in /proc/fdinfo to find the base mount ID. Open "/mnt" O_PATH to use as a base. Then read <mntid>/children and parse the list to find each child. Each child's <mntid>/id file is then read.

Run the program like:

	mount -t mountfs none /mnt
	mkdir /tmp/a
	./test-fsinfo-perf /tmp/a 20000

Note that it detaches its base mount afterwards and lets it get cleaned up and systemd goes crazy for a bit. Note also that it prints the sum of all the mount IDs as a consistency check for each test.
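The text-parsing step of method (3) can be sketched as follows. This is not the code from the attached test program; it is a minimal illustration that assumes the fdinfo text has already been read into a NUL-terminated buffer, that the key is "mnt_id:" at the start of a line (as in /proc/<pid>/fdinfo/<fd>), and that any further numbers on that line are to be ignored. The function name fdinfo_mnt_id() is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Scan fdinfo-style "key:\tvalue" text for a line starting with "mnt_id:"
 * and return the first number on it, or -1 if the key is missing or is
 * not followed by a number on the same line.
 */
long fdinfo_mnt_id(const char *text)
{
	static const char key[] = "mnt_id:";
	const char *p = text;

	assert(text);

	while (p && *p) {
		if (strncmp(p, key, sizeof(key) - 1) == 0) {
			p += sizeof(key) - 1;
			/* Skip the blank or tab padding after the colon. */
			while (*p == ' ' || *p == '\t')
				p++;
			if (*p < '0' || *p > '9')
				return -1;
			/* strtol() stops at the first non-digit, so extra
			 * numbers later on the line are ignored. */
			return strtol(p, NULL, 10);
		}
		/* Advance to the start of the next line, if there is one. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}
```

Each query by this route costs an open(), a read() and a close() on the proc file plus this scan, on top of the O_PATH open of the mount root itself.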
Okay, the results:

  For  1000 mounts, f= 1514us f2= 1102us p=  6014us p2=  6935us; p=4.0*f p=5.5*f2 p=0.9*p2
  For  2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
  For  3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
  For  4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
  For  5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
  For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
  For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
  For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2

The number of mounts doesn't have an effect - not surprising with direct pathwalk-based approaches ("f" and "p") since the pathwalk part is the same in both cases, though in one fsinfo() does it and in the other, open(O_PATH).

As you can see, your procfs-based approach takes consistently about 4.4 times as long as fsinfo(QUERY_PATH) and 5.5 times as long as fsinfo(QUERY_MOUNT).

Going through mountfs ("p2") is even slower than going through procfs, though this really ought to be comparable to fsinfo-by-mount-ID ("f2"), but the latter is something like 6.5x faster.

I suspect the procfs-based and mountfs-based approaches suffer from creating lots of inodes, dentries and file structs as you access the files. This also means that they use more live state memory - and I think it lingers - if you start using them, whereas fsinfo() uses none at all, beyond whatever is used by the pathwalk to find the object to query (if you go that route).

mountfs is going to be worse also if you want more than one value if you persist in putting one attribute in each file.
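For anyone wanting to re-derive the multipliers quoted with the results, the arithmetic is just a ratio of the measured times. The sketch below is illustrative only; struct perf_row, slowdown() and print_ratios() are names invented for this example and are not part of the test program.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the results: elapsed microseconds for each method. */
struct perf_row {
	unsigned int mounts;
	double f_us, f2_us, p_us, p2_us;
};

/* How many times longer `slow` took than `fast`. */
double slowdown(double slow, double fast)
{
	assert(fast > 0.0);
	return slow / fast;
}

/* Print the ratios in the same "p=N*f" notation as the results. */
void print_ratios(const struct perf_row *r)
{
	printf("For %u mounts, p=%.1f*f p=%.1f*f2 p=%.1f*p2 p2=%.1f*f2\n",
	       r->mounts,
	       slowdown(r->p_us, r->f_us),
	       slowdown(r->p_us, r->f2_us),
	       slowdown(r->p_us, r->p2_us),
	       slowdown(r->p2_us, r->f2_us));
}
```

Feeding in the 20000-mount row (f=45811, f2=37211, p=203640, p2=237377) reproduces the p/f and p/f2 figures and also gives the p2/f2 ratio behind the "something like 6.5x" remark.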
David --- commit ed109ef4351d44a3e881e6518a207431113c17c0 Author: David Howells <dhowells@redhat.com> Date: Tue Mar 31 14:39:07 2020 +0100 Performance test Miklós's patch vs fsinfo diff --git a/fs/Makefile b/fs/Makefile index b6bf2424c7f7..ac0627176db1 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-y += mountfs/ diff --git a/fs/mount.h b/fs/mount.h index 063f41bc2e93..89b091fc482f 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -82,6 +82,7 @@ struct mount { atomic_t mnt_subtree_notifications; /* Number of notifications in subtree */ struct watch_list *mnt_watchers; /* Watches on dentries within this mount */ #endif + struct mountfs_entry *mnt_mountfs_entry; } __randomize_layout; #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */ @@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered, { } #endif + +void mnt_namespace_lock_read(void); +void mnt_namespace_unlock_read(void); + +void mountfs_create(struct mount *mnt); +extern void mountfs_remove(struct mount *mnt); +int mountfs_lookup_internal(struct vfsmount *m, struct path *path); + diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile new file mode 100644 index 000000000000..35a65e9a966f --- /dev/null +++ b/fs/mountfs/Makefile @@ -0,0 +1 @@ +obj-y += super.o diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c new file mode 100644 index 000000000000..82c01eb6154d --- /dev/null +++ b/fs/mountfs/super.c @@ -0,0 +1,502 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "../pnode.h" +#include <linux/fs.h> +#include <linux/kref.h> +#include <linux/nsproxy.h> +#include <linux/fs_struct.h> +#include <linux/fs_context.h> + +#define MOUNTFS_SUPER_MAGIC 0x4e756f4d + +static DEFINE_SPINLOCK(mountfs_lock); +static struct rb_root mountfs_entries = RB_ROOT; +static struct vfsmount *mountfs_mnt __read_mostly; + +struct 
mountfs_entry { + struct kref kref; + struct mount *mnt; + struct rb_node node; + int id; +}; + +static const char *mountfs_attrs[] = { + "root", "mountpoint", "id", "parent", "options", "children", + "group", "master", "propagate_from" +}; + +#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \ + (ARRAY_SIZE(mountfs_attrs) + 1)) + +void mountfs_entry_release(struct kref *kref) +{ + kfree(container_of(kref, struct mountfs_entry, kref)); +} + +void mountfs_entry_put(struct mountfs_entry *entry) +{ + kref_put(&entry->kref, mountfs_entry_release); +} + +static bool mountfs_entry_visible(struct mountfs_entry *entry) +{ + struct mount *mnt; + bool visible = false; + + rcu_read_lock(); + mnt = rcu_dereference(entry->mnt); + if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns) + visible = true; + rcu_read_unlock(); + + return visible; +} +static int mountfs_attr_show(struct seq_file *sf, void *v) +{ + const char *name = sf->file->f_path.dentry->d_name.name; + struct mountfs_entry *entry = sf->private; + struct mount *mnt; + struct vfsmount *m; + struct super_block *sb; + struct path root; + int tmp, err = -ENODEV; + + mnt_namespace_lock_read(); + + mnt = entry->mnt; + if (!mnt || !mnt->mnt_ns) + goto out; + + err = 0; + m = &mnt->mnt; + sb = m->mnt_sb; + + if (strcmp(name, "root") == 0) { + if (sb->s_op->show_path) { + err = sb->s_op->show_path(sf, m->mnt_root); + } else { + seq_dentry(sf, m->mnt_root, " \t\n\\"); + } + seq_putc(sf, '\n'); + } else if (strcmp(name, "mountpoint") == 0) { + struct path mnt_path = { .dentry = m->mnt_root, .mnt = m }; + + get_fs_root(current->fs, &root); + err = seq_path_root(sf, &mnt_path, &root, " \t\n\\"); + if (err == SEQ_SKIP) { + seq_puts(sf, "(unreachable)"); + err = 0; + } + seq_putc(sf, '\n'); + path_put(&root); + } else if (strcmp(name, "id") == 0) { + seq_printf(sf, "%i\n", mnt->mnt_id); + } else if (strcmp(name, "parent") == 0) { + tmp = rcu_dereference(mnt->mnt_parent)->mnt_id; + seq_printf(sf, "%i\n", tmp); + } else if 
(strcmp(name, "options") == 0) { + int mnt_flags = READ_ONCE(m->mnt_flags); + + seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw"); + seq_mnt_opts(sf, mnt_flags); + seq_putc(sf, '\n'); + } else if (strcmp(name, "children") == 0) { + struct mount *child; + bool first = true; + + list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { + if (!first) + seq_putc(sf, ','); + else + first = false; + seq_printf(sf, "%i", child->mnt_id); + } + if (!first) + seq_putc(sf, '\n'); + } else if (strcmp(name, "group") == 0) { + if (IS_MNT_SHARED(mnt)) + seq_printf(sf, "%i\n", mnt->mnt_group_id); + } else if (strcmp(name, "master") == 0) { + if (IS_MNT_SLAVE(mnt)) { + tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id; + seq_printf(sf, "%i\n", tmp); + } + } else if (strcmp(name, "propagate_from") == 0) { + if (IS_MNT_SLAVE(mnt)) { + get_fs_root(current->fs, &root); + tmp = get_dominating_id(mnt, &root); + if (tmp) + seq_printf(sf, "%i\n", tmp); + } + } else { + WARN_ON(1); + err = -EIO; + } +out: + mnt_namespace_unlock_read(); + + return err; +} + +static int mountfs_attr_open(struct inode *inode, struct file *file) +{ + return single_open(file, mountfs_attr_show, inode->i_private); +} + +static const struct file_operations mountfs_attr_fops = { + .open = mountfs_attr_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node) +{ + return rb_entry(node, struct mountfs_entry, node); +} + +static struct rb_node **mountfs_find_node(int id, struct rb_node **parent) +{ + struct rb_node **link = &mountfs_entries.rb_node; + + *parent = NULL; + while (*link) { + struct mountfs_entry *entry = mountfs_node_to_entry(*link); + + *parent = *link; + if (id < entry->id) + link = &entry->node.rb_left; + else if (id > entry->id) + link = &entry->node.rb_right; + else + break; + } + return link; +} + +void mountfs_create(struct mount *mnt) +{ + struct mountfs_entry *entry; + struct rb_node **link, 
*parent; + + entry = kzalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) { + WARN(1, "failed to allocate mountfs entry"); + return; + } + kref_init(&entry->kref); + entry->mnt = mnt; + entry->id = mnt->mnt_id; + + spin_lock(&mountfs_lock); + link = mountfs_find_node(entry->id, &parent); + if (!WARN_ON(*link)) { + rb_link_node(&entry->node, parent, link); + rb_insert_color(&entry->node, &mountfs_entries); + mnt->mnt_mountfs_entry = entry; + } else { + kfree(entry); + } + spin_unlock(&mountfs_lock); +} + +void mountfs_remove(struct mount *mnt) +{ + struct mountfs_entry *entry = mnt->mnt_mountfs_entry; + + if (!entry) + return; + spin_lock(&mountfs_lock); + entry->mnt = NULL; + rb_erase(&entry->node, &mountfs_entries); + spin_unlock(&mountfs_lock); + + mountfs_entry_put(entry); + + mnt->mnt_mountfs_entry = NULL; +} + +static struct mountfs_entry *mountfs_get_entry(const char *name) +{ + struct mountfs_entry *entry = NULL; + struct rb_node **link, *dummy; + unsigned long mnt_id; + char buf[32]; + int ret; + + ret = kstrtoul(name, 10, &mnt_id); + if (ret || mnt_id > INT_MAX) + return NULL; + + snprintf(buf, sizeof(buf), "%lu", mnt_id); + if (strcmp(buf, name) != 0) + return NULL; + + spin_lock(&mountfs_lock); + link = mountfs_find_node(mnt_id, &dummy); + if (*link) { + entry = mountfs_node_to_entry(*link); + if (!mountfs_entry_visible(entry)) + entry = NULL; + else + kref_get(&entry->kref); + } + spin_unlock(&mountfs_lock); + + return entry; +} + +static void mountfs_init_inode(struct inode *inode, umode_t mode); + +static struct dentry *mountfs_lookup_entry(struct dentry *dentry, + struct mountfs_entry *entry, + int idx) +{ + struct inode *inode; + + inode = new_inode(dentry->d_sb); + if (!inode) { + mountfs_entry_put(entry); + return ERR_PTR(-ENOMEM); + } + inode->i_private = entry; + inode->i_ino = MOUNTFS_INO(entry->id) + idx; + mountfs_init_inode(inode, idx ? 
S_IFREG | 0444 : S_IFDIR | 0555); + return d_splice_alias(inode, dentry); + +} + +static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry, + unsigned int flags) +{ + struct mountfs_entry *entry = dir->i_private; + int i = 0; + + if (entry) { + for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++) + if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0) + break; + if (i == ARRAY_SIZE(mountfs_attrs)) + return ERR_PTR(-ENOENT); + i++; + kref_get(&entry->kref); + } else { + entry = mountfs_get_entry(dentry->d_name.name); + if (!entry) + return ERR_PTR(-ENOENT); + } + + return mountfs_lookup_entry(dentry, entry, i); +} + +static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct mountfs_entry *entry = dentry->d_inode->i_private; + + /* root: valid */ + if (!entry) + return 1; + + /* removed: invalid */ + if (!entry->mnt) + return 0; + + /* attribute or visible in this namespace: valid */ + if (!d_can_lookup(dentry) || mountfs_entry_visible(entry)) + return 1; + + /* invisible in this namespace: valid but deny entry */ + return -ENOENT; +} + +static int mountfs_readdir(struct file *file, struct dir_context *ctx) +{ + struct rb_node *node; + struct mountfs_entry *entry = file_inode(file)->i_private; + char name[32]; + const char *s; + unsigned int len, pos, id; + + if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx)) + return 0; + + if (entry) { + while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) { + s = mountfs_attrs[ctx->pos - 2]; + if (!dir_emit(ctx, s, strlen(s), + MOUNTFS_INO(entry->id) + ctx->pos, + DT_REG)) + break; + ctx->pos++; + } + return 0; + } + + pos = ctx->pos - 2; + do { + spin_lock(&mountfs_lock); + mountfs_find_node(pos, &node); + pos = 1U + INT_MAX; + do { + if (!node) { + spin_unlock(&mountfs_lock); + goto out; + } + entry = mountfs_node_to_entry(node); + node = rb_next(node); + } while (!mountfs_entry_visible(entry)); + if (node) + pos = mountfs_node_to_entry(node)->id; + id = entry->id; +
spin_unlock(&mountfs_lock); + + len = snprintf(name, sizeof(name), "%i", id); + ctx->pos = id + 2; + if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR)) + return 0; + } while (pos <= INT_MAX); +out: + ctx->pos = pos + 2; + return 0; +} + +int mountfs_lookup_internal(struct vfsmount *m, struct path *path) +{ + char name[32]; + struct qstr this = { .name = name }; + struct mount *mnt = real_mount(m); + struct mountfs_entry *entry = mnt->mnt_mountfs_entry; + struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root; + + this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id); + dentry = d_hash_and_lookup(root, &this); + if (dentry && dentry->d_inode->i_private != entry) { + d_invalidate(dentry); + dput(dentry); + dentry = NULL; + } + if (!dentry) { + dentry = d_alloc(root, &this); + if (!dentry) + return -ENOMEM; + + kref_get(&entry->kref); + old = mountfs_lookup_entry(dentry, entry, 0); + if (old) { + dput(dentry); + if (IS_ERR(old)) + return PTR_ERR(old); + dentry = old; + } + } + + *path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry }; + return 0; +} + +static const struct dentry_operations mountfs_dops = { + .d_revalidate = mountfs_d_revalidate, +}; + +static const struct inode_operations mountfs_iops = { + .lookup = mountfs_lookup, +}; + +static const struct file_operations mountfs_fops = { + .iterate_shared = mountfs_readdir, + .read = generic_read_dir, + .llseek = generic_file_llseek, +}; + +static void mountfs_init_inode(struct inode *inode, umode_t mode) +{ + inode->i_mode = mode; + inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); + if (S_ISREG(mode)) { + inode->i_size = PAGE_SIZE; + inode->i_fop = &mountfs_attr_fops; + } else { + inode->i_op = &mountfs_iops; + inode->i_fop = &mountfs_fops; + } +} + +static void mountfs_evict_inode(struct inode *inode) +{ + struct mountfs_entry *entry = inode->i_private; + + clear_inode(inode); + if (entry) + mountfs_entry_put(entry); +} + +static const struct super_operations 
mountfs_sops = { + .statfs = simple_statfs, + .drop_inode = generic_delete_inode, + .evict_inode = mountfs_evict_inode, +}; + +static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *root; + + sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV; + sb->s_blocksize = PAGE_SIZE; + sb->s_blocksize_bits = PAGE_SHIFT; + sb->s_magic = MOUNTFS_SUPER_MAGIC; + sb->s_time_gran = 1; + sb->s_shrink.seeks = 0; + sb->s_op = &mountfs_sops; + sb->s_d_op = &mountfs_dops; + + root = new_inode(sb); + if (!root) + return -ENOMEM; + + root->i_ino = 1; + mountfs_init_inode(root, S_IFDIR | 0444); + + sb->s_root = d_make_root(root); + if (!sb->s_root) + return -ENOMEM; + + return 0; +} + +static int mountfs_get_tree(struct fs_context *fc) +{ + return get_tree_single(fc, mountfs_fill_super); +} + +static const struct fs_context_operations mountfs_context_ops = { + .get_tree = mountfs_get_tree, +}; + +static int mountfs_init_fs_context(struct fs_context *fc) +{ + fc->ops = &mountfs_context_ops; + fc->global = true; + return 0; +} + +static struct file_system_type mountfs_fs_type = { + .name = "mountfs", + .init_fs_context = mountfs_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static int __init mountfs_init(void) +{ + int err; + + err = register_filesystem(&mountfs_fs_type); + if (!err) { + mountfs_mnt = kern_mount(&mountfs_fs_type); + if (IS_ERR(mountfs_mnt)) { + err = PTR_ERR(mountfs_mnt); + unregister_filesystem(&mountfs_fs_type); + } + } + return err; +} +fs_initcall(mountfs_init); diff --git a/fs/namespace.c b/fs/namespace.c index 5427e732c1bf..a05a2885a90e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; + else + mountfs_create(mnt); atomic_inc(&fc->root->d_sb->s_active); mnt->mnt.mnt_sb = fc->root->d_sb; @@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type, } 
EXPORT_SYMBOL_GPL(vfs_submount); -static struct mount *clone_mnt(struct mount *old, struct dentry *root, +static struct mount *clone_mnt_common(struct mount *old, struct dentry *root, int flag) { struct super_block *sb = old->mnt.mnt_sb; @@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, return ERR_PTR(err); } +static struct mount *clone_mnt(struct mount *old, struct dentry *root, + int flag) +{ + struct mount *mnt = clone_mnt_common(old, root, flag); + + if (!IS_ERR(mnt)) + mountfs_create(mnt); + + return mnt; +} + static void cleanup_mnt(struct mount *mnt) { struct hlist_node *p; @@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt) * so mnt_get_writers() below is safe. */ WARN_ON(mnt_get_writers(mnt)); + if (unlikely(mnt->mnt_pins.first)) mnt_pin_kill(mnt); hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) { @@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt) unlock_mount_hash(); shrink_dentry_list(&list); + mountfs_remove(mnt); + if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { struct task_struct *task = current; if (likely(!(task->flags & PF_KTHREAD))) { @@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint); struct vfsmount *mnt_clone_internal(const struct path *path) { struct mount *p; - p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE); + p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE); if (IS_ERR(p)) return ERR_CAST(p); p->mnt.mnt_flags |= MNT_INTERNAL; return &p->mnt; } + #ifdef CONFIG_PROC_FS /* iterator; we want it to have access to namespace_sem, thus here... 
*/ static void *m_start(struct seq_file *m, loff_t *pos) @@ -1411,6 +1428,16 @@ static inline void namespace_lock(void) down_write(&namespace_sem); } +void mnt_namespace_lock_read(void) +{ + down_read(&namespace_sem); +} + +void mnt_namespace_unlock_read(void) +{ + up_read(&namespace_sem); +} + enum umount_tree_flags { UMOUNT_SYNC = 1, UMOUNT_PROPAGATE = 2, diff --git a/fs/proc/base.c b/fs/proc/base.c index c7c64272b0fa..0477f8b51182 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = { DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations), DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), @@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = { static const struct pid_entry tid_base_stuff[] = { DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 81882a13212d..94a57e178801 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = { .iterate_shared = proc_readfdinfo, .llseek = generic_file_llseek, }; + +static int proc_fdmount_link(struct dentry *dentry, struct path *path) +{ 
+ struct files_struct *files = NULL; + struct task_struct *task; + struct path fd_path; + int ret = -ENOENT; + + task = get_proc_task(d_inode(dentry)); + if (task) { + files = get_files_struct(task); + put_task_struct(task); + } + + if (files) { + unsigned int fd = proc_fd(d_inode(dentry)); + struct file *fd_file; + + spin_lock(&files->file_lock); + fd_file = fcheck_files(files, fd); + if (fd_file) { + fd_path = fd_file->f_path; + path_get(&fd_path); + ret = 0; + } + spin_unlock(&files->file_lock); + put_files_struct(files); + } + if (!ret) { + ret = mountfs_lookup_internal(fd_path.mnt, path); + path_put(&fd_path); + } + + return ret; +} + +static struct dentry *proc_fdmount_instantiate(struct dentry *dentry, + struct task_struct *task, const void *ptr) +{ + const struct fd_data *data = ptr; + struct proc_inode *ei; + struct inode *inode; + + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400); + if (!inode) + return ERR_PTR(-ENOENT); + + ei = PROC_I(inode); + ei->fd = data->fd; + + inode->i_op = &proc_pid_link_inode_operations; + inode->i_size = 64; + + ei->op.proc_get_link = proc_fdmount_link; + tid_fd_update_inode(task, inode, 0); + + d_set_d_op(dentry, &tid_fd_dentry_operations); + return d_splice_alias(inode, dentry); +} + +static struct dentry * +proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags) +{ + return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate); +} + +static int proc_readfdmount(struct file *file, struct dir_context *ctx) +{ + return proc_readfd_common(file, ctx, + proc_fdmount_instantiate); +} + +const struct inode_operations proc_fdmount_inode_operations = { + .lookup = proc_lookupfdmount, + .setattr = proc_setattr, +}; + +const struct file_operations proc_fdmount_operations = { + .read = generic_read_dir, + .iterate_shared = proc_readfdmount, + .llseek = generic_file_llseek, +}; diff --git a/fs/proc/fd.h b/fs/proc/fd.h index f371a602bf58..9e087c833e65 100644 --- a/fs/proc/fd.h +++ 
b/fs/proc/fd.h @@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations; extern const struct file_operations proc_fdinfo_operations; extern const struct inode_operations proc_fdinfo_inode_operations; +extern const struct file_operations proc_fdmount_operations; +extern const struct inode_operations proc_fdmount_inode_operations; + extern int proc_fd_permission(struct inode *inode, int mask); static inline unsigned int proc_fd(struct inode *inode) diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 273ee82d8aa9..e634faa9160e 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb) return security_sb_show_options(m, sb); } -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) -{ - static const struct proc_fs_info mnt_info[] = { - { MNT_NOSUID, ",nosuid" }, - { MNT_NODEV, ",nodev" }, - { MNT_NOEXEC, ",noexec" }, - { MNT_NOATIME, ",noatime" }, - { MNT_NODIRATIME, ",nodiratime" }, - { MNT_RELATIME, ",relatime" }, - { 0, NULL } - }; - const struct proc_fs_info *fs_infop; - - for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) { - if (mnt->mnt_flags & fs_infop->flag) - seq_puts(m, fs_infop->str); - } -} static inline void mangle(struct seq_file *m, const char *s) { @@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) err = show_sb_opts(m, sb); if (err) goto out; - show_mnt_opts(m, mnt); + seq_mnt_opts(m, mnt->mnt_flags); if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt_path.dentry); seq_puts(m, " 0 0\n"); @@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt) goto out; seq_puts(m, mnt->mnt_flags & MNT_READONLY ? 
" ro" : " rw"); - show_mnt_opts(m, mnt); + seq_mnt_opts(m, mnt->mnt_flags); /* Tagged fields ("foo:X" or "bar") */ if (IS_MNT_SHARED(r)) diff --git a/fs/seq_file.c b/fs/seq_file.c index 1600034a929b..9726baba1732 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -15,6 +15,7 @@ #include <linux/cred.h> #include <linux/mm.h> #include <linux/printk.h> +#include <linux/mount.h> #include <linux/string_helpers.h> #include <linux/uaccess.h> @@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc) } EXPORT_SYMBOL(seq_dentry); +void seq_mnt_opts(struct seq_file *m, int mnt_flags) +{ + unsigned int i; + static const struct { + int flag; + const char *str; + } mnt_info[] = { + { MNT_NOSUID, ",nosuid" }, + { MNT_NODEV, ",nodev" }, + { MNT_NOEXEC, ",noexec" }, + { MNT_NOATIME, ",noatime" }, + { MNT_NODIRATIME, ",nodiratime" }, + { MNT_RELATIME, ",relatime" }, + { 0, NULL } + }; + + for (i = 0; mnt_info[i].flag; i++) { + if (mnt_flags & mnt_info[i].flag) + seq_puts(m, mnt_info[i].str); + } +} + static void *single_start(struct seq_file *p, loff_t *pos) { return NULL + (*pos == 0); diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 770c2bf3aa43..9dd7812eb777 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *); int seq_dentry(struct seq_file *, struct dentry *, const char *); int seq_path_root(struct seq_file *m, const struct path *path, const struct path *root, const char *esc); +void seq_mnt_opts(struct seq_file *m, int mnt_flags); int single_open(struct file *, int (*)(struct seq_file *, void *), void *); int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t); diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile index 19be60ab950e..78deb8483d27 100644 --- a/samples/vfs/Makefile +++ b/samples/vfs/Makefile @@ -4,6 +4,7 @@ hostprogs := \ test-fsinfo \ test-fsmount \ + test-fsinfo-perf \ 
test-mntinfo \ test-statx @@ -12,6 +13,7 @@ always-y := $(hostprogs) HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include HOSTLDLIBS_test-fsinfo += -static -lm HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include +HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Miklos Szeredi @ 2020-03-31 19:42 UTC
To: David Howells
Cc: Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel, linux-kernel

On Tue, Mar 31, 2020 at 7:31 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > The basic problem in my view, is that the performance requirement of a
> > "get filesystem information" type of API just does not warrant a
> > binary coded interface. I've said this a number of times, but it fell
> > on deaf ears.
>
> It hasn't so fallen, but don't necessarily agree with you. Let's pin some
> numbers on this.

Cool, thanks for testing. Unfortunately the test-fsinfo-perf.c file
didn't make it into the patch. Can you please refresh and resend?
> Okay, the results:
>
> For  1000 mounts, f= 1514us f2= 1102us p=  6014us p2=  6935us; p=4.0*f p=5.5*f2 p=0.9*p2
> For  2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
> For  3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
> For  4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
> For  5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
> For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
> For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
> For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2

So even the p2 method will give at least 80k queries/s, which is quite
good, considering that the need to rescan the complete mount tree
should be exceedingly rare (and in case it mattered, could be optimized
by priming from /proc/self/mountinfo).

Thanks,
Miklos
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-03-31 19:47 UTC
To: Miklos Szeredi
Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Cool, thanks for testing. Unfortunately the test-fsinfo-perf.c file
> didn't make it into the patch. Can you please refresh and resend?

Oops - I forgot to add it. See attached.

David
---
commit b7239021cb7660bf328bb7fcce05e3a35ce5842b
Author: David Howells <dhowells@redhat.com>
Date: Tue Mar 31 14:39:07 2020 +0100

Performance test Miklós's patch vs fsinfo

diff --git a/fs/Makefile b/fs/Makefile index b6bf2424c7f7..ac0627176db1 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-y += mountfs/ diff --git a/fs/mount.h b/fs/mount.h index 063f41bc2e93..89b091fc482f 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -82,6 +82,7 @@ struct mount { atomic_t mnt_subtree_notifications; /* Number of notifications in subtree */ struct watch_list *mnt_watchers; /* Watches on dentries within this mount */ #endif + struct mountfs_entry *mnt_mountfs_entry; } __randomize_layout; #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */ @@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered, { } #endif + +void mnt_namespace_lock_read(void); +void
mnt_namespace_unlock_read(void); + +void mountfs_create(struct mount *mnt); +extern void mountfs_remove(struct mount *mnt); +int mountfs_lookup_internal(struct vfsmount *m, struct path *path); + diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile new file mode 100644 index 000000000000..35a65e9a966f --- /dev/null +++ b/fs/mountfs/Makefile @@ -0,0 +1 @@ +obj-y += super.o diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c new file mode 100644 index 000000000000..82c01eb6154d --- /dev/null +++ b/fs/mountfs/super.c @@ -0,0 +1,502 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "../pnode.h" +#include <linux/fs.h> +#include <linux/kref.h> +#include <linux/nsproxy.h> +#include <linux/fs_struct.h> +#include <linux/fs_context.h> + +#define MOUNTFS_SUPER_MAGIC 0x4e756f4d + +static DEFINE_SPINLOCK(mountfs_lock); +static struct rb_root mountfs_entries = RB_ROOT; +static struct vfsmount *mountfs_mnt __read_mostly; + +struct mountfs_entry { + struct kref kref; + struct mount *mnt; + struct rb_node node; + int id; +}; + +static const char *mountfs_attrs[] = { + "root", "mountpoint", "id", "parent", "options", "children", + "group", "master", "propagate_from" +}; + +#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \ + (ARRAY_SIZE(mountfs_attrs) + 1)) + +void mountfs_entry_release(struct kref *kref) +{ + kfree(container_of(kref, struct mountfs_entry, kref)); +} + +void mountfs_entry_put(struct mountfs_entry *entry) +{ + kref_put(&entry->kref, mountfs_entry_release); +} + +static bool mountfs_entry_visible(struct mountfs_entry *entry) +{ + struct mount *mnt; + bool visible = false; + + rcu_read_lock(); + mnt = rcu_dereference(entry->mnt); + if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns) + visible = true; + rcu_read_unlock(); + + return visible; +} +static int mountfs_attr_show(struct seq_file *sf, void *v) +{ + const char *name = sf->file->f_path.dentry->d_name.name; + struct mountfs_entry *entry = sf->private; + struct mount *mnt; + struct vfsmount *m; + 
struct super_block *sb; + struct path root; + int tmp, err = -ENODEV; + + mnt_namespace_lock_read(); + + mnt = entry->mnt; + if (!mnt || !mnt->mnt_ns) + goto out; + + err = 0; + m = &mnt->mnt; + sb = m->mnt_sb; + + if (strcmp(name, "root") == 0) { + if (sb->s_op->show_path) { + err = sb->s_op->show_path(sf, m->mnt_root); + } else { + seq_dentry(sf, m->mnt_root, " \t\n\\"); + } + seq_putc(sf, '\n'); + } else if (strcmp(name, "mountpoint") == 0) { + struct path mnt_path = { .dentry = m->mnt_root, .mnt = m }; + + get_fs_root(current->fs, &root); + err = seq_path_root(sf, &mnt_path, &root, " \t\n\\"); + if (err == SEQ_SKIP) { + seq_puts(sf, "(unreachable)"); + err = 0; + } + seq_putc(sf, '\n'); + path_put(&root); + } else if (strcmp(name, "id") == 0) { + seq_printf(sf, "%i\n", mnt->mnt_id); + } else if (strcmp(name, "parent") == 0) { + tmp = rcu_dereference(mnt->mnt_parent)->mnt_id; + seq_printf(sf, "%i\n", tmp); + } else if (strcmp(name, "options") == 0) { + int mnt_flags = READ_ONCE(m->mnt_flags); + + seq_puts(sf, mnt_flags & MNT_READONLY ? 
"ro" : "rw"); + seq_mnt_opts(sf, mnt_flags); + seq_putc(sf, '\n'); + } else if (strcmp(name, "children") == 0) { + struct mount *child; + bool first = true; + + list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { + if (!first) + seq_putc(sf, ','); + else + first = false; + seq_printf(sf, "%i", child->mnt_id); + } + if (!first) + seq_putc(sf, '\n'); + } else if (strcmp(name, "group") == 0) { + if (IS_MNT_SHARED(mnt)) + seq_printf(sf, "%i\n", mnt->mnt_group_id); + } else if (strcmp(name, "master") == 0) { + if (IS_MNT_SLAVE(mnt)) { + tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id; + seq_printf(sf, "%i\n", tmp); + } + } else if (strcmp(name, "propagate_from") == 0) { + if (IS_MNT_SLAVE(mnt)) { + get_fs_root(current->fs, &root); + tmp = get_dominating_id(mnt, &root); + if (tmp) + seq_printf(sf, "%i\n", tmp); + } + } else { + WARN_ON(1); + err = -EIO; + } +out: + mnt_namespace_unlock_read(); + + return err; +} + +static int mountfs_attr_open(struct inode *inode, struct file *file) +{ + return single_open(file, mountfs_attr_show, inode->i_private); +} + +static const struct file_operations mountfs_attr_fops = { + .open = mountfs_attr_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node) +{ + return rb_entry(node, struct mountfs_entry, node); +} + +static struct rb_node **mountfs_find_node(int id, struct rb_node **parent) +{ + struct rb_node **link = &mountfs_entries.rb_node; + + *parent = NULL; + while (*link) { + struct mountfs_entry *entry = mountfs_node_to_entry(*link); + + *parent = *link; + if (id < entry->id) + link = &entry->node.rb_left; + else if (id > entry->id) + link = &entry->node.rb_right; + else + break; + } + return link; +} + +void mountfs_create(struct mount *mnt) +{ + struct mountfs_entry *entry; + struct rb_node **link, *parent; + + entry = kzalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) { + WARN(1, "failed to allocate mountfs 
entry"); + return; + } + kref_init(&entry->kref); + entry->mnt = mnt; + entry->id = mnt->mnt_id; + + spin_lock(&mountfs_lock); + link = mountfs_find_node(entry->id, &parent); + if (!WARN_ON(*link)) { + rb_link_node(&entry->node, parent, link); + rb_insert_color(&entry->node, &mountfs_entries); + mnt->mnt_mountfs_entry = entry; + } else { + kfree(entry); + } + spin_unlock(&mountfs_lock); +} + +void mountfs_remove(struct mount *mnt) +{ + struct mountfs_entry *entry = mnt->mnt_mountfs_entry; + + if (!entry) + return; + spin_lock(&mountfs_lock); + entry->mnt = NULL; + rb_erase(&entry->node, &mountfs_entries); + spin_unlock(&mountfs_lock); + + mountfs_entry_put(entry); + + mnt->mnt_mountfs_entry = NULL; +} + +static struct mountfs_entry *mountfs_get_entry(const char *name) +{ + struct mountfs_entry *entry = NULL; + struct rb_node **link, *dummy; + unsigned long mnt_id; + char buf[32]; + int ret; + + ret = kstrtoul(name, 10, &mnt_id); + if (ret || mnt_id > INT_MAX) + return NULL; + + snprintf(buf, sizeof(buf), "%lu", mnt_id); + if (strcmp(buf, name) != 0) + return NULL; + + spin_lock(&mountfs_lock); + link = mountfs_find_node(mnt_id, &dummy); + if (*link) { + entry = mountfs_node_to_entry(*link); + if (!mountfs_entry_visible(entry)) + entry = NULL; + else + kref_get(&entry->kref); + } + spin_unlock(&mountfs_lock); + + return entry; +} + +static void mountfs_init_inode(struct inode *inode, umode_t mode); + +static struct dentry *mountfs_lookup_entry(struct dentry *dentry, + struct mountfs_entry *entry, + int idx) +{ + struct inode *inode; + + inode = new_inode(dentry->d_sb); + if (!inode) { + mountfs_entry_put(entry); + return ERR_PTR(-ENOMEM); + } + inode->i_private = entry; + inode->i_ino = MOUNTFS_INO(entry->id) + idx; + mountfs_init_inode(inode, idx ? 
S_IFREG | 0444 : S_IFDIR | 0555); + return d_splice_alias(inode, dentry); + +} + +static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry, + unsigned int flags) +{ + struct mountfs_entry *entry = dir->i_private; + int i = 0; + + if (entry) { + for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++) + if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0) + break; + if (i == ARRAY_SIZE(mountfs_attrs)) + return ERR_PTR(-ENOENT); + i++; + kref_get(&entry->kref); + } else { + entry = mountfs_get_entry(dentry->d_name.name); + if (!entry) + return ERR_PTR(-ENOENT); + } + + return mountfs_lookup_entry(dentry, entry, i); +} + +static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct mountfs_entry *entry = dentry->d_inode->i_private; + + /* root: valid */ + if (!entry) + return 1; + + /* removed: invalid */ + if (!entry->mnt) + return 0; + + /* attribute or visible in this namespace: valid */ + if (!d_can_lookup(dentry) || mountfs_entry_visible(entry)) + return 1; + + /* invisible in this namespace: valid but deny entry */ + return -ENOENT; +} + +static int mountfs_readdir(struct file *file, struct dir_context *ctx) +{ + struct rb_node *node; + struct mountfs_entry *entry = file_inode(file)->i_private; + char name[32]; + const char *s; + unsigned int len, pos, id; + + if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx)) + return 0; + + if (entry) { + while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) { + s = mountfs_attrs[ctx->pos - 2]; + if (!dir_emit(ctx, s, strlen(s), + MOUNTFS_INO(entry->id) + ctx->pos, + DT_REG)) + break; + ctx->pos++; + } + return 0; + } + + pos = ctx->pos - 2; + do { + spin_lock(&mountfs_lock); + mountfs_find_node(pos, &node); + pos = 1U + INT_MAX; + do { + if (!node) { + spin_unlock(&mountfs_lock); + goto out; + } + entry = mountfs_node_to_entry(node); + node = rb_next(node); + } while (!mountfs_entry_visible(entry)); + if (node) + pos = mountfs_node_to_entry(node)->id; + id = entry->id; +
spin_unlock(&mountfs_lock); + + len = snprintf(name, sizeof(name), "%i", id); + ctx->pos = id + 2; + if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR)) + return 0; + } while (pos <= INT_MAX); +out: + ctx->pos = pos + 2; + return 0; +} + +int mountfs_lookup_internal(struct vfsmount *m, struct path *path) +{ + char name[32]; + struct qstr this = { .name = name }; + struct mount *mnt = real_mount(m); + struct mountfs_entry *entry = mnt->mnt_mountfs_entry; + struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root; + + this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id); + dentry = d_hash_and_lookup(root, &this); + if (dentry && dentry->d_inode->i_private != entry) { + d_invalidate(dentry); + dput(dentry); + dentry = NULL; + } + if (!dentry) { + dentry = d_alloc(root, &this); + if (!dentry) + return -ENOMEM; + + kref_get(&entry->kref); + old = mountfs_lookup_entry(dentry, entry, 0); + if (old) { + dput(dentry); + if (IS_ERR(old)) + return PTR_ERR(old); + dentry = old; + } + } + + *path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry }; + return 0; +} + +static const struct dentry_operations mountfs_dops = { + .d_revalidate = mountfs_d_revalidate, +}; + +static const struct inode_operations mountfs_iops = { + .lookup = mountfs_lookup, +}; + +static const struct file_operations mountfs_fops = { + .iterate_shared = mountfs_readdir, + .read = generic_read_dir, + .llseek = generic_file_llseek, +}; + +static void mountfs_init_inode(struct inode *inode, umode_t mode) +{ + inode->i_mode = mode; + inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); + if (S_ISREG(mode)) { + inode->i_size = PAGE_SIZE; + inode->i_fop = &mountfs_attr_fops; + } else { + inode->i_op = &mountfs_iops; + inode->i_fop = &mountfs_fops; + } +} + +static void mountfs_evict_inode(struct inode *inode) +{ + struct mountfs_entry *entry = inode->i_private; + + clear_inode(inode); + if (entry) + mountfs_entry_put(entry); +} + +static const struct super_operations 
mountfs_sops = { + .statfs = simple_statfs, + .drop_inode = generic_delete_inode, + .evict_inode = mountfs_evict_inode, +}; + +static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *root; + + sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV; + sb->s_blocksize = PAGE_SIZE; + sb->s_blocksize_bits = PAGE_SHIFT; + sb->s_magic = MOUNTFS_SUPER_MAGIC; + sb->s_time_gran = 1; + sb->s_shrink.seeks = 0; + sb->s_op = &mountfs_sops; + sb->s_d_op = &mountfs_dops; + + root = new_inode(sb); + if (!root) + return -ENOMEM; + + root->i_ino = 1; + mountfs_init_inode(root, S_IFDIR | 0444); + + sb->s_root = d_make_root(root); + if (!sb->s_root) + return -ENOMEM; + + return 0; +} + +static int mountfs_get_tree(struct fs_context *fc) +{ + return get_tree_single(fc, mountfs_fill_super); +} + +static const struct fs_context_operations mountfs_context_ops = { + .get_tree = mountfs_get_tree, +}; + +static int mountfs_init_fs_context(struct fs_context *fc) +{ + fc->ops = &mountfs_context_ops; + fc->global = true; + return 0; +} + +static struct file_system_type mountfs_fs_type = { + .name = "mountfs", + .init_fs_context = mountfs_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static int __init mountfs_init(void) +{ + int err; + + err = register_filesystem(&mountfs_fs_type); + if (!err) { + mountfs_mnt = kern_mount(&mountfs_fs_type); + if (IS_ERR(mountfs_mnt)) { + err = PTR_ERR(mountfs_mnt); + unregister_filesystem(&mountfs_fs_type); + } + } + return err; +} +fs_initcall(mountfs_init); diff --git a/fs/namespace.c b/fs/namespace.c index 5427e732c1bf..a05a2885a90e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; + else + mountfs_create(mnt); atomic_inc(&fc->root->d_sb->s_active); mnt->mnt.mnt_sb = fc->root->d_sb; @@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type, } 
EXPORT_SYMBOL_GPL(vfs_submount); -static struct mount *clone_mnt(struct mount *old, struct dentry *root, +static struct mount *clone_mnt_common(struct mount *old, struct dentry *root, int flag) { struct super_block *sb = old->mnt.mnt_sb; @@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, return ERR_PTR(err); } +static struct mount *clone_mnt(struct mount *old, struct dentry *root, + int flag) +{ + struct mount *mnt = clone_mnt_common(old, root, flag); + + if (!IS_ERR(mnt)) + mountfs_create(mnt); + + return mnt; +} + static void cleanup_mnt(struct mount *mnt) { struct hlist_node *p; @@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt) * so mnt_get_writers() below is safe. */ WARN_ON(mnt_get_writers(mnt)); + if (unlikely(mnt->mnt_pins.first)) mnt_pin_kill(mnt); hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) { @@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt) unlock_mount_hash(); shrink_dentry_list(&list); + mountfs_remove(mnt); + if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { struct task_struct *task = current; if (likely(!(task->flags & PF_KTHREAD))) { @@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint); struct vfsmount *mnt_clone_internal(const struct path *path) { struct mount *p; - p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE); + p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE); if (IS_ERR(p)) return ERR_CAST(p); p->mnt.mnt_flags |= MNT_INTERNAL; return &p->mnt; } + #ifdef CONFIG_PROC_FS /* iterator; we want it to have access to namespace_sem, thus here... 
*/ static void *m_start(struct seq_file *m, loff_t *pos) @@ -1411,6 +1428,16 @@ static inline void namespace_lock(void) down_write(&namespace_sem); } +void mnt_namespace_lock_read(void) +{ + down_read(&namespace_sem); +} + +void mnt_namespace_unlock_read(void) +{ + up_read(&namespace_sem); +} + enum umount_tree_flags { UMOUNT_SYNC = 1, UMOUNT_PROPAGATE = 2, diff --git a/fs/proc/base.c b/fs/proc/base.c index c7c64272b0fa..0477f8b51182 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = { DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations), DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), @@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = { static const struct pid_entry tid_base_stuff[] = { DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 81882a13212d..94a57e178801 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = { .iterate_shared = proc_readfdinfo, .llseek = generic_file_llseek, }; + +static int proc_fdmount_link(struct dentry *dentry, struct path *path) +{ 
+ struct files_struct *files = NULL; + struct task_struct *task; + struct path fd_path; + int ret = -ENOENT; + + task = get_proc_task(d_inode(dentry)); + if (task) { + files = get_files_struct(task); + put_task_struct(task); + } + + if (files) { + unsigned int fd = proc_fd(d_inode(dentry)); + struct file *fd_file; + + spin_lock(&files->file_lock); + fd_file = fcheck_files(files, fd); + if (fd_file) { + fd_path = fd_file->f_path; + path_get(&fd_path); + ret = 0; + } + spin_unlock(&files->file_lock); + put_files_struct(files); + } + if (!ret) { + ret = mountfs_lookup_internal(fd_path.mnt, path); + path_put(&fd_path); + } + + return ret; +} + +static struct dentry *proc_fdmount_instantiate(struct dentry *dentry, + struct task_struct *task, const void *ptr) +{ + const struct fd_data *data = ptr; + struct proc_inode *ei; + struct inode *inode; + + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400); + if (!inode) + return ERR_PTR(-ENOENT); + + ei = PROC_I(inode); + ei->fd = data->fd; + + inode->i_op = &proc_pid_link_inode_operations; + inode->i_size = 64; + + ei->op.proc_get_link = proc_fdmount_link; + tid_fd_update_inode(task, inode, 0); + + d_set_d_op(dentry, &tid_fd_dentry_operations); + return d_splice_alias(inode, dentry); +} + +static struct dentry * +proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags) +{ + return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate); +} + +static int proc_readfdmount(struct file *file, struct dir_context *ctx) +{ + return proc_readfd_common(file, ctx, + proc_fdmount_instantiate); +} + +const struct inode_operations proc_fdmount_inode_operations = { + .lookup = proc_lookupfdmount, + .setattr = proc_setattr, +}; + +const struct file_operations proc_fdmount_operations = { + .read = generic_read_dir, + .iterate_shared = proc_readfdmount, + .llseek = generic_file_llseek, +}; diff --git a/fs/proc/fd.h b/fs/proc/fd.h index f371a602bf58..9e087c833e65 100644 --- a/fs/proc/fd.h +++ 
b/fs/proc/fd.h @@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations; extern const struct file_operations proc_fdinfo_operations; extern const struct inode_operations proc_fdinfo_inode_operations; +extern const struct file_operations proc_fdmount_operations; +extern const struct inode_operations proc_fdmount_inode_operations; + extern int proc_fd_permission(struct inode *inode, int mask); static inline unsigned int proc_fd(struct inode *inode) diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 273ee82d8aa9..e634faa9160e 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb) return security_sb_show_options(m, sb); } -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) -{ - static const struct proc_fs_info mnt_info[] = { - { MNT_NOSUID, ",nosuid" }, - { MNT_NODEV, ",nodev" }, - { MNT_NOEXEC, ",noexec" }, - { MNT_NOATIME, ",noatime" }, - { MNT_NODIRATIME, ",nodiratime" }, - { MNT_RELATIME, ",relatime" }, - { 0, NULL } - }; - const struct proc_fs_info *fs_infop; - - for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) { - if (mnt->mnt_flags & fs_infop->flag) - seq_puts(m, fs_infop->str); - } -} static inline void mangle(struct seq_file *m, const char *s) { @@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) err = show_sb_opts(m, sb); if (err) goto out; - show_mnt_opts(m, mnt); + seq_mnt_opts(m, mnt->mnt_flags); if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt_path.dentry); seq_puts(m, " 0 0\n"); @@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt) goto out; seq_puts(m, mnt->mnt_flags & MNT_READONLY ? 
" ro" : " rw"); - show_mnt_opts(m, mnt); + seq_mnt_opts(m, mnt->mnt_flags); /* Tagged fields ("foo:X" or "bar") */ if (IS_MNT_SHARED(r)) diff --git a/fs/seq_file.c b/fs/seq_file.c index 1600034a929b..9726baba1732 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -15,6 +15,7 @@ #include <linux/cred.h> #include <linux/mm.h> #include <linux/printk.h> +#include <linux/mount.h> #include <linux/string_helpers.h> #include <linux/uaccess.h> @@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc) } EXPORT_SYMBOL(seq_dentry); +void seq_mnt_opts(struct seq_file *m, int mnt_flags) +{ + unsigned int i; + static const struct { + int flag; + const char *str; + } mnt_info[] = { + { MNT_NOSUID, ",nosuid" }, + { MNT_NODEV, ",nodev" }, + { MNT_NOEXEC, ",noexec" }, + { MNT_NOATIME, ",noatime" }, + { MNT_NODIRATIME, ",nodiratime" }, + { MNT_RELATIME, ",relatime" }, + { 0, NULL } + }; + + for (i = 0; mnt_info[i].flag; i++) { + if (mnt_flags & mnt_info[i].flag) + seq_puts(m, mnt_info[i].str); + } +} + static void *single_start(struct seq_file *p, loff_t *pos) { return NULL + (*pos == 0); diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 770c2bf3aa43..9dd7812eb777 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *); int seq_dentry(struct seq_file *, struct dentry *, const char *); int seq_path_root(struct seq_file *m, const struct path *path, const struct path *root, const char *esc); +void seq_mnt_opts(struct seq_file *m, int mnt_flags); int single_open(struct file *, int (*)(struct seq_file *, void *), void *); int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t); diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile index 19be60ab950e..78deb8483d27 100644 --- a/samples/vfs/Makefile +++ b/samples/vfs/Makefile @@ -4,6 +4,7 @@ hostprogs := \ test-fsinfo \ test-fsmount \ + test-fsinfo-perf \ 
test-mntinfo \ test-statx @@ -12,6 +13,7 @@ always-y := $(hostprogs) HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include HOSTLDLIBS_test-fsinfo += -static -lm HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include +HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c new file mode 100644 index 000000000000..fba40737f768 --- /dev/null +++ b/samples/vfs/test-fsinfo-perf.c @@ -0,0 +1,361 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* Test the fsinfo() system call + * + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + */ + +#define _GNU_SOURCE +#define _ATFILE_SOURCE +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <unistd.h> +#include <ctype.h> +#include <errno.h> +#include <time.h> +#include <math.h> +#include <fcntl.h> +#include <sys/syscall.h> +#include <sys/stat.h> +#include <sys/mount.h> +#include <sys/time.h> +#include <linux/fsinfo.h> + +#ifndef __NR_fsinfo +#define __NR_fsinfo -1 +#endif + +#define ERR(ret, what) do { if ((long)(ret) == -1) { perror(what); exit(1); } } while(0) +#define OOM(ret) do { if (!(ret)) { perror(NULL); exit(1); } } while(0) + +static int nr_mounts = 3; +static const char *base_path; + +static __attribute__((unused)) +ssize_t fsinfo(int dfd, const char *filename, + struct fsinfo_params *params, size_t params_size, + void *result_buffer, size_t result_buf_size) +{ + return syscall(__NR_fsinfo, dfd, filename, + params, params_size, + result_buffer, result_buf_size); +} + +static void iterate(void (*func)(int i, const char *)) +{ + char name[4096]; + int i; + + for (i = 0; i < nr_mounts; i++) { + sprintf(name, "%s/%d", base_path, i); + func(i, name); + } +} + +static void make_mount(int ix, const char *path) +{ + 
ERR(mkdir(path, 0755), "mkdir"); + ERR(mount("none", path, "tmpfs", 0, NULL), "mount"); + ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount"); +} + +static void do_umount(void) +{ + printf("--- umount ---\n"); + if (umount2(base_path, MNT_DETACH) == -1) + perror("umount"); +} + +static unsigned long sum_mnt_id; + +static void get_mntid_by_fsinfo(int ix, const char *path) +{ + struct fsinfo_mount_info r; + struct fsinfo_params params = { + .flags = FSINFO_FLAGS_QUERY_PATH, + .request = FSINFO_ATTR_MOUNT_INFO, + }; + + ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)), + "fsinfo"); + //printf("[%u] %u\n", ix, r.mnt_id); + sum_mnt_id += r.mnt_id; +} + +static void get_mntid_by_proc(int ix, const char *path) +{ + unsigned int mnt_id; + ssize_t len; + char procfile[100], buffer[4096], *p, *nl; + int fd, fd2; + + fd = open(path, O_PATH); + ERR(fd, "open/path"); + sprintf(procfile, "/proc/self/fdinfo/%u", fd); + fd2 = open(procfile, O_RDONLY); + ERR(fd2, "open/proc"); + len = read(fd2, buffer, sizeof(buffer) - 1); + ERR(len, "read"); + buffer[len] = 0; + close(fd2); + close(fd); + + p = buffer; + do { + nl = strchr(p, '\n'); + if (nl) + *nl++ = '\0'; + else + nl = NULL; + + if (strncmp(p, "mnt_id:", 7) != 0) + continue; + p += 7; + while (isblank(*p)) + p++; + /* Have to allow for extra numbers being added to the line */ + if (sscanf(p, "%u", &mnt_id) != 1) { + fprintf(stderr, "Bad format %s\n", procfile); + exit(3); + } + break; + + } while ((p = nl)); + + if (!p) { + fprintf(stderr, "Missing field %s\n", procfile); + exit(3); + } + + sum_mnt_id += mnt_id; + //printf("[%u] %u\n", ix, mnt_id); +} + +static void get_mntid_by_fsinfo_2(void) +{ + struct fsinfo_mount_child *children, *c, *end; + struct fsinfo_mount_info r; + struct fsinfo_params params = { + .flags = FSINFO_FLAGS_QUERY_PATH, + .request = FSINFO_ATTR_MOUNT_INFO, + }; + unsigned int base_mnt_id; + size_t s_children, n_children; + char name[32]; + int i; + + /* Convert path to mount ID */ +
ERR(fsinfo(AT_FDCWD, base_path, ¶ms, sizeof(params), &r, sizeof(r)), + "fsinfo/base"); + base_mnt_id = r.mnt_id; + //printf("[B] %u\n", base_mnt_id); + + /* Get a list of all the children of this mount ID */ + s_children = (nr_mounts + 1) * sizeof(*children); + children = malloc(s_children); + OOM(children); + + params.flags = FSINFO_FLAGS_QUERY_MOUNT; + params.request = FSINFO_ATTR_MOUNT_CHILDREN; + sprintf(name, "%u", base_mnt_id); + s_children = fsinfo(AT_FDCWD, name, ¶ms, sizeof(params), children, s_children); + ERR(s_children, "fsinfo/children"); + + /* Query each child */ + n_children = s_children / sizeof(*c) - 1; // Parent is added at end + c = children; + end = c + n_children; + for (i = 0; c < end; c++, i++) { + //printf("[%u] %u\n", i, c->mnt_id); + params.flags = FSINFO_FLAGS_QUERY_MOUNT; + params.request = FSINFO_ATTR_MOUNT_INFO; + sprintf(name, "%u", c->mnt_id); + ERR(fsinfo(AT_FDCWD, name, ¶ms, sizeof(params), &r, sizeof(r)), + "fsinfo/child"); + sum_mnt_id += r.mnt_id; + } +} + +static void get_mntid_by_mountfs(void) +{ + unsigned int base_mnt_id, mnt_id, x; + ssize_t len, s_children; + char procfile[100], buffer[100], *children, *p, *q, *nl, *comma; + int fd, fd2, mntfd, i; + + /* Start off by reading the mount ID from the base path */ + fd = open(base_path, O_PATH); + ERR(fd, "open/path"); + sprintf(procfile, "/proc/self/fdinfo/%u", fd); + fd2 = open(procfile, O_RDONLY); + ERR(fd2, "open/proc"); + len = read(fd2, buffer, sizeof(buffer) - 1); + ERR(len, "read"); + buffer[len] = 0; + close(fd2); + close(fd); + + p = buffer; + do { + nl = strchr(p, '\n'); + if (nl) + *nl++ = '\0'; + else + nl = NULL; + + if (strncmp(p, "mnt_id:", 7) != 0) + continue; + p += 7; + while (isblank(*p)) + p++; + /* Have to allow for extra numbers being added to the line */ + if (sscanf(p, "%u", &base_mnt_id) != 1) { + fprintf(stderr, "Bad format %s\n", procfile); + exit(3); + } + break; + + } while ((p = nl)); + + if (!p) { + fprintf(stderr, "Missing field %s\n", 
procfile); + exit(3); + } + + if (0) printf("[B] %u\n", base_mnt_id); + + mntfd = open("/mnt", O_PATH); + ERR(fd, "open/mountfs"); + + /* Get a list of all the children of this mount ID */ + s_children = (nr_mounts) * 12; + children = malloc(s_children); + OOM(children); + + sprintf(procfile, "%u/children", base_mnt_id); + fd = openat(mntfd, procfile, O_RDONLY); + ERR(fd, "open/children"); + s_children = read(fd, children, s_children - 1); + ERR(s_children, "read/children"); + close(fd); + if (s_children > 0 && children[s_children - 1] == '\n') + s_children--; + children[s_children] = 0; + + /* Query each child */ + p = children; + if (!*p) + return; + i = 0; + do { + mnt_id = strtoul(p, &comma, 10); + if (*comma) { + if (*comma != ',') { + fprintf(stderr, "Bad format in mountfs-children\n"); + exit(3); + } + comma++; + } + + sprintf(procfile, "%u/id", mnt_id); + fd = openat(mntfd, procfile, O_RDONLY); + ERR(fd, procfile); + len = read(fd, buffer, sizeof(buffer) - 1); + ERR(len, "read/id"); + close(fd); + if (len > 0 && buffer[len - 1] == '\n') + len--; + buffer[len] = 0; + + x = strtoul(buffer, &q, 10); + + if (*q) { + fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer); + exit(3); + } + + if (0) printf("[%u] %u\n", i++, x); + sum_mnt_id += x; + } while (p = comma, *comma); +} + +static unsigned long duration(struct timeval *before, struct timeval *after) +{ + unsigned long a, b; + + a = after->tv_sec * 1000000 + after->tv_usec; + b = before->tv_sec * 1000000 + before->tv_usec; + return a - b; +} + +int main(int argc, char **argv) +{ + struct timeval f_before, f_after; + struct timeval f2_before, f2_after; + struct timeval p_before, p_after; + struct timeval p2_before, p2_after; + const char *path; + unsigned long f_dur, f2_dur, p_dur, p2_dur; + + if (argc < 2) { + fprintf(stderr, "Format: %s <path> [nr_mounts]\n", argv[0]); + exit(2); + } + + if (argc == 3) + nr_mounts = atoi(argv[2]); + + path = argv[1]; + ERR(mount("none", path, "tmpfs", 0, NULL), 
"mount"); + ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount"); + base_path = path; + atexit(do_umount); + + printf("--- make mounts ---\n"); + iterate(make_mount); + + printf("--- test fsinfo by path ---\n"); + sum_mnt_id = 0; + ERR(gettimeofday(&f_before, NULL), "gettimeofday"); + iterate(get_mntid_by_fsinfo); + ERR(gettimeofday(&f_after, NULL), "gettimeofday"); + printf("sum(mnt_id) = %lu\n", sum_mnt_id); + + printf("--- test fsinfo by mnt_id ---\n"); + sum_mnt_id = 0; + ERR(gettimeofday(&f2_before, NULL), "gettimeofday"); + get_mntid_by_fsinfo_2(); + ERR(gettimeofday(&f2_after, NULL), "gettimeofday"); + printf("sum(mnt_id) = %lu\n", sum_mnt_id); + + printf("--- test /proc/fdinfo ---\n"); + sum_mnt_id = 0; + ERR(gettimeofday(&p_before, NULL), "gettimeofday"); + iterate(get_mntid_by_proc); + ERR(gettimeofday(&p_after, NULL), "gettimeofday"); + printf("sum(mnt_id) = %lu\n", sum_mnt_id); + + printf("--- test mountfs ---\n"); + sum_mnt_id = 0; + ERR(gettimeofday(&p2_before, NULL), "gettimeofday"); + get_mntid_by_mountfs(); + ERR(gettimeofday(&p2_after, NULL), "gettimeofday"); + printf("sum(mnt_id) = %lu\n", sum_mnt_id); + + f_dur = duration(&f_before, &f_after); + f2_dur = duration(&f2_before, &f2_after); + p_dur = duration(&p_before, &p_after); + p2_dur = duration(&p2_before, &p2_after); + //printf("fsinfo duration %10luus for %d mounts\n", f_dur, nr_mounts); + //printf("procfd duration %10luus for %d mounts\n", p_dur, nr_mounts); + + printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2\n", + nr_mounts, f_dur, f2_dur, p_dur, p2_dur, + (double)p_dur / (double)f_dur, + (double)p_dur / (double)f2_dur, + (double)p_dur / (double)p2_dur); + return 0; +} ^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-03-31 21:14 UTC
To: Miklos Szeredi
Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> So even the p2 method will give at least 80k queries/s, which is quite
> good, considering that the need to rescan the complete mount tree
> should be exceedingly rare (and in case it mattered, could be
> optimized by priming from /proc/self/mountinfo).

One thing to note is that the test is actually a little biased in favour of
the "p" test, where the mnt_id is looked up by path from /proc/fdinfo.  That's
not all that useful, except as an index into mountfs.  I'm not sure how much
use it is as a check on whether the mount is the same mount or not since mount
IDs can get reused.

If I instead use the parent_id all round as the desired target value, I then
see:

  For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
  For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id

Some observations:

 (1) fsinfo() gets a bit slower, reflecting the extra locking that must be
     done to access the topology information (it's using a different
     attribute).

 (2) Going via /proc/fdinfo now includes a further access into mountfs - and
     this makes the access ~2.3x slower than it was before.

 (3) Going via mount ID directly into mountfs (the "p2" test) appears faster
     than it did (when it shouldn't have changed), though it's still slower
     than fsinfo.  This I ascribe to the caching of the inode and dentry from
     the "p" test.

The attached patch adjusts the test program.

David
---
commit e9844e27f3061e4ef2d1511786b5ea60338dc610
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 21:14:58 2020 +0100

    Get parent ID instead

diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index fba40737f768..2bcde06ee78b 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -69,27 +69,27 @@ static void do_umount(void)
 		perror("umount");
 }
 
-static unsigned long sum_mnt_id;
+static unsigned long sum_check, sum_check_2;
 
-static void get_mntid_by_fsinfo(int ix, const char *path)
+static void get_id_by_fsinfo(int ix, const char *path)
 {
-	struct fsinfo_mount_info r;
+	struct fsinfo_mount_topology r;
 	struct fsinfo_params params = {
 		.flags = FSINFO_FLAGS_QUERY_PATH,
-		.request = FSINFO_ATTR_MOUNT_INFO,
+		.request = FSINFO_ATTR_MOUNT_TOPOLOGY,
 	};
 
 	ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)),
 	    "fsinfo");
-	//printf("[%u] %u\n", ix, r.mnt_id);
-	sum_mnt_id += r.mnt_id;
+	sum_check += r.parent_id;
+	sum_check_2 += r.mnt_topology_changes;
 }
 
-static void get_mntid_by_proc(int ix, const char *path)
+static void get_id_by_proc(int ix, const char *path)
 {
-	unsigned int mnt_id;
+	unsigned int mnt_id, x;
 	ssize_t len;
-	char procfile[100], buffer[4096], *p, *nl;
+	char procfile[100], buffer[4096], *p, *q, *nl;
 	int fd, fd2;
 
 	fd = open(path, O_PATH);
@@ -130,12 +130,31 @@ static void get_mntid_by_proc(int ix, const char *path)
 		exit(3);
 	}
 
-	sum_mnt_id += mnt_id;
-	//printf("[%u] %u\n", ix, mnt_id);
+	/* Now look the ID up on mountfs */
+	sprintf(procfile, "/mnt/%u/parent", mnt_id);
+	fd = open(procfile, O_RDONLY);
+	ERR(fd, procfile);
+	len = read(fd, buffer, sizeof(buffer) - 1);
+	ERR(len, "read/parent");
+	close(fd);
+	if (len > 0 && buffer[len - 1] == '\n')
+		len--;
+	buffer[len] = 0;
+
+	x = strtoul(buffer, &q, 10);
+
+	if (*q) {
+		fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+		exit(3);
+	}
+
+	sum_check += x;
+	//printf("[%u] %u\n", ix, x);
 }
 
-static void get_mntid_by_fsinfo_2(void)
+static void get_id_by_fsinfo_2(void)
 {
+	struct fsinfo_mount_topology t;
 	struct fsinfo_mount_child *children, *c, *end;
 	struct fsinfo_mount_info r;
 	struct fsinfo_params params = {
@@ -171,15 +190,16 @@ static void get_mntid_by_fsinfo_2(void)
 	for (i = 0; c < end; c++, i++) {
 		//printf("[%u] %u\n", i, c->mnt_id);
 		params.flags = FSINFO_FLAGS_QUERY_MOUNT;
-		params.request = FSINFO_ATTR_MOUNT_INFO;
+		params.request = FSINFO_ATTR_MOUNT_TOPOLOGY;
 		sprintf(name, "%u", c->mnt_id);
-		ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &r, sizeof(r)),
+		ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &t, sizeof(t)),
 		    "fsinfo/child");
-		sum_mnt_id += r.mnt_id;
+		sum_check += t.parent_id;
+		sum_check_2 += t.mnt_topology_changes;
 	}
 }
 
-static void get_mntid_by_mountfs(void)
+static void get_id_by_mountfs(void)
 {
 	unsigned int base_mnt_id, mnt_id, x;
 	ssize_t len, s_children;
@@ -260,11 +280,11 @@ static void get_mntid_by_mountfs(void)
 			comma++;
 		}
 
-		sprintf(procfile, "%u/id", mnt_id);
+		sprintf(procfile, "%u/parent", mnt_id);
 		fd = openat(mntfd, procfile, O_RDONLY);
 		ERR(fd, procfile);
 		len = read(fd, buffer, sizeof(buffer) - 1);
-		ERR(len, "read/id");
+		ERR(len, "read/parent");
 		close(fd);
 		if (len > 0 && buffer[len - 1] == '\n')
 			len--;
@@ -278,7 +298,7 @@ static void get_mntid_by_mountfs(void)
 		}
 
 		if (0) printf("[%u] %u\n", i++, x);
-		sum_mnt_id += x;
+		sum_check += x;
 	} while (p = comma, *comma);
 }
 
@@ -318,32 +338,32 @@ int main(int argc, char **argv)
 	iterate(make_mount);
 
 	printf("--- test fsinfo by path ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&f_before, NULL), "gettimeofday");
-	iterate(get_mntid_by_fsinfo);
+	iterate(get_id_by_fsinfo);
 	ERR(gettimeofday(&f_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test fsinfo by mnt_id ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&f2_before, NULL), "gettimeofday");
-	get_mntid_by_fsinfo_2();
+	get_id_by_fsinfo_2();
 	ERR(gettimeofday(&f2_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test /proc/fdinfo ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&p_before, NULL), "gettimeofday");
-	iterate(get_mntid_by_proc);
+	iterate(get_id_by_proc);
 	ERR(gettimeofday(&p_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test mountfs ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&p2_before, NULL), "gettimeofday");
-	get_mntid_by_mountfs();
+	get_id_by_mountfs();
 	ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	f_dur = duration(&f_before, &f_after);
 	f2_dur = duration(&f2_before, &f2_after);
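The test program brackets each run with gettimeofday(), which follows the wall clock and can jump if the system time is adjusted mid-run. A minimal sketch of the same before/after arithmetic on struct timespec samples, such as those returned by clock_gettime(CLOCK_MONOTONIC, ...), might look like this. elapsed_us() is a hypothetical helper and is not part of the posted patch.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper (not in the posted patch): microseconds elapsed
 * between two samples, where 'after' is not earlier than 'before'.
 */
static unsigned long long elapsed_us(const struct timespec *before,
				     const struct timespec *after)
{
	long long ns;

	/* Work in nanoseconds so a negative tv_nsec difference borrows correctly. */
	ns = ((long long)after->tv_sec - (long long)before->tv_sec) * 1000000000LL;
	ns += (long long)after->tv_nsec - (long long)before->tv_nsec;
	assert(ns >= 0);	/* a monotonic clock should not run backwards */
	return (unsigned long long)ns / 1000;
}
```

Each gettimeofday(&x, NULL) call would then become clock_gettime(CLOCK_MONOTONIC, &x) on a struct timespec, with elapsed_us() standing in for duration().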
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-03-31 21:23 UTC
To: Miklos Szeredi
Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> > So even the p2 method will give at least 80k queries/s, which is quite
> > good, considering that the need to rescan the complete mount tree
> > should be exceedingly rare (and in case it mattered, could be
> > optimized by priming from /proc/self/mountinfo).
>
> One thing to note is that the test is actually a little biased in favour of
> the "p" test, where the mnt_id is looked up by path from /proc/fdinfo.  That's
> not all that useful, except as an index into mountfs.  I'm not sure how much
> use it is as a check on whether the mount is the same mount or not since mount
> IDs can get reused.

However, to deal with an overrun, you're going to have to read multiple
attributes.  So I've added an attribute file to expose the topology change
counter and it now reads that as well.

  For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
  For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id
  For 10000 mounts, f=24621us f2=20528us p=320164us p2=111416us <-- counter

Probably unsurprisingly, this doesn't affect fsinfo() significantly since I've
tried to expose the change counters in relevant places.  It does, however,
significantly affect mountfs because you seem to want every value to be
exposed through its own file.

Now this can be worked around by having files that bundle up several values
that are of interest to a particular operation (e.g. rescanning after a
notification queue overrun).

See the attached additional patch.  Note that the

	sum_check_2 += r.mnt_topology_changes;

bits in the fsinfo() tests accidentally got left in the preceding patch and so
aren't in this one.

David
---
commit 6c62787aec41f67c1d5a55a0d59578854bcef6f8
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 21:53:11 2020 +0100

    Add a mountfs file to export the topology counter

diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
index 82c01eb6154d..58c05feb4fdd 100644
--- a/fs/mountfs/super.c
+++ b/fs/mountfs/super.c
@@ -22,7 +22,7 @@ struct mountfs_entry {
 
 static const char *mountfs_attrs[] = {
 	"root", "mountpoint", "id", "parent", "options", "children",
-	"group", "master", "propagate_from"
+	"group", "master", "propagate_from", "counter"
 };
 
 #define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
@@ -128,6 +128,8 @@ static int mountfs_attr_show(struct seq_file *sf, void *v)
 			if (tmp)
 				seq_printf(sf, "%i\n", tmp);
 		}
+	} else if (strcmp(name, "counter") == 0) {
+		seq_printf(sf, "%u\n", atomic_read(&mnt->mnt_topology_changes));
 	} else {
 		WARN_ON(1);
 		err = -EIO;
diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index 2bcde06ee78b..2b7606a53c2d 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -149,6 +149,26 @@ static void get_id_by_proc(int ix, const char *path)
 	}
 
 	sum_check += x;
+
+	/* And now the topology change counter */
+	sprintf(procfile, "/mnt/%u/counter", mnt_id);
+	fd = open(procfile, O_RDONLY);
+	ERR(fd, procfile);
+	len = read(fd, buffer, sizeof(buffer) - 1);
+	ERR(len, "read/counter");
+	close(fd);
+	if (len > 0 && buffer[len - 1] == '\n')
+		len--;
+	buffer[len] = 0;
+
+	x = strtoul(buffer, &q, 10);
+
+	if (*q) {
+		fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+		exit(3);
+	}
+
+	sum_check_2 += x;
 	//printf("[%u] %u\n", ix, x);
 }
 
@@ -204,7 +224,7 @@ static void get_id_by_mountfs(void)
 	unsigned int base_mnt_id, mnt_id, x;
 	ssize_t len, s_children;
 	char procfile[100], buffer[100], *children, *p, *q, *nl, *comma;
-	int fd, fd2, mntfd, i;
+	int fd, fd2, mntfd;
 
 	/* Start off by reading the mount ID from the base path */
 	fd = open(base_path, O_PATH);
@@ -269,7 +289,6 @@ static void get_id_by_mountfs(void)
 	p = children;
 	if (!*p)
 		return;
-	i = 0;
 	do {
 		mnt_id = strtoul(p, &comma, 10);
 		if (*comma) {
@@ -297,8 +316,26 @@ static void get_id_by_mountfs(void)
 			exit(3);
 		}
 
-		if (0) printf("[%u] %u\n", i++, x);
 		sum_check += x;
+
+		sprintf(procfile, "%u/counter", mnt_id);
+		fd = openat(mntfd, procfile, O_RDONLY);
+		ERR(fd, procfile);
+		len = read(fd, buffer, sizeof(buffer) - 1);
+		ERR(len, "read/counter");
+		close(fd);
+		if (len > 0 && buffer[len - 1] == '\n')
+			len--;
+		buffer[len] = 0;
+
+		x = strtoul(buffer, &q, 10);
+
+		if (*q) {
+			fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+			exit(3);
+		}
+
+		sum_check_2 += x;
 	} while (p = comma, *comma);
 }
 
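The patch above repeats the same open/read/strip-newline/strtoul sequence once for every per-value file, which is where the per-attribute cost of the one-value-per-file layout comes from. A minimal sketch of that sequence as a single helper follows. read_uint_at() is a hypothetical name that does not appear in the posted patches; only standard POSIX calls are used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one unsigned decimal value from the file
 * 'name', resolved relative to directory fd 'dfd' (or AT_FDCWD).
 * Returns 0 on success and -1 on any open, read or format error.
 */
static int read_uint_at(int dfd, const char *name, unsigned int *val)
{
	char buf[32], *end;
	unsigned long x;
	ssize_t len;
	int fd;

	assert(name && val);

	fd = openat(dfd, name, O_RDONLY);
	if (fd == -1)
		return -1;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return -1;

	/* Drop a single trailing newline, then NUL-terminate. */
	if (buf[len - 1] == '\n')
		len--;
	buf[len] = '\0';

	errno = 0;
	x = strtoul(buf, &end, 10);
	if (errno || end == buf || *end || x > UINT_MAX)
		return -1;

	*val = (unsigned int)x;
	return 0;
}
```

With a directory fd open on the mountfs root, the "parent" and "counter" reads would each become one call, but every call is still an open, a read and a close, so the syscall count per mount is unchanged.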
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-03-31 21:52 UTC
To: Christian Brauner
Cc: dhowells, torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel, lennart, cyphar

Christian Brauner <christian.brauner@ubuntu.com> wrote:

> querying all properties of a mount atomically all-at-once,

I don't actually offer that, per se.

Having an atomic all-at-once query for a single mount is actually quite a
burden on the system.  There's potentially a lot of state involved, much of
which you don't necessarily need.

I've tried to avoid the need to do that by adding change counters that can be
queried cheaply.  You read the counters, then you check mounts and superblocks
for which the counters have changed, and then you re-read the counters.  I've
added multiple counters, assigned to different purposes, to make it easier to
pin down what has changed - and so reduce the amount of checking required.

What I have added to fsinfo() is a way to atomically retrieve a list of all
the children of a mount, including, for each mount, the mount ID (which may
have been reused), a uniquifier (which shouldn't wrap over the kernel
lifetime) and the sum of the mount object and superblock change counters.

This should allow you to quickly rescan the mount tree as fsinfo() can look up
mounts by mount ID instead of by path or fd.

Below is a sample file from the kernel that scans by this method, displaying
an ascii art tree of all the mounts under a path or mount.

David
---
// SPDX-License-Identifier: GPL-2.0-or-later
/* Test the fsinfo() system call
 *
 * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 */

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <errno.h>
#include <time.h>
#include <math.h>
#include <sys/syscall.h>
#include <linux/fsinfo.h>
#include <linux/socket.h>
#include <linux/fcntl.h>
#include <sys/stat.h>
#include <arpa/inet.h>

#ifndef __NR_fsinfo
#define __NR_fsinfo -1
#endif

static __attribute__((unused))
ssize_t fsinfo(int dfd, const char *filename,
	       struct fsinfo_params *params, size_t params_size,
	       void *result_buffer, size_t result_buf_size)
{
	return syscall(__NR_fsinfo, dfd, filename,
		       params, params_size,
		       result_buffer, result_buf_size);
}

static char tree_buf[4096];
static char bar_buf[4096];
static unsigned int children_list_interval;

/*
 * Get an fsinfo attribute in a statically allocated buffer.
 */
static void get_attr(unsigned int mnt_id, unsigned int attr, unsigned int Nth,
		     void *buf, size_t buf_size)
{
	struct fsinfo_params params = {
		.flags = FSINFO_FLAGS_QUERY_MOUNT,
		.request = attr,
		.Nth = Nth,
	};
	char file[32];
	long ret;

	sprintf(file, "%u", mnt_id);

	memset(buf, 0xbd, buf_size);

	ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), buf, buf_size);
	if (ret == -1) {
		fprintf(stderr, "mount-%s: %m\n", file);
		exit(1);
	}
}

/*
 * Get an fsinfo attribute in a dynamically allocated buffer.
 */
static void *get_attr_alloc(unsigned int mnt_id, unsigned int attr,
			    unsigned int Nth, size_t *_size)
{
	struct fsinfo_params params = {
		.flags = FSINFO_FLAGS_QUERY_MOUNT,
		.request = attr,
		.Nth = Nth,
	};
	size_t buf_size = 4096;
	char file[32];
	void *r;
	long ret;

	sprintf(file, "%u", mnt_id);

	for (;;) {
		r = malloc(buf_size);
		if (!r) {
			perror("malloc");
			exit(1);
		}
		memset(r, 0xbd, buf_size);

		ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), r, buf_size);
		if (ret == -1) {
			fprintf(stderr, "mount-%s: %x,%x,%x %m\n",
				file, params.request, params.Nth, params.Mth);
			exit(1);
		}

		if (ret <= buf_size) {
			*_size = ret;
			break;
		}
		/* Buffer was too small: release it before retrying with a bigger one */
		free(r);
		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
	}

	return r;
}

/*
 * Display a mount and then recurse through its children.
 */
static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
{
	struct fsinfo_mount_topology top;
	struct fsinfo_mount_child child;
	struct fsinfo_mount_info info;
	struct fsinfo_ids ids;
	void *children;
	unsigned int d;
	size_t ch_size, p_size;
	char dev[64];
	int i, n, s;

	get_attr(mnt_id, FSINFO_ATTR_MOUNT_TOPOLOGY, 0, &top, sizeof(top));
	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, 0, &info, sizeof(info));
	get_attr(mnt_id, FSINFO_ATTR_IDS, 0, &ids, sizeof(ids));
	if (depth > 0)
		printf("%s", tree_buf);

	s = strlen(path);
	printf("%s", !s ? "\"\"" : path);
	if (!s)
		s += 2;
	s += depth;
	if (s < 38)
		s = 38 - s;
	else
		s = 1;
	printf("%*.*s", s, s, "");

	sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
	printf("%10u %8x %2x %x %5s %s",
	       info.mnt_id,
	       (info.sb_changes +
		info.sb_notifications +
		info.mnt_attr_changes +
		info.mnt_topology_changes +
		info.mnt_subtree_notifications),
	       info.attr, top.propagation,
	       dev, ids.f_fs_name);
	putchar('\n');

	children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
	n = ch_size / children_list_interval - 1;

	bar_buf[depth + 1] = '|';
	if (depth > 0) {
		tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
		tree_buf[depth - 4 + 2] = ' ';
	}

	tree_buf[depth + 0] = ' ';
	tree_buf[depth + 1] = '\\';
	tree_buf[depth + 2] = '_';
	tree_buf[depth + 3] = ' ';
	tree_buf[depth + 4] = 0;
	d = depth + 4;

	memset(&child, 0, sizeof(child));
	for (i = 0; i < n; i++) {
		void *p = children + i * children_list_interval;

		if (sizeof(child) >= children_list_interval)
			memcpy(&child, p, children_list_interval);
		else
			memcpy(&child, p, sizeof(child));

		if (i == n - 1)
			bar_buf[depth + 1] = ' ';
		path = get_attr_alloc(child.mnt_id, FSINFO_ATTR_MOUNT_POINT, 0, &p_size);
		display_mount(child.mnt_id, d, path + 1);
		free(path);
	}

	free(children);
	if (depth > 0) {
		tree_buf[depth - 4 + 1] = '\\';
		tree_buf[depth - 4 + 2] = '_';
	}
	tree_buf[depth] = 0;
}

/*
 * Find the ID of whatever is at the nominated path.
 */
static unsigned int lookup_mnt_by_path(const char *path)
{
	struct fsinfo_mount_info mnt;
	struct fsinfo_params params = {
		.flags = FSINFO_FLAGS_QUERY_PATH,
		.request = FSINFO_ATTR_MOUNT_INFO,
	};

	if (fsinfo(AT_FDCWD, path, &params, sizeof(params), &mnt, sizeof(mnt)) == -1) {
		perror(path);
		exit(1);
	}

	return mnt.mnt_id;
}

/*
 * Determine the element size for the mount child list.
 */
static unsigned int query_list_element_size(int mnt_id, unsigned int attr)
{
	struct fsinfo_attribute_info attr_info;

	get_attr(mnt_id, FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, attr,
		 &attr_info, sizeof(attr_info));
	return attr_info.size;
}

/*
 *
 */
int main(int argc, char **argv)
{
	unsigned int mnt_id;
	char *path;
	bool use_mnt_id = false;
	int opt;

	while ((opt = getopt(argc, argv, "m"))) {
		switch (opt) {
		case 'm':
			use_mnt_id = true;
			continue;
		}
		break;
	}

	argc -= optind;
	argv += optind;

	switch (argc) {
	case 0:
		mnt_id = lookup_mnt_by_path("/");
		path = "ROOT";
		break;
	case 1:
		path = argv[0];
		if (use_mnt_id) {
			mnt_id = strtoul(argv[0], NULL, 0);
			break;
		}

		mnt_id = lookup_mnt_by_path(argv[0]);
		break;
	default:
		printf("Format: test-mntinfo\n");
		printf("Format: test-mntinfo <path>\n");
		printf("Format: test-mntinfo -m <mnt_id>\n");
		exit(2);
	}

	children_list_interval =
		query_list_element_size(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN);

	printf("MOUNT                                 MOUNT ID   CHANGE#  AT P DEV   TYPE\n");
	printf("------------------------------------- ---------- -------- -- - ----- --------\n");
	display_mount(mnt_id, 0, path);
	return 0;
}
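The rescan scheme described in the message (read the counters, check whatever changed, re-read the counters) comes down to comparing two snapshots of a child list. A minimal userspace sketch of that comparison follows, assuming each snapshot entry carries a mount ID and a summed change counter. struct mnt_snap, snap_find() and find_changed() are hypothetical names, not fsinfo() structures or kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot record; not a kernel or fsinfo() structure. */
struct mnt_snap {
	unsigned int mnt_id;	/* mount ID (the kernel may reuse it) */
	unsigned int changes;	/* summed change counters when sampled */
};

/* Return the record for mnt_id in s[0..n), or NULL if it is absent. */
static const struct mnt_snap *snap_find(const struct mnt_snap *s, size_t n,
					unsigned int mnt_id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (s[i].mnt_id == mnt_id)
			return &s[i];
	return NULL;
}

/*
 * Compare two snapshots of one child list.  IDs that are new, or whose
 * counter differs from the old snapshot, are stored in changed[].
 * Returns the number of IDs stored (never more than max).
 */
static size_t find_changed(const struct mnt_snap *old_s, size_t n_old,
			   const struct mnt_snap *new_s, size_t n_new,
			   unsigned int *changed, size_t max)
{
	size_t i, n = 0;

	assert(changed || max == 0);

	for (i = 0; i < n_new && n < max; i++) {
		const struct mnt_snap *o =
			snap_find(old_s, n_old, new_s[i].mnt_id);

		if (!o || o->changes != new_s[i].changes)
			changed[n++] = new_s[i].mnt_id;
	}
	return n;
}
```

Only the mounts returned in changed[] would need their attributes re-queried. The sketch does not report mounts that disappeared, and it cannot tell a reused mount ID from the original mount; the uniquifier mentioned in the message is what would cover that case.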
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Karel Zak @ 2020-04-01 9:04 UTC
To: David Howells
Cc: Christian Brauner, torvalds, viro, dray, mszeredi, swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel, lennart, cyphar

On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> Christian Brauner <christian.brauner@ubuntu.com> wrote:
>
> > querying all properties of a mount atomically all-at-once,
>
> I don't actually offer that, per se.
>
> Having an atomic all-at-once query for a single mount is actually quite a
> burden on the system.  There's potentially a lot of state involved, much of
> which you don't necessarily need.

If all means "all possible attributes" then it is unnecessary; for
example, ext4 timestamps or volume uuid/label are rarely necessary.
We usually need the following together (as a consistent set):

    source
    mountpoint
    FS type
    FS root (FSINFO_ATTR_MOUNT_PATH)
    FS options (FSINFO_ATTR_CONFIGURATION)
    VFS attributes
    VFS propagation flags
    mount ID
    parent ID
    devno (or maj:min)

    Karel

--
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: Miklos Szeredi @ 2020-04-01 13:34 UTC
To: Karel Zak
Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

On Wed, Apr 1, 2020 at 11:05 AM Karel Zak <kzak@redhat.com> wrote:
>
> On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> wrote:
> >
> > > querying all properties of a mount atomically all-at-once,
> >
> > I don't actually offer that, per se.
> >
> > Having an atomic all-at-once query for a single mount is actually quite a
> > burden on the system.  There's potentially a lot of state involved, much of
> > which you don't necessarily need.
>
> If all means "all possible attributes" then it is unnecessary, for
> example ext4 timestamps or volume uuid/label are rarely necessary.
> We usually need together (as consistent set):
>
>     source
>     mountpoint
>     FS type
>     FS root (FSINFO_ATTR_MOUNT_PATH)
>     FS options (FSINFO_ATTR_CONFIGURATION)
>     VFS attributes
>     VFS propagation flags
>     mount ID
>     parent ID
>     devno (or maj:min)

This is trivial with mountfs (reuse format of /proc/PID/mountinfo):

# cat /mnt/30/info
30 20 0:14 / /mnt rw,relatime - mountfs none rw

Attached patch applies against readfile patch.

We might want something more generic, and it won't get any simpler:

 mount.h          |    1 +
 mountfs/super.c  |   12 +++++++++++-
 proc_namespace.c |    2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

Thanks,
Miklos

[-- Attachment #2: mountfs-info.patch --]
[-- Type: text/x-patch, Size: 1769 bytes --]

---
 fs/mount.h          |    1 +
 fs/mountfs/super.c  |   12 +++++++++++-
 fs/proc_namespace.c |    2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

--- a/fs/mount.h
+++ b/fs/mount.h
@@ -186,3 +186,4 @@ void mountfs_create(struct mount *mnt);
 extern void mountfs_remove(struct mount *mnt);
 int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt);
--- a/fs/mountfs/super.c
+++ b/fs/mountfs/super.c
@@ -22,7 +22,7 @@ struct mountfs_entry {
 
 static const char *mountfs_attrs[] = {
 	"root", "mountpoint", "id", "parent", "options", "children",
-	"group", "master", "propagate_from", "counter"
+	"group", "master", "propagate_from", "counter", "info"
 };
 
 #define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
@@ -126,11 +126,21 @@ static int mountfs_attr_show(struct seq_
 		if (IS_MNT_SLAVE(mnt)) {
 			get_fs_root(current->fs, &root);
 			tmp = get_dominating_id(mnt, &root);
+			path_put(&root);
 			if (tmp)
 				seq_printf(sf, "%i\n", tmp);
 		}
 	} else if (strcmp(name, "counter") == 0) {
 		seq_printf(sf, "%u\n", atomic_read(&mnt->mnt_topology_changes));
+	} else if (strcmp(name, "info") == 0) {
+		struct proc_mounts p = {};
+
+		WARN_ON(sf->private);
+		sf->private = &p;
+		get_fs_root(current->fs, &p.root);
+		err = show_mountinfo(sf, m);
+		path_put(&p.root);
+		sf->private = NULL;
 	} else {
 		WARN_ON(1);
 		err = -EIO;
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -110,7 +110,7 @@ static int show_vfsmnt(struct seq_file *
 	return err;
 }
 
-static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 {
 	struct proc_mounts *p = m->private;
 	struct mount *r = real_mount(mnt);
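The "info" line shown in the message reuses the /proc/PID/mountinfo layout: mount ID, parent ID, major:minor, root, mount point, mount options, zero or more optional fields, a "-" separator, then filesystem type, source and superblock options. A minimal sketch of parsing one such line in userspace follows, assuming no field exceeds 255 bytes and that spaces inside paths are escaped (as mountinfo does with \040). struct mountinfo_entry and parse_mountinfo_line() are hypothetical names, not part of any posted patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the fields of one mountinfo-style line. */
struct mountinfo_entry {
	unsigned int mnt_id, parent_id, dev_major, dev_minor;
	char root[256], mountpoint[256], vfs_opts[256];
	char fstype[64], source[256], sb_opts[256];
};

/*
 * Parse one line in /proc/PID/mountinfo format.  Optional fields
 * between the mount options and the "-" separator are skipped.
 * Returns 0 on success, -1 if the line does not match the layout.
 */
static int parse_mountinfo_line(const char *line, struct mountinfo_entry *e)
{
	const char *sep;

	assert(line && e);

	/* Fixed leading fields: IDs, device number, root, mount point, options. */
	if (sscanf(line, "%u %u %u:%u %255s %255s %255s",
		   &e->mnt_id, &e->parent_id, &e->dev_major, &e->dev_minor,
		   e->root, e->mountpoint, e->vfs_opts) != 7)
		return -1;

	/* Everything after the " - " separator: fs type, source, sb options. */
	sep = strstr(line, " - ");
	if (!sep)
		return -1;
	if (sscanf(sep + 3, "%63s %255s %255s",
		   e->fstype, e->source, e->sb_opts) != 3)
		return -1;
	return 0;
}
```

Applied to the example line from the message, this would yield mount ID 30, parent ID 20, device 0:14, root "/", mount point "/mnt" and filesystem type "mountfs".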
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-04-01 13:55 UTC
To: Miklos Szeredi
Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Attached patch applies against readfile patch.

But doesn't actually do what Karel asked for.  show_mountinfo() itself does
not give you what Karel asked for.  Plus there's more information you need to
add to it.

David
* Re: Upcoming: Notifications, FS notifications and fsinfo()
From: David Howells @ 2020-04-01 13:58 UTC
Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai

David Howells <dhowells@redhat.com> wrote:

> > Attached patch applies against readfile patch.
>
> But doesn't actually do what Karel asked for.  show_mountinfo() itself does
> not give you what Karel asked for.  Plus there's more information you need to
> add to it.

And arguably, it's worse than just reading /proc/mounts.  If you get a
notification that something changed (ie. you poll /proc/mounts or mount
notifications gives you an overrun) you now have to read *every*
/mountfs/*/info file.  That is way more expensive.

David
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 13:58 ` David Howells @ 2020-04-01 15:25 ` Miklos Szeredi 2020-04-03 9:11 ` Karel Zak 2020-04-01 16:01 ` David Howells 1 sibling, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-01 15:25 UTC (permalink / raw) To: David Howells Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai [-- Attachment #1: Type: text/plain, Size: 1653 bytes --] On Wed, Apr 1, 2020 at 3:58 PM David Howells <dhowells@redhat.com> wrote: > > David Howells <dhowells@redhat.com> wrote: > > > > Attached patch applies against readfile patch. > > > > But doesn't actually do what Karel asked for. show_mountinfo() itself does > > not give you what Karel asked for. Not sure what you mean. I think it shows precisely the information Karel asked for. > Plus there's more information you need to > > add to it. The mountinfo format is extensible (see Documentation/filesystems/proc.txt) so for example adding the change counters would be simple. > And arguably, it's worse than just reading /proc/mounts. If you get a > notification that something changed (ie. you poll /proc/mounts or mount > notifications gives you an overrun) you now have to read *every* > /mountfs/*/info file. That is way more expensive. fsinfo(2) will never be substantially cheaper than reading and parsing /mnt/MNT_ID/info. 
In fact reading a large part of the mount table using fsinfo(2) will be substantially slower than parsing /proc/self/mountinfo (this doesn't actually do the parsing but that would add a very small amount of overhead):

root@kvm:~# ./test-fsinfo-perf /tmp/a 30000
--- make mounts ---
--- test fsinfo by path ---
sum(mnt_id) = 960000
--- test fsinfo by mnt_id ---
sum(mnt_id) = 960000
--- test /proc/fdinfo ---
sum(mnt_id) = 960000
--- test mountfs ---
sum(mnt_id) = 960000
--- test mountinfo ---
sum(mnt_id) = 960000
For 30000 mounts, f= 154963us f2= 148337us p= 1803699us p2= 257019us; m= 53996us; p=11.6*f p=12.2*f2 p=7.0*p2 p=33.4*m
--- umount ---

Yes, that's 33 times faster!

Thanks,
Miklos

[-- Attachment #2: test-fsinfo-perf-mountinfo.patch --]
[-- Type: text/x-patch, Size: 3709 bytes --]

---
 samples/vfs/test-fsinfo-perf.c | 91 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 4 deletions(-)

--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -339,6 +339,79 @@ static void get_id_by_mountfs(void)
 	} while (p = comma, *comma);
 }
 
+static void get_id_by_mountinfo(void)
+{
+	unsigned int base_mnt_id, mnt_id, x;
+	ssize_t len;
+	char procfile[100], buffer[4096], *p, *nl;
+	int fd, fd2, mntfd;
+
+	/* Start off by reading the mount ID from the base path */
+	fd = open(base_path, O_PATH);
+	ERR(fd, "open/path");
+	sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+	fd2 = open(procfile, O_RDONLY);
+	ERR(fd2, "open/proc");
+	len = read(fd2, buffer, sizeof(buffer) - 1);
+	ERR(len, "read");
+	buffer[len] = 0;
+	close(fd2);
+	close(fd);
+
+	p = buffer;
+	do {
+		nl = strchr(p, '\n');
+		if (nl)
+			*nl++ = '\0';
+		else
+			nl = NULL;
+
+		if (strncmp(p, "mnt_id:", 7) != 0)
+			continue;
+		p += 7;
+		while (isblank(*p))
+			p++;
+		/* Have to allow for extra numbers being added to the line */
+		if (sscanf(p, "%u", &base_mnt_id) != 1) {
+			fprintf(stderr, "Bad format %s\n", procfile);
+			exit(3);
+		}
+		break;
+
+	} while ((p = nl));
+
+	if (!p) {
+		fprintf(stderr, "Missing field %s\n", procfile);
+		exit(3);
+	}
+
+	if (0) printf("[B] %u\n", base_mnt_id);
+
+	mntfd = open("/proc/self/mountinfo", O_RDONLY);
+	ERR(mntfd, "/proc/self/mountinfo");
+
+	while ((len = read(mntfd, buffer, sizeof(buffer)))) {
+		ERR(len, "read/mountinfo");
+
+		for (p = buffer; p < buffer + len; p = nl + 1) {
+			nl = strchr(p, '\n');
+			if (!nl) {
+				fprintf(stderr, "error parsing mountinfo\n");
+				exit(3);
+			}
+			*nl = '\0';
+			if (sscanf(p, "%i %i", &mnt_id, &x) != 2) {
+				fprintf(stderr, "error parsing mountinfo\n");
+				exit(3);
+			}
+			if (x == base_mnt_id)
+				sum_check += x;
+		}
+	}
+
+	close(mntfd);
+}
+
 static unsigned long duration(struct timeval *before, struct timeval *after)
 {
 	unsigned long a, b;
@@ -354,8 +427,9 @@ int main(int argc, char **argv)
 	struct timeval f2_before, f2_after;
 	struct timeval p_before, p_after;
 	struct timeval p2_before, p2_after;
+	struct timeval m_before, m_after;
 	const char *path;
-	unsigned long f_dur, f2_dur, p_dur, p2_dur;
+	unsigned long f_dur, f2_dur, p_dur, p2_dur, m_dur;
 
 	if (argc < 2) {
 		fprintf(stderr, "Format: %s <path> [nr_mounts]\n", argv[0]);
@@ -402,17 +476,26 @@ int main(int argc, char **argv)
 	ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
 	printf("sum(mnt_id) = %lu\n", sum_check);
 
+	printf("--- test mountinfo ---\n");
+	sum_check = 0;
+	ERR(gettimeofday(&m_before, NULL), "gettimeofday");
+	get_id_by_mountinfo();
+	ERR(gettimeofday(&m_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_check);
+
 	f_dur = duration(&f_before, &f_after);
 	f2_dur = duration(&f2_before, &f2_after);
 	p_dur = duration(&p_before, &p_after);
 	p2_dur = duration(&p2_before, &p2_after);
+	m_dur = duration(&m_before, &m_after);
 	//printf("fsinfo duration %10luus for %d mounts\n", f_dur, nr_mounts);
 	//printf("procfd duration %10luus for %d mounts\n", p_dur, nr_mounts);
 
-	printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2\n",
-	       nr_mounts, f_dur, f2_dur, p_dur, p2_dur,
+	printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; m=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2 p=%.1f*m\n",
+	       nr_mounts, f_dur, f2_dur, p_dur, p2_dur, m_dur,
 	       (double)p_dur / (double)f_dur,
 	       (double)p_dur / (double)f2_dur,
-	       (double)p_dur / (double)p2_dur);
+	       (double)p_dur / (double)p2_dur,
+	       (double)p_dur / (double)m_dur);
 	return 0;
 }

^ permalink raw reply [flat|nested] 99+ messages in thread
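For readers who want to try the mountinfo scan outside the benchmark, the following is a minimal standalone sketch of the same idea: read mountinfo-format text line by line and count the mounts whose parent ID matches a given value. It is not part of the patch above. The function name `count_child_mounts` is hypothetical, and it takes a `FILE *` (an assumption made here for testability) instead of opening /proc/self/mountinfo itself. Using getline() avoids relying on each read() returning whole lines.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper: count the mounts in mountinfo-format text whose
 * parent ID (second field) equals parent_id.
 * Returns the count, or -1 if a line does not start with two integers.
 */
long count_child_mounts(FILE *fp, int parent_id)
{
	char *line = NULL;
	size_t cap = 0;
	long count = 0;
	int mnt_id, parent;

	while (getline(&line, &cap, fp) != -1) {
		/* The first two fields are the mount ID and the parent ID. */
		if (sscanf(line, "%d %d", &mnt_id, &parent) != 2) {
			count = -1;
			break;
		}
		if (parent == parent_id)
			count++;
	}
	free(line);
	return count;
}
```

A caller would typically pass the result of fopen("/proc/self/mountinfo", "r") together with a mount ID obtained elsewhere.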
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 15:25 ` Miklos Szeredi @ 2020-04-03 9:11 ` Karel Zak 0 siblings, 0 replies; 99+ messages in thread From: Karel Zak @ 2020-04-03 9:11 UTC (permalink / raw) To: Miklos Szeredi Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Wed, Apr 01, 2020 at 05:25:54PM +0200, Miklos Szeredi wrote: > fsinfo(2) will never be substantially cheaper than reading and parsing > /mnt/MNT_ID/info. In fact reading a large part of the mount table > using fsinfo(2) will be substantially slower than parsing > /proc/self/mountinfo (this doesn't actually do the parsing but that > would add a very small amount of overhead): I think nobody wants to use fsinfo() or mountfs as replacement to whole /proc/self/mountinfo. It does not make sense. We need per-mountpoint API, for whole mount table use-cases (like findmnt or lsblk) the current mountinfo is good enough. Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 13:58 ` David Howells 2020-04-01 15:25 ` Miklos Szeredi @ 2020-04-01 16:01 ` David Howells 2020-04-01 16:30 ` Miklos Szeredi ` (2 more replies) 1 sibling, 3 replies; 99+ messages in thread From: David Howells @ 2020-04-01 16:01 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai Miklos Szeredi <miklos@szeredi.hu> wrote: > > > But doesn't actually do what Karel asked for. show_mountinfo() itself does > > > not give you what Karel asked for. > > Not sure what you mean. I think it shows precisely the information > Karel asked for. It's not atomic. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 16:01 ` David Howells @ 2020-04-01 16:30 ` Miklos Szeredi 2020-04-02 15:22 ` David Howells 2020-04-02 15:24 ` David Howells 2 siblings, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-01 16:30 UTC (permalink / raw) To: David Howells Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Wed, Apr 1, 2020 at 6:02 PM David Howells <dhowells@redhat.com> wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > But doesn't actually do what Karel asked for. show_mountinfo() itself does > > > > not give you what Karel asked for. > > > > Not sure what you mean. I think it shows precisely the information > > Karel asked for. > > It's not atomic. Yes it is. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 16:01 ` David Howells 2020-04-01 16:30 ` Miklos Szeredi @ 2020-04-02 15:22 ` David Howells 2020-04-02 15:24 ` Miklos Szeredi 2020-04-02 15:42 ` David Howells 2020-04-02 15:24 ` David Howells 2 siblings, 2 replies; 99+ messages in thread From: David Howells @ 2020-04-02 15:22 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai Miklos Szeredi <miklos@szeredi.hu> wrote: > > > Not sure what you mean. I think it shows precisely the information > > > Karel asked for. > > > > It's not atomic. > > Yes it is. No, it really isn't - though it could be made so. ext4_show_mount(), for example, doesn't lock against "mount -o remount", so the configuration can be changing whilst it's being rendered to text. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:22 ` David Howells @ 2020-04-02 15:24 ` Miklos Szeredi 2020-04-02 15:42 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-02 15:24 UTC (permalink / raw) To: David Howells Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai On Thu, Apr 2, 2020 at 5:23 PM David Howells <dhowells@redhat.com> wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > Not sure what you mean. I think it shows precisely the information > > > > Karel asked for. > > > > > > It's not atomic. > > > > Yes it is. > > No, it really isn't - though it could be made so. > > ext4_show_mount(), for example, doesn't lock against "mount -o remount", so > the configuration can be changing whilst it's being rendered to text. Does s_umount nest inside namespace_sem? I really don't see the relation of those locks. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:22 ` David Howells 2020-04-02 15:24 ` Miklos Szeredi @ 2020-04-02 15:42 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: David Howells @ 2020-04-02 15:42 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai Miklos Szeredi <miklos@szeredi.hu> wrote: > > ext4_show_mount(), for example, doesn't lock against "mount -o remount", so > > the configuration can be changing whilst it's being rendered to text. > > Does s_umount nest inside namespace_sem? I really don't see the > relation of those locks. If I understand aright what Al has told me, it's a bad idea to do any blocking operation inside of namespace_sem apart from kmalloc(GFP_KERNEL). David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 16:01 ` David Howells 2020-04-01 16:30 ` Miklos Szeredi 2020-04-02 15:22 ` David Howells @ 2020-04-02 15:24 ` David Howells 2 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-04-02 15:24 UTC (permalink / raw) Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Lennart Poettering, Aleksa Sarai David Howells <dhowells@redhat.com> wrote: > ext4_show_mount() ext4_show_options(), sorry. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-03-31 21:52 ` David Howells 2020-04-01 9:04 ` Karel Zak @ 2020-04-01 14:41 ` Lennart Poettering 2020-04-01 15:33 ` Miklos Szeredi 2020-04-01 16:06 ` David Howells 1 sibling, 2 replies; 99+ messages in thread From: Lennart Poettering @ 2020-04-01 14:41 UTC (permalink / raw) To: David Howells Cc: Christian Brauner, torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel, cyphar On Di, 31.03.20 22:52, David Howells (dhowells@redhat.com) wrote: > Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > querying all properties of a mount atomically all-at-once, > > I don't actually offer that, per se. > > Having an atomic all-at-once query for a single mount is actually quite a > burden on the system. There's potentially a lot of state involved, much of > which you don't necessarily need. Hmm, do it like with statx() and specify a mask for the fields userspace wants? Then it would be as lightweight as it possibly could be? Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
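The statx() pattern referred to here can be sketched as follows. This only illustrates how the existing statx(2) mask works (the caller names the fields it wants, and the kernel reports in `stx_mask` which ones it actually filled in); it says nothing about how fsinfo() or any mount-query interface would be shaped. The wrapper name `query_fields` is hypothetical, and the sketch assumes a glibc new enough to provide the statx() wrapper.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Hypothetical wrapper: ask statx(2) only for the fields named in `want`
 * (for example STATX_TYPE | STATX_SIZE).
 * Returns 0 on success, -1 on error with errno set by statx().
 * On success, out->stx_mask says which requested fields are valid.
 */
int query_fields(const char *path, unsigned int want, struct statx *out)
{
	return statx(AT_FDCWD, path, 0, want, out);
}
```

A caller would check, say, `out.stx_mask & STATX_SIZE` before trusting `out.stx_size`, since a filesystem may not support every requested field.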
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 14:41 ` Lennart Poettering @ 2020-04-01 15:33 ` Miklos Szeredi 2020-04-01 16:06 ` David Howells 1 sibling, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-01 15:33 UTC (permalink / raw) To: Lennart Poettering Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Wed, Apr 1, 2020 at 4:41 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Di, 31.03.20 22:52, David Howells (dhowells@redhat.com) wrote: > > > Christian Brauner <christian.brauner@ubuntu.com> wrote: > > > > > querying all properties of a mount atomically all-at-once, > > > > I don't actually offer that, per se. > > > > Having an atomic all-at-once query for a single mount is actually quite a > > burden on the system. There's potentially a lot of state involved, much of > > which you don't necessarily need. > > Hmm, do it like with statx() and specify a mask for the fields userspace > wants? Then it would be as lightweight as it possibly could be? Yes, however binary structures mixed with variable length fields are not going to be pretty. Again, if we want something even halfway sane for a syscall interface, go with a string key/value vector. If that's really needed. I've still not heard a convincing argument in favor of a syscall. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 14:41 ` Lennart Poettering 2020-04-01 15:33 ` Miklos Szeredi @ 2020-04-01 16:06 ` David Howells 2020-04-01 16:40 ` Miklos Szeredi 1 sibling, 1 reply; 99+ messages in thread From: David Howells @ 2020-04-01 16:06 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai Miklos Szeredi <miklos@szeredi.hu> wrote: > I've still not heard a convincing argument in favor of a syscall. From your own results, scanning 10000 mounts through mountfs and reading just two values from each is an order of magnitude slower without the effect of the dentry/inode caches. It gets faster on the second run because the mountfs dentries and inodes are cached - but at a cost of >205MiB of RAM. And it's *still* slower than fsinfo(). David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 16:06 ` David Howells @ 2020-04-01 16:40 ` Miklos Szeredi 2020-04-02 2:52 ` Ian Kent 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-01 16:40 UTC (permalink / raw) To: David Howells Cc: Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com> wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > I've still not heard a convincing argument in favor of a syscall. > > From your own results, scanning 10000 mounts through mountfs and reading just > two values from each is an order of magnitude slower without the effect of the > dentry/inode caches. It gets faster on the second run because the mountfs > dentries and inodes are cached - but at a cost of >205MiB of RAM. And it's > *still* slower than fsinfo(). Already told you that we can just delete the dentry on dput_final, so the memory argument is immaterial. And the speed argument also, because there's no use case where that would make a difference. You keep bringing up the notification queue overrun when watching a subtree, but that's going to be painful with fsinfo(2) as well. If that's a relevant use case (not saying it's true), might as well add a /mnt/MNT_ID/subtree_info (trivial again) that contains all information for the subtree. Have fun implementing that with fsinfo(2). Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-01 16:40 ` Miklos Szeredi @ 2020-04-02 2:52 ` Ian Kent 2020-04-02 13:52 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Ian Kent @ 2020-04-02 2:52 UTC (permalink / raw) To: Miklos Szeredi, David Howells Cc: Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote: > On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com> > wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > I've still not heard a convincing argument in favor of a syscall. > > > > From your own results, scanning 10000 mounts through mountfs and > > reading just > > two values from each is an order of magnitude slower without the > > effect of the > > dentry/inode caches. It gets faster on the second run because the > > mountfs > > dentries and inodes are cached - but at a cost of >205MiB of > > RAM. And it's > > *still* slower than fsinfo(). > > Already told you that we can just delete the dentry on dput_final, so > the memory argument is immaterial. > > And the speed argument also, because there's no use case where that > would make a difference. You keep bringing up the notification queue > overrun when watching a subtree, but that's going to be painful with > fsinfo(2) as well. If that's a relevant use case (not saying it's > true), might as well add a /mnt/MNT_ID/subtree_info (trivial again) > that contains all information for the subtree. Have fun implementing > that with fsinfo(2). Forgive me for not trawling through your patch to work this out but how does a poll on a path get what's needed to get mount info. Or, more specifically, how does one get what's needed to go directly to the place to get mount info. when something in the tree under the polled path changes (mount/umount). 
IIUC poll alone won't do subtree change monitoring? Don't get me wrong, neither the proc nor the fsinfo implementations deal with the notification storms that cause much of the problem we see now. IMHO that's a separate and very difficult problem in itself that can't even be considered until getting the information efficiently is resolved. Ian ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 2:52 ` Ian Kent @ 2020-04-02 13:52 ` Miklos Szeredi 2020-04-02 14:36 ` Lennart Poettering 2020-04-03 1:44 ` Ian Kent 0 siblings, 2 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-02 13:52 UTC (permalink / raw) To: Ian Kent Cc: David Howells, Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <raven@themaw.net> wrote: > > On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote: > > On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com> > > wrote: > > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > I've still not heard a convincing argument in favor of a syscall. > > > > > > From your own results, scanning 10000 mounts through mountfs and > > > reading just > > > two values from each is an order of magnitude slower without the > > > effect of the > > > dentry/inode caches. It gets faster on the second run because the > > > mountfs > > > dentries and inodes are cached - but at a cost of >205MiB of > > > RAM. And it's > > > *still* slower than fsinfo(). > > > > Already told you that we can just delete the dentry on dput_final, so > > the memory argument is immaterial. > > > > And the speed argument also, because there's no use case where that > > would make a difference. You keep bringing up the notification queue > > overrun when watching a subtree, but that's going to be painful with > > fsinfo(2) as well. If that's a relevant use case (not saying it's > > true), might as well add a /mnt/MNT_ID/subtree_info (trivial again) > > that contains all information for the subtree. Have fun implementing > > that with fsinfo(2). > > Forgive me for not trawling through your patch to work this out > but how does a poll on a path get what's needed to get mount info. 
> > Or, more specifically, how does one get what's needed to go directly > to the place to get mount info. when something in the tree under the > polled path changes (mount/umount). IIUC poll alone won't do subtree > change monitoring? The mechanisms are basically the same as with fsinfo(2). You can get to the mountfs entry through the mount ID or through a proc/fd/ type symlink. So if you have a path, there are two options: - find out the mount ID belonging to that path and go to /mountfs/$mntid/ - open the path with fd = open(path, O_PATH) and then go to /proc/self/fdmount/$fd/ Currently the only way to find the mount ID from a path is by parsing /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to return it directly from a path. Also the mount notification queue that David implemented contains the mount ID of the changed mount. > Don't get me wrong, neither the proc nor the fsinfo implementations > deal with the notification storms that cause much of the problem we > see now. > > IMHO that's a separate and very difficult problem in itself that > can't even be considered until getting the information efficiently > is resolved. This mount notification storm issue got me thinking. If I understand correctly, systemd wants mount notifications so that it can do the desktop pop-up thing. Is that correct? But that doesn't apply to automounts at all. A new mount performed by automount is uninteresting to desktops, since it's triggered by crossing the automount point (i.e. a normal path lookup), not an external event like inserting a usb stick, etc... Am I missing something? Maybe the solution is to just allow filtering out such notifications at the source, so automount triggers don't generate events for systemd. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
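The fdinfo route mentioned above (open the path with O_PATH, then parse /proc/self/fdinfo/$fd) can be sketched as a small helper. This is only an illustration of that existing mechanism. The function name `mount_id_of_path` is hypothetical, and the sketch assumes /proc is mounted and that fdinfo contains a "mnt_id:" line, as it does on current kernels.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the mount ID of the mount containing `path`
 * by parsing the "mnt_id:" line of /proc/self/fdinfo/<fd>.
 * Returns -1 if the path cannot be opened or the field is missing.
 */
int mount_id_of_path(const char *path)
{
	char procfile[64], line[256];
	int fd, mnt_id = -1;
	FILE *fp;

	/* O_PATH gives a handle on the location without opening the file. */
	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd < 0)
		return -1;

	snprintf(procfile, sizeof(procfile), "/proc/self/fdinfo/%d", fd);
	fp = fopen(procfile, "r");
	if (fp) {
		while (fgets(line, sizeof(line), fp)) {
			/* Non-matching lines leave mnt_id untouched. */
			if (sscanf(line, "mnt_id: %d", &mnt_id) == 1)
				break;
		}
		fclose(fp);
	}
	close(fd);
	return mnt_id;
}
```

The returned ID is the same value that appears in the first field of the corresponding /proc/self/mountinfo line.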
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 13:52 ` Miklos Szeredi @ 2020-04-02 14:36 ` Lennart Poettering 2020-04-02 15:22 ` Miklos Szeredi 2020-04-03 1:44 ` Ian Kent 1 sibling, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-02 14:36 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Do, 02.04.20 15:52, Miklos Szeredi (miklos@szeredi.hu) wrote: > > Don't get me wrong, neither the proc nor the fsinfo implementations > > deal with the notification storms that cause much of the problem we > > see now. > > > > IMHO that's a separate and very difficult problem in itself that > > can't even be considered until getting the information efficiently > > is resolved. > > This mount notification storm issue got me thinking. If I understand > correctly, systemd wants mount notifications so that it can do the > desktop pop-up thing. Is that correct? This has little to do with the desktop. Startup scheduling is mostly about figuring out when we can do the next step of startup, and to a large extent this means issuing a mount command of some form, then waiting until it is established, then invoking the next and so on, and when the right mounts are established, starting the right services that require them and so on. And with today's system complexity, with storage daemons and so on, this all becomes a complex network of concurrent dependencies. Most mounts are established on behalf of pid 1 itself; for those we could just wait until the mount syscall/command completes (and we do). But there are plenty of cases where that's not the case, hence we need to make sure we follow system mount table state as a whole, regardless of whether it's systemd itself that triggers some mount or something else (for example some shell script, udisks, …). 
> But that doesn't apply to automounts at all. A new mount performed by > automount is uninteresting to desktops, since it's triggered by > crossing the automount point (i.e. a normal path lookup), not an > external event like inserting a usb stick, etc... systemd does not propagate mount events to desktops. You appear to be thinking about the "udisks" project or so? Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 14:36 ` Lennart Poettering @ 2020-04-02 15:22 ` Miklos Szeredi 2020-04-02 15:28 ` Lennart Poettering ` (2 more replies) 0 siblings, 3 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-02 15:22 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > You appear to be thinking about the "udisks" project or so? Probably. The real question is: is there a sane way to filter mount notifications so that systemd receives only those which it is interested in, rather than the tens of thousands that for example autofs is managing and has nothing to do with systemd? Is there a specific mountpoint or mountpoints that systemd is waiting for? How exactly does this work? Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:22 ` Miklos Szeredi @ 2020-04-02 15:28 ` Lennart Poettering 2020-04-02 15:35 ` Miklos Szeredi 2020-04-02 15:51 ` David Howells 2020-04-02 15:56 ` David Howells 2 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-02 15:28 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Do, 02.04.20 17:22, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > You appear to be thinking about the "udisks" project or so? > > Probably. > > The real question is: is there a sane way to filter mount > notifications so that systemd receives only those which it is > interested in, rather than the tens of thousands that for example > autofs is managing and has nothing to do with systemd? systemd cares about all mount points in PID1's mount namespace. The fact that mount tables can grow large is why we want something better than constantly reparsing the whole /proc/self/mountinfo. But filtering subsets of that is something we don't really care about. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:28 ` Lennart Poettering @ 2020-04-02 15:35 ` Miklos Szeredi 2020-04-02 15:50 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-02 15:35 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Thu, Apr 2, 2020 at 5:28 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Do, 02.04.20 17:22, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > > You appear to be thinking about the "udisks" project or so? > > > > Probably. > > > > The real question is: is there a sane way to filter mount > > notifications so that systemd receives only those which it is > > interested in, rather than the tens of thousands that for example > > autofs is managing and has nothing to do with systemd? > > systemd cares about all mount points in PID1's mount namespace. > > The fact that mount tables can grow large is why we want something > better than constantly reparsing the whole /proc/self/mountinfo. But > filtering subsets of that is something we don't really care about. I can accept that, but you haven't given a reason why that's so. What does it do with the fact that an automount point was crossed, for example? How does that affect the operation of systemd? Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:35 ` Miklos Szeredi @ 2020-04-02 15:50 ` Lennart Poettering 2020-04-02 17:20 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-02 15:50 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Do, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote: > > systemd cares about all mount points in PID1's mount namespace. > > > > The fact that mount tables can grow large is why we want something > > better than constantly reparsing the whole /proc/self/mountinfo. But > > filtering subsets of that is something we don't really care about. > > I can accept that, but you haven't given a reason why that's so. > > What does it do with the fact that an automount point was crossed, for > example? How does that affect the operation of systemd? We don't care how a mount point came to be. If it's autofs or something else, we don't care. We don't access these mount points ourselves ever, we just watch their existence. I mean, it's not just about startup, it's also about shutdown. At shutdown we need to unmount everything from the leaves towards the root so that all file systems are in a clean state. And that means *all* mounts, even autofs ones, even udisks ones, or whatever else established them, we don't care. I mean, the autofs daemon can die any time, we still must be able to sensibly shut down, and thus unmount all mounts inside some autofs hierarchy at the right time, before unmounting the autofs top-level dir and then what might be further up the tree. systemd needs to know the whole tree, to figure out deps properly for things like that, hence we aren't interested in filtering, we are interested in minimizing what we do when something changes. 
Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
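A minimal sketch of the leaves-towards-root unmount ordering described in the message above. It is an illustration only, not systemd's code: it assumes mount points are given as plain absolute paths and uses path depth as a stand-in for position in the mount tree, which ignores stacked mounts and the service dependencies raised later in this thread. The names path_depth(), cmp_leaf_first() and order_for_unmount() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Number of path components: "/" -> 0, "/a" -> 1, "/a/b" -> 2. */
size_t path_depth(const char *path)
{
	size_t depth = 0;
	const char *p;

	for (p = path; *p; p++)
		if (*p == '/' && p[1] != '\0' && p[1] != '/')
			depth++;
	return depth;
}

/* qsort() comparator: deeper paths sort first, ties break alphabetically. */
int cmp_leaf_first(const void *a, const void *b)
{
	const char *pa = *(const char *const *)a;
	const char *pb = *(const char *const *)b;
	size_t da = path_depth(pa), db = path_depth(pb);

	if (da != db)
		return da > db ? -1 : 1;
	return strcmp(pa, pb);
}

/* Reorder the array in place so that leaves come before their parents. */
void order_for_unmount(const char **paths, size_t n)
{
	qsort(paths, n, sizeof(*paths), cmp_leaf_first);
}
```

Given the array { "/", "/net", "/net/host/export", "/net/host" }, order_for_unmount() places "/net/host/export" first and "/" last, so every mount is visited before the mount it sits on.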
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:50 ` Lennart Poettering @ 2020-04-02 17:20 ` Miklos Szeredi 2020-04-03 11:08 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-02 17:20 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Do, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > systemd cares about all mount points in PID1's mount namespace. > > > > > > The fact that mount tables can grow large is why we want something > > > better than constantly reparsing the whole /proc/self/mountinfo. But > > > filtering subsets of that is something we don't really care about. > > > > I can accept that, but you haven't given a reason why that's so. > > > > What does it do with the fact that an automount point was crossed, for > > example? How does that affect the operation of systemd? > > We don't care how a mount point came to be. If it's autofs or > something else, we don't care. We don't access these mount points > ourselves ever, we just watch their existance. > > I mean, it's not just about startup it's also about shutdown. At > shutdown we need to unmount everything from the leaves towards the > root so that all file systems are in a clean state. Unfortunately that's not guaranteed by umounting all filesystems from the init namespace. A filesystem is shut down when all references to it are gone. Perhaps you instead want to lazy unmount root (yeah, that may not actually be allowed, but anyway, lazy unmounting the top level ones should do) and watch for super block shutdown events instead. Does that make any sense? Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 17:20 ` Miklos Szeredi @ 2020-04-03 11:08 ` Lennart Poettering 2020-04-03 11:48 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-03 11:08 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Do, 02.04.20 19:20, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > On Do, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > > > systemd cares about all mount points in PID1's mount namespace. > > > > > > > > The fact that mount tables can grow large is why we want something > > > > better than constantly reparsing the whole /proc/self/mountinfo. But > > > > filtering subsets of that is something we don't really care about. > > > > > > I can accept that, but you haven't given a reason why that's so. > > > > > > What does it do with the fact that an automount point was crossed, for > > > example? How does that affect the operation of systemd? > > > > We don't care how a mount point came to be. If it's autofs or > > something else, we don't care. We don't access these mount points > > ourselves ever, we just watch their existance. > > > > I mean, it's not just about startup it's also about shutdown. At > > shutdown we need to unmount everything from the leaves towards the > > root so that all file systems are in a clean state. > > Unfortunately that's not guaranteed by umounting all filesystems from > the init namespace. A filesystem is shut down when all references to > it are gone. 
Perhaps you instead want to lazy unmount root (yeah, > that may not actually be allowed, but anyway, lazy unmounting the top > level ones should do) and watch for super block shutdown events > instead. > > Does that make any sense? When all mounts in the init mount namespace are unmounted and all remaining processes killed we switch root back to the initrd, so that even the root fs can be unmounted, and then we disassemble any backing complex storage if there is, i.e. lvm, luks, raid, … Because the initrd is its own little root fs independent of the actual root we can fully disassemble everything this way, as we do not retain any references to it anymore in any way. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:08 ` Lennart Poettering @ 2020-04-03 11:48 ` Miklos Szeredi 2020-04-03 15:01 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-03 11:48 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fri, Apr 3, 2020 at 1:08 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Do, 02.04.20 19:20, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > > > On Do, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > > > > > systemd cares about all mount points in PID1's mount namespace. > > > > > > > > > > The fact that mount tables can grow large is why we want something > > > > > better than constantly reparsing the whole /proc/self/mountinfo. But > > > > > filtering subsets of that is something we don't really care about. > > > > > > > > I can accept that, but you haven't given a reason why that's so. > > > > > > > > What does it do with the fact that an automount point was crossed, for > > > > example? How does that affect the operation of systemd? > > > > > > We don't care how a mount point came to be. If it's autofs or > > > something else, we don't care. We don't access these mount points > > > ourselves ever, we just watch their existance. > > > > > > I mean, it's not just about startup it's also about shutdown. At > > > shutdown we need to unmount everything from the leaves towards the > > > root so that all file systems are in a clean state. > > > > Unfortunately that's not guaranteed by umounting all filesystems from > > the init namespace. A filesystem is shut down when all references to > > it are gone. 
Perhaps you instead want to lazy unmount root (yeah, > > that may not actually be allowed, but anyway, lazy unmounting the top > > level ones should do) and watch for super block shutdown events > > instead. > > > > Does that make any sense? > > When all mounts in the init mount namespace are unmounted and all > remaining processes killed we switch root back to the initrd, so that > even the root fs can be unmounted, and then we disassemble any backing > complex storage if there is, i.e. lvm, luks, raid, … I think it could be done the other way round, much simpler: - switch back to initrd - umount root, keeping the tree intact (UMOUNT_DETACHED) - kill all remaining processes, wait for all to exit I think that should guarantee that all super blocks have been shut down. Al? The advantage would be that there's no need to walk the mount tree unmounting individual leafs, since it's all done automagically. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:48 ` Miklos Szeredi @ 2020-04-03 15:01 ` Lennart Poettering 2020-04-06 9:22 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-03 15:01 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fr, 03.04.20 13:48, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > Does that make any sense? > > > > When all mounts in the init mount namespace are unmounted and all > > remaining processes killed we switch root back to the initrd, so that > > even the root fs can be unmounted, and then we disassemble any backing > > complex storage if there is, i.e. lvm, luks, raid, … > > I think it could be done the other way round, much simpler: > > - switch back to initrd > - umount root, keeping the tree intact (UMOUNT_DETACHED) > - kill all remaining processes, wait for all to exit Nah. What I wrote above is drastically simplified. It's IRL more complex. Specific services need to be killed between certain mounts are unmounted, since they are a backend for another mount. NFS, or FUSE or stuff like that usually has some processes backing them around, and we need to stop the mounts they provide before these services, and then the mounts these services reside on after that, and so on. It's a complex dependency tree of stuff that needs to be done in order, so that we can deal with arbitrarily nested mounts, storage subsystems, and backing services. Anyway, this all works fine in systemd, the dependency logic is there. We want a more efficient way to watch mounts, that's all. Subscribing and constantly reparsing /proc/self/mountinfo is awful, that's all. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 15:01 ` Lennart Poettering @ 2020-04-06 9:22 ` Miklos Szeredi 2020-04-06 17:29 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-06 9:22 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fri, Apr 3, 2020 at 5:01 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Fr, 03.04.20 13:48, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > > Does that make any sense? > > > > > > When all mounts in the init mount namespace are unmounted and all > > > remaining processes killed we switch root back to the initrd, so that > > > even the root fs can be unmounted, and then we disassemble any backing > > > complex storage if there is, i.e. lvm, luks, raid, … > > > > I think it could be done the other way round, much simpler: > > > > - switch back to initrd > > - umount root, keeping the tree intact (UMOUNT_DETACHED) > > - kill all remaining processes, wait for all to exit > > Nah. What I wrote above is drastically simplified. It's IRL more > complex. Specific services need to be killed between certain mounts > are unmounted, since they are a backend for another mount. NFS, or > FUSE or stuff like that usually has some processes backing them > around, and we need to stop the mounts they provide before these > services, and then the mounts these services reside on after that, and > so on. It's a complex dependency tree of stuff that needs to be done > in order, so that we can deal with arbitrarily nested mounts, storage > subsystems, and backing services. That still doesn't explain why you need to keep track of all mounts in the system. If you are aware of the dependency, then you need to keep track of that particular mount. If not, then why? 
What I'm starting to see is that there's a fundamental conflict between how systemd people want to deal with new mounts and how some other people want to use mounts (i.e. tens of thousands of mounts in an automount map). I'm really curious how much the mount notification ring + per mount query (any implementation) can help that use case. > Anyway, this all works fine in systemd, the dependency logic is > there. We want a more efficient way to watch mounts, that's > all. Subscribing and constantly reparsing /proc/self/mountinfo is > awful, that's all. I'm not sure that is all. To handle storms of tens of thousands of mounts, my guess is that the fundamental way of dealing with these changes will need to be updated in systemd. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 9:22 ` Miklos Szeredi @ 2020-04-06 17:29 ` Lennart Poettering 2020-04-07 2:21 ` Ian Kent 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-06 17:29 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Mo, 06.04.20 11:22, Miklos Szeredi (miklos@szeredi.hu) wrote: > > Nah. What I wrote above is drastically simplified. It's IRL more > > complex. Specific services need to be killed between certain mounts > > are unmounted, since they are a backend for another mount. NFS, or > > FUSE or stuff like that usually has some processes backing them > > around, and we need to stop the mounts they provide before these > > services, and then the mounts these services reside on after that, and > > so on. It's a complex dependency tree of stuff that needs to be done > > in order, so that we can deal with arbitrarily nested mounts, storage > > subsystems, and backing services. > > That still doesn't explain why you need to keep track of all mounts in > the system. > > If you are aware of the dependency, then you need to keep track of > that particular mount. If not, then why? It works the other way round in systemd: something happens, i.e. a device pops up or a mount is established, and systemd figures out if there's something to do, i.e. whether services shall be pulled in or so. It's that way for a reason: there are plenty of services that want to be instantiated once for each object of a certain kind to pop up (this happens very often for devices, but could also happen for any other kind of "unit" systemd manages, and one of those kinds are mount units).
For those we don't know the unit to pull in yet (because it's not going to be a well-named singleton, but an instance incorporating some identifier from the source unit) when the unit that pops up does so, thus we can only wait for the latter to determine what to pull in. > What I'm starting to see is that there's a fundamental conflict > between how systemd people want to deal with new mounts and how some > other people want to use mounts (i.e. tens of thousands of mounts in > an automount map). Well, I am not sure what automount has to do with anything. You can have 10K mounts with or without automount, it's orthogonal to that. In fact, I assumed the point of automount was to pretend there are 10K mounts but not actually have them most of the time, no? I mean, whether there's room to optimize D-Bus IPC or not is entirely orthogonal to anything discussed here regarding fsinfo(). Don't make this about systemd sending messages over D-Bus, that's a very different story, and a non-issue if you ask me: Right now, when you have n mounts, and any mount changes, or one is added or removed then we have to parse the whole mount table again, asynchronously, processing all n entries again, every frickin time. This means the work to process n mounts popping up at boot is O(n²). That sucks, it should be obvious to anyone. Now if we get that fixed, by some mount API that can send us minimal notifications about what happened and where, then this becomes O(n), which is totally OK. You keep talking about filtering, which will just lower the "n" a bit in particular cases to some value "m" maybe (with m < n), it does not address the fact that O(m²) is still a big problem. Hence, filtering is great, no problem, add it if you want it. I personally don't care about filtering though, and I doubt we'd use it in systemd, I just care about the O(n²) issue. If you ask me if D-Bus can handle 10K messages sent over the bus during boot, then yes, it totally can handle that.
Can systemd nicely process O(n²) mounts internally though equally well? No, obviously not, if n grows too large. Any computer scientist should understand that. Anyway, I have the suspicion this discussion has stopped being useful. I think you are trying to fix problems that userspace actually doesn't have. I can just tell you what we understand the problems are, but if you are out trying to fix other perceived ones, then great, but I mostly lost interest. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
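To make the O(n²) versus O(n) argument in the message above concrete, here is a small counting sketch. It is not code from systemd or from the proposed kernel API; it only assumes that re-reading the mount table costs one unit per entry and that a per-mount notification costs one unit per event. Both function names are hypothetical.

```c
#include <stdint.h>

/*
 * n mounts appear one after another and the whole table is parsed again
 * after each one: 1 + 2 + ... + n = n(n+1)/2 entries in total.
 */
uint64_t entries_parsed_full_rescan(uint64_t n)
{
	uint64_t table_size = 0, parsed = 0, i;

	for (i = 0; i < n; i++) {
		table_size++;		/* one more mount shows up */
		parsed += table_size;	/* and every entry is parsed again */
	}
	return parsed;
}

/*
 * Same n mounts, but each event names the one mount that changed,
 * so only that entry has to be looked at.
 */
uint64_t entries_parsed_per_event(uint64_t n)
{
	return n;
}
```

For n = 10,000 the first function returns 50,005,000 while the second returns 10,000, which is the gap between quadratic and linear work that the message is about.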
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 17:29 ` Lennart Poettering @ 2020-04-07 2:21 ` Ian Kent 2020-04-07 13:59 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Ian Kent @ 2020-04-07 2:21 UTC (permalink / raw) To: Lennart Poettering, Miklos Szeredi Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Mon, 2020-04-06 at 19:29 +0200, Lennart Poettering wrote: > On Mo, 06.04.20 11:22, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > > Nah. What I wrote above is drastically simplified. It's IRL more > > > complex. Specific services need to be killed between certain > > > mounts > > > are unmounted, since they are a backend for another mount. NFS, > > > or > > > FUSE or stuff like that usually has some processes backing them > > > around, and we need to stop the mounts they provide before these > > > services, and then the mounts these services reside on after > > > that, and > > > so on. It's a complex dependency tree of stuff that needs to be > > > done > > > in order, so that we can deal with arbitrarily nested mounts, > > > storage > > > subsystems, and backing services. > > > > That still doesn't explain why you need to keep track of all mounts > > in > > the system. > > > > If you are aware of the dependency, then you need to keep track of > > that particular mount. If not, then why? > > it works the other way round in systemd: something happens, i.e. a > device pops up or a mount is established and systemd figures our if > there's something to do. i.e. whether services shall be pulled in or > so. > > It's that way for a reason: there are plenty services that want to > instantiated once for each object of a certain kind to pop up (this > happens very often for devices, but could also happen for any other > kind of "unit" systemd manages, and one of those kinds are mount > units). 
For those we don't know the unit to pull in yet (because it's > not going to be a well-named singleton, but an instance incorporating > some identifier from the source unit) when the unit that pops up does > so, thus we can only wait for the latter to determine what to > pull > in. > > > What I'm starting to see is that there's a fundamental conflict > > between how systemd people want to deal with new mounts and how > > some > > other people want to use mounts (i.e. tens of thousands of mounts > > in > > an automount map). > > Well, I am not sure what automount has to do with anything. You can > have 10K mounts with or without automount, it's orthogonal to that. > In > fact, I assumed the point of automount was to pretend there are 10K > mounts but not actually have them most of the time, no? Yes, but automount, when using a large direct mount map, will be the source of lots of mounts, each of which is an autofs file system. > > I mean, whether there's room to optimize D-Bus IPC or not is entirely > orthogonal to anything discussed here regarding fsinfo(). Don't make > this about systemd sending messages over D-Bus, that's a very > different story, and a non-issue if you ask me: Quite probably, yes, that's something you can care about if it really is an issue but isn't something I care about myself either. > > Right now, when you have n mounts, and any mount changes, or one is > added or removed then we have to parse the whole mount table again, > asynchronously, processing all n entries again, every frickin > time. This means the work to process n mounts popping up at boot is > O(n²). That sucks, it should be obvious to anyone. Now if we get that > fixed, by some mount API that can send us minimal notifications about > what happened and where, then this becomes O(n), which is totally OK. But this is clearly a problem and is what I do care about and the infrastructure being proposed here can be used to achieve this.
Unfortunately, and I was mistaken about what systemd does, I don't see a simple way of improving this. This is because it appears that systemd's having had to scan the entire mount table every time has necessarily led to code that can't easily accommodate getting the info for a single mount directly. So to improve this I think quite a few changes will be needed in systemd and libmount. I'm not quite sure how to get that started. After all it needs to be done how Karel would like to see it done in libmount and how systemd folks would like to see it done in systemd which is very probably not how I would approach it myself. > > You keep talking about filtering, which will just lower the "n" a bit > in particular cases to some value "m" maybe (with m < n), it does not > address the fact that O(m²) is still a big problem. > > hence, filtering is great, no problem, add it if you want it. I > personally don't care about filtering though, and I doubt we'd use it > in systemd, I just care about the O(n²) issue. > > If you ask me if D-Bus can handle 10K messages sent over the bus > during boot, then yes, it totally can handle that. Can systemd nicely > process O(n²) mounts internally though equally well? No, obviously > not, > if n grows too large. Any computer scientist should understand > that. > > Anyway, I have the suspicion this discussion has stopped being > useful. I think you are trying to fix problems that userspace actually > doesn't have. I can just tell you what we understand the problems > are, > but if you are out trying to fix other perceived ones, then great, > but > I mostly lost interest. Yes, filtering sounds like we've wandered off topic, ;) Ian ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-07 2:21 ` Ian Kent @ 2020-04-07 13:59 ` Miklos Szeredi 2020-04-07 15:53 ` Lennart Poettering 0 siblings, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-07 13:59 UTC (permalink / raw) To: Ian Kent Cc: Lennart Poettering, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai [-- Attachment #1: Type: text/plain, Size: 2797 bytes --] On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote: > > Right now, when you have n mounts, and any mount changes, or one is > > added or removed then we have to parse the whole mount table again, > > asynchronously, processing all n entries again, every frickin > > time. This means the work to process n mounts popping up at boot is > > O(n²). That sucks, it should be obvious to anyone. Now if we get that > > fixed, by some mount API that can send us minimal notifications about > > what happened and where, then this becomes O(n), which is totally OK. Something's not right with the above statement. Hint: if there are lots of events in quick succession, you can batch them quite easily to prevent overloading the system. Wrote a pair of utilities to check out the capabilities of the current API. The first one just creates N mounts, optionally sleeping between each. The second one watches /proc/self/mountinfo and generates individual (add/del/change) events based on POLLPRI and comparing contents with previous instance. First use case: create 10,000 mounts, then start the watcher and create 1000 mounts with a 50ms sleep between them. Total time (user + system) consumed by the watcher: 25s. This is indeed pretty dismal, and a per-mount query will help tremendously. But it's still "just" 25ms per mount, so if the mounts are far apart (which is what this test is about), this won't thrash the system. 
Note how this is self-regulating: if the load is high, it will automatically batch more requests, preventing overload. It is also prone to lose pairs of add + remove in these cases (and so is the ring buffer based one from David). Second use case: start the watcher and create 50,000 mounts with no sleep between them. Total time consumed by the watcher: 0.154s or 3.08us/event. Note, the same test case adds about 5ms for the 50,000 umount events, which is 0.1us/event. Real life will probably be between these extremes, but it's clear that there's room for improvement in userspace as well as kernel interfaces. The current kernel interface is very efficient in retrieving a lot of state in one go. It is not efficient in handling small differences. > > Anyway, I have the suspicion this discussion has stopped being > > useful. I think you are trying to fix problems that userspace actually > > doesn't have. I can just tell you what we understand the problems > > are, > > but if you are out trying to fix other perceived ones, then great, > > but > > I mostly lost interest. I was, and still am, trying to see the big picture. Whatever. I think it's your turn to show some numbers about how the new API improves performance of systemd with a large number of mounts.
Thanks,
Miklos

[-- Attachment #2: many-mounts.c --]
[-- Type: text/x-csrc, Size: 1155 bytes --]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <err.h>
#include <sys/stat.h>
#include <sys/mount.h>

int main(int argc, char *argv[])
{
	char *base_path = argv[1];
	char name[4096];
	int nr_mounts, i, sleep_ms = 0;

	if (argc < 3 || argc > 4)
		errx(1, "usage: %s base_path nr_mounts [sleep_ms]", argv[0]);

	nr_mounts = atoi(argv[2]);
	if (argc > 3)
		sleep_ms = atoi(argv[3]);

	fprintf(stderr, "Mounting...\n");
	if (mount("none", base_path, "tmpfs", 0, NULL) == -1)
		err(1, "mount/tmpfs");
	if (mount("none", base_path, NULL, MS_PRIVATE, NULL) == -1)
		err(1, "mount/MS_PRIVATE");
	for (i = 0; i < nr_mounts; i++) {
		sprintf(name, "%s/%d", base_path, i);
		if (mkdir(name, 0755) == -1)
			err(1, "mkdir");
		if (mount("none", name, "tmpfs", 0, NULL) == -1)
			err(1, "mount/tmpfs");
		if (mount("none", name, NULL, MS_PRIVATE, NULL) == -1)
			err(1, "mount/MS_PRIVATE");
		if (sleep_ms)
			usleep(sleep_ms * 1000);
	}
	fprintf(stderr, "Press ENTER\n");
	getchar();
	fprintf(stderr, "Unmounting...\n");
	if (umount2(base_path, MNT_DETACH) == -1)
		err(1, "umount");
	fprintf(stderr, "Done\n");

	return 0;
}

[-- Attachment #3: watch_mounts.c --]
[-- Type: text/x-csrc, Size: 3380 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <err.h>

struct index {
	struct index *next;
	struct index *prev;
	const char *line;
};

struct state {
	size_t bufsize;
	char *buf;
	size_t index_size;
	struct index *index;
	struct index head;
};

static void read_mountinfo(struct pollfd *pfd, char *buf, size_t bufsize)
{
	int readcnt, backoff = 0, retry = 0;
	size_t len;
	ssize_t res;

retry:
	if (lseek(pfd->fd, 0, SEEK_SET) == (off_t) -1)
		err(1, "lseek");
	len = 0;
	readcnt = 0;
	do {
		if (len >= bufsize - 4096)
			errx(1, "buffer overrun");
		res = read(pfd->fd, buf + len, bufsize - len);
		if (res == -1)
			err(1, "read");
		len += res;

		if (!res || !(++readcnt % 16)) {
			if (poll(pfd, 1, 0) == -1)
				err(1, "poll/0");
			if (pfd->revents & POLLPRI) {
				if (!backoff) {
					backoff++;
					goto retry;
				}
				if (!retry) {
					fprintf(stderr, "retry.");
					retry = 1;
				}
				do {
					usleep(backoff * 1000);
					if (backoff < 128)
						backoff *= 2;
					if (poll(pfd, 1, 0) == -1)
						err(1, "poll/0");
				} while (pfd->revents & POLLPRI);
				goto retry;
			}
		}
	} while (res);
	buf[len] = '\0';

	if (retry) {
		fprintf(stderr, "..\n");
		retry = 0;
	}
}

static void add_index(struct state *s, struct index *this, const char *line)
{
	struct index *prev = s->head.prev, *next = &s->head;

	if (this->line)
		errx(1, "index corruption");

	this->line = line;
	this->next = next;
	this->prev = prev;
	prev->next = next->prev = this;
}

static void del_index(struct index *this)
{
	struct index *prev = this->prev, *next = this->next;

	this->line = NULL;
	prev->next = next;
	next->prev = prev;
}

static void diff_mountinfo(struct state *old, struct state *cur)
{
	char *line, *end;
	struct index *this;
	int mntid;

	cur->head.next = cur->head.prev = &cur->head;
	for (line = cur->buf; line[0]; line = end + 1) {
		end = strchr(line, '\n');
		if (!end)
			errx(1, "parsing (1)");
		*end = '\0';
		if (sscanf(line, "%i", &mntid) != 1)
			errx(1, "parsing (2)");
		if (mntid < 0 || (size_t) mntid >= cur->index_size)
			errx(1, "index overflow");
		add_index(cur, &cur->index[mntid], line);

		this = &old->index[mntid];
		if (this->line) {
			if (strcmp(this->line, line))
				printf("* %s\n", line);
			del_index(this);
		} else {
			printf("+ %s\n", line);
		}
	}
	while (old->head.next != &old->head) {
		this = old->head.next;
		printf("- %s\n", this->line);
		del_index(this);
	}
	fflush(stdout);
}

int main(void)
{
	struct state state[2], *old = &state[0], *cur = &state[1], *tmp;
	struct pollfd pfd = { .events = POLLPRI };

	old->index_size = cur->index_size = 131072;
	old->bufsize = cur->bufsize = cur->index_size * 128;
	old->index = calloc(old->index_size, sizeof(struct index));
	cur->index = calloc(cur->index_size, sizeof(struct index));
	old->buf = malloc(old->bufsize);
	cur->buf = malloc(cur->bufsize);
	if (!old->index || !cur->index || !old->buf || !cur->buf)
		err(1, "allocating buffers");

	old->buf[0] = '\0';
	old->head.prev = old->head.next = &old->head;

	pfd.fd = open("/proc/self/mountinfo", O_RDONLY);
	if (pfd.fd == -1)
		err(1, "open");

	while (1) {
		read_mountinfo(&pfd, cur->buf, cur->bufsize);
		diff_mountinfo(old, cur);

		tmp = old;
		old = cur;
		cur = tmp;

		if (poll(&pfd, 1, -1) == -1)
			err(1, "poll/inf");
	}
}

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-07 13:59 ` Miklos Szeredi @ 2020-04-07 15:53 ` Lennart Poettering 2020-04-07 16:06 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-07 15:53 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Di, 07.04.20 15:59, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote: > > > Right now, when you have n mounts, and any mount changes, or one is > > > added or removed then we have to parse the whole mount table again, > > > asynchronously, processing all n entries again, every frickin > > > time. This means the work to process n mounts popping up at boot is > > > O(n²). That sucks, it should be obvious to anyone. Now if we get that > > > fixed, by some mount API that can send us minimal notifications about > > > what happened and where, then this becomes O(n), which is totally OK. > > Something's not right with the above statement. Hint: if there are > lots of events in quick succession, you can batch them quite easily to > prevent overloading the system. > > Wrote a pair of utilities to check out the capabilities of the current > API. The first one just creates N mounts, optionally sleeping > between each. The second one watches /proc/self/mountinfo and > generates individual (add/del/change) events based on POLLPRI and > comparing contents with previous instance. > > First use case: create 10,000 mounts, then start the watcher and > create 1000 mounts with a 50ms sleep between them. Total time (user + > system) consumed by the watcher: 25s. This is indeed pretty dismal, > and a per-mount query will help tremendously. 
But it's still "just"
> 25ms per mount, so if the mounts are far apart (which is what this
> test is about), this won't thrash the system. Note, how this is self
> regulating: if the load is high, it will automatically batch more
> requests, preventing overload. It is also prone to lose pairs of add
> + remove in these case (and so is the ring buffer based one from
> David).

We will batch requests too in systemd, of course, necessarily, given
that the /proc/self/mountinfo inotify stuff is async. The thing is,
though, that this means we buy lower CPU usage, working around the
O(n²) issue, by introducing artificially higher latencies. We usually
want to boot quickly, and not artificially slowly.

Sure, one can come up with some super smart scheme for tweaking the
artificial latencies, how to grow them, how to shrink them, depending
on a perceived flood of events, some backing-off scheme. But that's
just polishing a turd, if all we want is proper queued change
notification without the O(n²) behaviour.

I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By
coalescing wake-up events you just lower the n again, probably
linearly, but that still means we pay O(n²), which sucks.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread
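To put rough numbers on the argument above, here is a back-of-the-envelope sketch in C. It is not code from this thread: the function name `entries_parsed` and the batching model (the watcher re-reads the whole table once every `batch` added mounts) are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: the mount table grows from 0 to n entries, and the
 * watcher re-parses the full table after every `batch` additions, plus
 * once more at the end if a partial batch is left over.
 * Returns the total number of table entries parsed.
 */
uint64_t entries_parsed(uint64_t n, uint64_t batch)
{
	uint64_t total = 0;
	uint64_t size;

	assert(batch > 0);

	for (size = batch; size <= n; size += batch)
		total += size;		/* one full re-parse of `size` entries */
	if (n % batch)
		total += n;		/* final, partial batch */

	return total;
}
```

Under this model, n = 1000 with batch = 1 parses 500500 entries in total, and batch = 10 parses 50500. Batching divides the total by roughly the batch size, but the growth in n stays quadratic, whereas a per-mount notification would need on the order of n lookups.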
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-07 15:53 ` Lennart Poettering @ 2020-04-07 16:06 ` Miklos Szeredi 0 siblings, 0 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-07 16:06 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Tue, Apr 7, 2020 at 5:53 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Di, 07.04.20 15:59, Miklos Szeredi (miklos@szeredi.hu) wrote: > > > On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote: > > > > Right now, when you have n mounts, and any mount changes, or one is > > > > added or removed then we have to parse the whole mount table again, > > > > asynchronously, processing all n entries again, every frickin > > > > time. This means the work to process n mounts popping up at boot is > > > > O(n²). That sucks, it should be obvious to anyone. Now if we get that > > > > fixed, by some mount API that can send us minimal notifications about > > > > what happened and where, then this becomes O(n), which is totally OK. > > > > Something's not right with the above statement. Hint: if there are > > lots of events in quick succession, you can batch them quite easily to > > prevent overloading the system. > > > > Wrote a pair of utilities to check out the capabilities of the current > > API. The first one just creates N mounts, optionally sleeping > > between each. The second one watches /proc/self/mountinfo and > > generates individual (add/del/change) events based on POLLPRI and > > comparing contents with previous instance. > > > > First use case: create 10,000 mounts, then start the watcher and > > create 1000 mounts with a 50ms sleep between them. Total time (user + > > system) consumed by the watcher: 25s. This is indeed pretty dismal, > > and a per-mount query will help tremendously. 
But it's still "just" > > 25ms per mount, so if the mounts are far apart (which is what this > > test is about), this won't thrash the system. Note, how this is self > > regulating: if the load is high, it will automatically batch more > > requests, preventing overload. It is also prone to lose pairs of add > > + remove in these case (and so is the ring buffer based one from > > David). > > We will batch requests too in systemd, of course, necessarily, given > that the /p/s/mi inotify stuff is async. Thing though is that this > means we buy lower CPU usage — working around the O(n²) issue — by > introducing artifical higher latencies. We usually want to boot > quickly, and not artificially slow. > > Sure one can come up with some super smart scheme how to tweak the > artifical latencies, how to grow them, how to shrink them, depending > on a perceived flood of events, some backing off scheme. But that's > just polishing a turd, if all we want is proper queued change > notification without the O(n²) behaviour. > > I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By > coalescing wake-up events you just lower the n again, probably > linearly, but that still means we pay O(n²), which sucks. Obviously. But you keep ignoring event queue overflows; it's basically guaranteed to happen with a sizable mount storm and then you are back to O(n^2). Give it some testing please, as Linus is not going to take any solution without an actual use case and testing. When you come back and say that fsinfo(2) works fine with systemd and a 100k mount/umount storm, then we can have a look at the performance budget and revisit the fine points of API design. Thanks, Miklos ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22 ` Miklos Szeredi
  2020-04-02 15:28   ` Lennart Poettering
@ 2020-04-02 15:51   ` David Howells
  2020-04-02 15:56   ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:51 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

Lennart Poettering <mzxreary@0pointer.de> wrote:

> systemd cares about all mount points in PID1's mount namespace.
>
> The fact that mount tables can grow large is why we want something
> better than constantly reparsing the whole /proc/self/mountinfo. But
> filtering subsets of that is something we don't really care about.

With the notifications stuff I've done, you can do, for example:

	pipe2(pipefd, O_NOTIFICATION_PIPE);
	ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_SIZE, 256);
	watch_mount(AT_FDCWD, "/", 0, pipefd[0], 0x02);

And that will catch all mount object changes in the subtree rooted at the
given path, in this case "/".

If you want to limit it to just the notifications on that mount, you would
need to install a filter:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_MOUNT_NOTIFY,
				.subtype_filter[0]= UINT_MAX,
				.info_mask = NOTIFY_MOUNT_IS_RECURSIVE,
				.info_filter = 0,
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);

Note that this doesn't monitor for superblock changes and events. They must
be watched individually with something like:

	watch_sb(AT_FDCWD, "/afs", AT_NO_AUTOMOUNT, pipefd[0], 0x27);

David

^ permalink raw reply	[flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 15:22 ` Miklos Szeredi 2020-04-02 15:28 ` Lennart Poettering 2020-04-02 15:51 ` David Howells @ 2020-04-02 15:56 ` David Howells 2 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-04-02 15:56 UTC (permalink / raw) To: Lennart Poettering Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai David Howells <dhowells@redhat.com> wrote: > .info_mask = NOTIFY_MOUNT_IS_RECURSIVE, Sorry, I meant NOTIFY_MOUNT_IN_SUBTREE; NOTIFY_MOUNT_IS_RECURSIVE indicates that the operation was recursive in nature. David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-02 13:52 ` Miklos Szeredi 2020-04-02 14:36 ` Lennart Poettering @ 2020-04-03 1:44 ` Ian Kent 2020-04-03 11:11 ` Lennart Poettering 1 sibling, 1 reply; 99+ messages in thread From: Ian Kent @ 2020-04-03 1:44 UTC (permalink / raw) To: Miklos Szeredi Cc: David Howells, Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Thu, 2020-04-02 at 15:52 +0200, Miklos Szeredi wrote: > On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <raven@themaw.net> wrote: > > On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote: > > > On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com > > > > > > > wrote: > > > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > > I've still not heard a convincing argument in favor of a > > > > > syscall. > > > > > > > > From your own results, scanning 10000 mounts through mountfs > > > > and > > > > reading just > > > > two values from each is an order of magnitude slower without > > > > the > > > > effect of the > > > > dentry/inode caches. It gets faster on the second run because > > > > the > > > > mountfs > > > > dentries and inodes are cached - but at a cost of >205MiB of > > > > RAM. And it's > > > > *still* slower than fsinfo(). > > > > > > Already told you that we can just delete the dentry on > > > dput_final, so > > > the memory argument is immaterial. > > > > > > And the speed argument also, because there's no use case where > > > that > > > would make a difference. You keep bringing up the notification > > > queue > > > overrun when watching a subtree, but that's going to be painful > > > with > > > fsinfo(2) as well. If that's a relevant use case (not saying > > > it's > > > true), might as well add a /mnt/MNT_ID/subtree_info (trivial > > > again) > > > that contains all information for the subtree. 
Have fun > > > implementing > > > that with fsinfo(2). > > > > Forgive me for not trawling through your patch to work this out > > but how does a poll on a path get what's needed to get mount info. > > > > Or, more specifically, how does one get what's needed to go > > directly > > to the place to get mount info. when something in the tree under > > the > > polled path changes (mount/umount). IIUC poll alone won't do > > subtree > > change monitoring? > > The mechanisms are basically the same as with fsinfo(2). You can > get > to the mountfs entry through the mount ID or through a proc/fd/ type > symlink. So if you have a path, there are two options: > > - find out the mount ID belonging to that path and go to > /mountfs/$mntid/ > - open the path with fd = open(path, O_PATH) and the go to > /proc/self/fdmount/$fd/ > > Currently the only way to find the mount id from a path is by parsing > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to > return it directly from a path. Also the mount notification queue > that David implemented contains the mount ID of the changed mount. I'm aware the mount id comes through David's notifications, I was wondering how to get that via your recommendation, thanks. In your scheme it sounds like the mount id doesn't hold the importance it deserves, it's central to the whole idea of getting information about these mounts. But it sounds like you need to open fds to paths you might not know to find it ... Your explanation wasn't clear on how one gets notifications of events within a tree under a mount you've opened an fd on to get events? > > > Don't get me wrong, neither the proc nor the fsinfo implementations > > deal with the notification storms that cause much of the problem we > > see now. > > > > IMHO that's a separate and very difficult problem in itself that > > can't even be considered until getting the information efficiently > > is resolved. > > This mount notification storm issue got me thinking. 
If I > understand > correctly, systemd wants mount notifications so that it can do the > desktop pop-up thing. Is that correct? > > But that doesn't apply to automounts at all. A new mount performed > by > automount is uninteresting to to desktops, since it's triggered by > crossing the automount point (i.e. a normal path lookup), not an > external event like inserting a usb stick, etc... > > Am I missing something? Yeah, you're not missing anything. Unfortunately, in a recent discussion on the autofs mailing list, an investigation showed that systemd does want/get events for autofs mounts and proceeds to issue around a 100 or so events on the d-bus for every one. > > Maybe the solution is to just allow filtering out such notifications > at the source, so automount triggers don't generate events for > systemd. Except that autofs automounts might be expected to be seen on a desktop, that's not out of the question I guess. Ian ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 1:44 ` Ian Kent @ 2020-04-03 11:11 ` Lennart Poettering 2020-04-03 11:38 ` Miklos Szeredi 0 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-03 11:11 UTC (permalink / raw) To: Ian Kent Cc: Miklos Szeredi, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote: > > Currently the only way to find the mount id from a path is by parsing > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to > > return it directly from a path. Also the mount notification queue > > that David implemented contains the mount ID of the changed mount. I would love to have the mount ID exposed via statx(). In systemd we generally try name_to_handle_at() to query the mount ID first. It returns both the actual fhandle and the mount ID after all, and we then throw the fhandle away. It's not available on all fs though, but it has the benefit that it works without procfs and on a number of older kernels that didn't expose the mnt id in fdinfo. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:11 ` Lennart Poettering @ 2020-04-03 11:38 ` Miklos Szeredi 2020-04-03 12:05 ` Richard Weinberger ` (2 more replies) 0 siblings, 3 replies; 99+ messages in thread From: Miklos Szeredi @ 2020-04-03 11:38 UTC (permalink / raw) To: Lennart Poettering Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai [-- Attachment #1: Type: text/plain, Size: 535 bytes --] On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote: > > > > Currently the only way to find the mount id from a path is by parsing > > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to > > > return it directly from a path. Also the mount notification queue > > > that David implemented contains the mount ID of the changed mount. > > I would love to have the mount ID exposed via statx(). Here's a patch. Thanks, Miklos [-- Attachment #2: statx-add-mount-id.patch --] [-- Type: text/x-patch, Size: 2253 bytes --] From: Miklos Szeredi <mszeredi@redhat.com> Subject: statx: add mount ID Systemd is hacking around to get it and it's trivial to add to statx, so... 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/stat.c                 |    6 ++++++
 include/linux/stat.h      |    1 +
 include/uapi/linux/stat.h |    6 +++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

--- a/fs/stat.c
+++ b/fs/stat.c
@@ -22,6 +22,7 @@
 #include <asm/unistd.h>
 
 #include "internal.h"
+#include "mount.h"
 
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
@@ -199,6 +200,10 @@ int vfs_statx(int dfd, const char __user
 		goto out;
 
 	error = vfs_getattr(&path, stat, request_mask, flags);
+	if (!error) {
+		stat->mnt_id = real_mount(path.mnt)->mnt_id;
+		stat->result_mask |= STATX_MNT_ID;
+	}
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -563,6 +568,7 @@ cp_statx(const struct kstat *stat, struc
 	tmp.stx_rdev_minor = MINOR(stat->rdev);
 	tmp.stx_dev_major = MAJOR(stat->dev);
 	tmp.stx_dev_minor = MINOR(stat->dev);
+	tmp.stx_mnt_id = stat->mnt_id;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -47,6 +47,7 @@ struct kstat {
 	struct timespec64 ctime;
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
+	u64		mnt_id;
 };
 
 #endif
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -123,7 +123,10 @@ struct statx {
 	__u32	stx_dev_major;	/* ID of device containing file [uncond] */
 	__u32	stx_dev_minor;
 	/* 0x90 */
-	__u64	__spare2[14];	/* Spare space for future expansion */
+	__u64	stx_mnt_id;
+	__u64	__spare2;
+	/* 0xa0 */
+	__u64	__spare3[12];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -148,6 +151,7 @@ struct statx {
 #define STATX_BLOCKS		0x00000400U	/* Want/got stx_blocks */
 #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
+#define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_ALL		0x00000fffU	/* All currently supported flags */
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */

^
permalink raw reply [flat|nested] 99+ messages in thread
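For illustration, here is a minimal userspace sketch of how the mount ID might be read back through statx(). It assumes a kernel and libc headers that carry the `stx_mnt_id` field and the `STATX_MNT_ID` bit in the form proposed in the patch above; `get_mount_id` is a hypothetical helper name and does not come from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: store the mount ID of `path` in *id.
 * Returns 0 on success, or -1 if statx() failed or the kernel did not
 * report STATX_MNT_ID in the result mask.
 */
int get_mount_id(const char *path, uint64_t *id)
{
	struct statx stx;

	assert(path && id);

	if (statx(AT_FDCWD, path, AT_NO_AUTOMOUNT, STATX_MNT_ID, &stx) == -1)
		return -1;
	if (!(stx.stx_mask & STATX_MNT_ID))
		return -1;	/* kernel too old to fill in stx_mnt_id */

	*id = stx.stx_mnt_id;
	return 0;
}
```

Checking `stx_mask` before trusting the field matters here: an older kernel would leave the spare space zeroed, and the result would otherwise look like a valid mount ID of 0.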
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:38 ` Miklos Szeredi @ 2020-04-03 12:05 ` Richard Weinberger 2020-04-03 15:12 ` Lennart Poettering 2020-04-03 15:36 ` David Howells 2 siblings, 0 replies; 99+ messages in thread From: Richard Weinberger @ 2020-04-03 12:05 UTC (permalink / raw) To: Miklos Szeredi Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, LKML, Aleksa Sarai On Fri, Apr 3, 2020 at 1:40 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote: > > > > > > Currently the only way to find the mount id from a path is by parsing > > > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to > > > > return it directly from a path. Also the mount notification queue > > > > that David implemented contains the mount ID of the changed mount. > > > > I would love to have the mount ID exposed via statx(). > > Here's a patch. I was looking more than once for a nice way to get the mount id. Having it exposed via statx() would be great! -- Thanks, //richard ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:38 ` Miklos Szeredi 2020-04-03 12:05 ` Richard Weinberger @ 2020-04-03 15:12 ` Lennart Poettering 2020-04-03 20:30 ` J. Bruce Fields 2020-04-03 15:36 ` David Howells 2 siblings, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-03 15:12 UTC (permalink / raw) To: Miklos Szeredi Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fr, 03.04.20 13:38, Miklos Szeredi (miklos@szeredi.hu) wrote: > On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote: > > > > On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote: > > > > > > Currently the only way to find the mount id from a path is by parsing > > > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to > > > > return it directly from a path. Also the mount notification queue > > > > that David implemented contains the mount ID of the changed mount. > > > > I would love to have the mount ID exposed via statx(). > > Here's a patch. Oh, this is excellent. I love it, thanks! BTW, while we are at it: one more thing I'd love to see exposed by statx() is a simple flag whether the inode is a mount point. There's plenty code that implements a test like this all over the place, and it usually isn't very safe. There's one implementation in util-linux for example (in the /usr/bin/mountpoint binary), and another one in systemd. Would be awesome to just have a statx() return flag for that, that would make things *so* much easier and more robust. because in fact most code isn't very good that implements this, as much of it just compares st_dev of the specified file and its parent. Better code compares the mount ID, but as mentioned that's not as pretty as it could be so far... 
Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 15:12 ` Lennart Poettering @ 2020-04-03 20:30 ` J. Bruce Fields 2020-04-06 8:35 ` Miklos Szeredi 2020-04-06 9:17 ` Karel Zak 0 siblings, 2 replies; 99+ messages in thread From: J. Bruce Fields @ 2020-04-03 20:30 UTC (permalink / raw) To: Lennart Poettering Cc: Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote: > BTW, while we are at it: one more thing I'd love to see exposed by > statx() is a simple flag whether the inode is a mount point. There's > plenty code that implements a test like this all over the place, and > it usually isn't very safe. There's one implementation in util-linux > for example (in the /usr/bin/mountpoint binary), and another one in > systemd. Would be awesome to just have a statx() return flag for that, > that would make things *so* much easier and more robust. because in > fact most code isn't very good that implements this, as much of it > just compares st_dev of the specified file and its parent. Better code > compares the mount ID, but as mentioned that's not as pretty as it > could be so far... nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file and ".." and returns true if they have different st_dev or the same st_ino. Comparing mount ids sounds better. So anyway, yes, everybody reinvents the wheel here, and this would be useful. (And, yes, we want to know for the vfsmount, we don't care whether the same inode is used as a mountpoint someplace else.) --b. ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 20:30 ` J. Bruce Fields @ 2020-04-06 8:35 ` Miklos Szeredi 2020-04-06 16:07 ` J. Bruce Fields 2020-04-06 9:17 ` Karel Zak 1 sibling, 1 reply; 99+ messages in thread From: Miklos Szeredi @ 2020-04-06 8:35 UTC (permalink / raw) To: J. Bruce Fields Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai [-- Attachment #1: Type: text/plain, Size: 2010 bytes --] On Fri, Apr 3, 2020 at 10:30 PM J. Bruce Fields <bfields@fieldses.org> wrote: > > On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote: > > BTW, while we are at it: one more thing I'd love to see exposed by > > statx() is a simple flag whether the inode is a mount point. There's > > plenty code that implements a test like this all over the place, and > > it usually isn't very safe. There's one implementation in util-linux > > for example (in the /usr/bin/mountpoint binary), and another one in > > systemd. Would be awesome to just have a statx() return flag for that, > > that would make things *so* much easier and more robust. because in > > fact most code isn't very good that implements this, as much of it > > just compares st_dev of the specified file and its parent. Better code > > compares the mount ID, but as mentioned that's not as pretty as it > > could be so far... > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > and ".." and returns true if they have different st_dev or the same > st_ino. Comparing mount ids sounds better. > > So anyway, yes, everybody reinvents the wheel here, and this would be > useful. (And, yes, we want to know for the vfsmount, we don't care > whether the same inode is used as a mountpoint someplace else.) Attaching a patch. 
There's some ambiguity about what is a "mountpoint" and what these tools
are interested in. My guess is that they are not interested in an object
being a mount point (something where another object is mounted) but being
a mount root (this is the object mounted at the mount point). I.e

	fd = open("/mnt", O_PATH);
	mount("/bin", "/mnt", NULL, MS_BIND, NULL);
	statx(AT_FDCWD, "/mnt", 0, 0, &stx1);
	statx(fd, "", AT_EMPTY_PATH, 0, &stx2);
	printf("mount_root(/mnt) = %c, mount_root(fd) = %c\n",
	       stx1.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n',
	       stx2.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n');

Would print:

	mount_root(/mnt) = y, mount_root(fd) = n

Thanks,
Miklos

[-- Attachment #2: statx-add-mount_root.patch --]
[-- Type: text/x-patch, Size: 1520 bytes --]

From: Miklos Szeredi <mszeredi@redhat.com>
Subject: statx: add mount_root

Determining whether a path or file descriptor refers to a mountpoint (or
more precisely a mount root) is not trivial using current tools.

Add a flag to statx that indicates whether the path or fd refers to the
root of a mount or not.

Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/stat.c                 |    3 +++
 include/uapi/linux/stat.h |    1 +
 2 files changed, 4 insertions(+)

--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -172,6 +172,7 @@ struct statx {
 #define STATX_ATTR_NODUMP		0x00000040 /* [I] File is not to be dumped */
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
+#define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 
 
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -202,6 +202,9 @@ int vfs_statx(int dfd, const char __user
 	error = vfs_getattr(&path, stat, request_mask, flags);
 	stat->mnt_id = real_mount(path.mnt)->mnt_id;
 	stat->result_mask |= STATX_MNT_ID;
+	if (path.mnt->mnt_root == path.dentry)
+		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
+	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;

^ permalink raw reply	[flat|nested] 99+ messages in thread
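As a sketch of what a userspace "is this a mount root?" test could look like on top of this flag: the helper name `is_mount_root` is hypothetical, and the fallback `#define` simply reuses the value proposed in the patch above for builds against older headers. It assumes a kernel that sets the attribute as described.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Fallback for older headers; the value is the one proposed in the patch. */
#ifndef STATX_ATTR_MOUNT_ROOT
#define STATX_ATTR_MOUNT_ROOT 0x00002000
#endif

/*
 * Hypothetical helper. Returns 1 if `path` is the root of a mount, 0 if
 * it is not, and -1 if statx() failed or the kernel does not advertise
 * the attribute in stx_attributes_mask.
 */
int is_mount_root(const char *path)
{
	struct statx stx;

	assert(path);

	if (statx(AT_FDCWD, path, AT_NO_AUTOMOUNT, 0, &stx) == -1)
		return -1;
	/* attributes_mask tells us which attribute bits are supported */
	if (!(stx.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
		return -1;

	return (stx.stx_attributes & STATX_ATTR_MOUNT_ROOT) ? 1 : 0;
}
```

Consulting `stx_attributes_mask` first lets a caller tell "not a mount root" apart from "this kernel cannot say", so it can fall back to an st_dev or mount ID comparison in the second case.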
* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06  8:35 ` Miklos Szeredi
@ 2020-04-06 16:07   ` J. Bruce Fields
  0 siblings, 0 replies; 99+ messages in thread
From: J. Bruce Fields @ 2020-04-06 16:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

The patch makes sense to me, thanks!

In the NFS case it's implementing the "mountpoint" export option:

       mountpoint=path

       mp     This option makes it possible to only export a directory if it
              has successfully been mounted. If no path is given (e.g.
              mountpoint or mp) then the export point must also be a mount
              point. If it isn't then the export point is not exported.
              This allows you to be sure that the directory underneath a
              mountpoint will never be exported by accident if, for example,
              the filesystem failed to mount due to a disc error.

              If a path is given (e.g. mountpoint=/path or mp=/path) then
              the nominated path must be a mountpoint for the exportpoint to
              be exported.

--b.

On Mon, Apr 06, 2020 at 10:35:55AM +0200, Miklos Szeredi wrote:
> From: Miklos Szeredi <mszeredi@redhat.com>
> Subject: statx: add mount_root
>
> Determining whether a path or file descriptor refers to a mountpoint (or
> more precisely a mount root) is not trivial using current tools.
>
> Add a flag to statx that indicates whether the path or fd refers to the
> root of a mount or not.
>
> Reported-by: Lennart Poettering <mzxreary@0pointer.de>
> Reported-by: J. Bruce Fields <bfields@fieldses.org>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/stat.c                 |    3 +++
>  include/uapi/linux/stat.h |    1 +
>  2 files changed, 4 insertions(+)
>
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -172,6 +172,7 @@ struct statx {
>  #define STATX_ATTR_NODUMP		0x00000040 /* [I] File is not to be dumped */
>  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
>  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> +#define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
>  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
>
>
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -202,6 +202,9 @@ int vfs_statx(int dfd, const char __user
>  	error = vfs_getattr(&path, stat, request_mask, flags);
>  	stat->mnt_id = real_mount(path.mnt)->mnt_id;
>  	stat->result_mask |= STATX_MNT_ID;
> +	if (path.mnt->mnt_root == path.dentry)
> +		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
> +	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
>  	path_put(&path);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;

^ permalink raw reply	[flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 20:30 ` J. Bruce Fields 2020-04-06 8:35 ` Miklos Szeredi @ 2020-04-06 9:17 ` Karel Zak 2020-04-06 16:34 ` Linus Torvalds 1 sibling, 1 reply; 99+ messages in thread From: Karel Zak @ 2020-04-06 9:17 UTC (permalink / raw) To: J. Bruce Fields Cc: Lennart Poettering, Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote: > On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote: > > BTW, while we are at it: one more thing I'd love to see exposed by > > statx() is a simple flag whether the inode is a mount point. There's > > plenty code that implements a test like this all over the place, and > > it usually isn't very safe. There's one implementation in util-linux > > for example (in the /usr/bin/mountpoint binary), and another one in > > systemd. Would be awesome to just have a statx() return flag for that, > > that would make things *so* much easier and more robust. because in > > fact most code isn't very good that implements this, as much of it > > just compares st_dev of the specified file and its parent. Better code > > compares the mount ID, but as mentioned that's not as pretty as it > > could be so far... > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > and ".." and returns true if they have different st_dev or the same > st_ino. Comparing mount ids sounds better. BTW, this traditional st_dev+st_ino way is not reliable for bind mounts. For mountpoint(1) we search the directory in /proc/self/mountinfo. > So anyway, yes, everybody reinvents the wheel here, and this would be > useful. +1 Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 9:17 ` Karel Zak @ 2020-04-06 16:34 ` Linus Torvalds 2020-04-06 18:46 ` J. Bruce Fields 2020-04-06 18:48 ` Lennart Poettering 0 siblings, 2 replies; 99+ messages in thread From: Linus Torvalds @ 2020-04-06 16:34 UTC (permalink / raw) To: Karel Zak Cc: J. Bruce Fields, Lennart Poettering, Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, Linux Kernel Mailing List, Aleksa Sarai On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote: > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote: > > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > > and ".." and returns true if they have different st_dev or the same > > st_ino. Comparing mount ids sounds better. > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts. > For mountpoint(1) we search the directory in /proc/self/mountinfo. These days you should probably use openat2() with RESOLVE_NO_XDEV. No need for any mountinfo or anything like that. Just look up the pathname and say "don't cross mount-points", and you'll get an error if it's a mount crossing lookup. So this kind of thing is _not_ an argument for another kernel querying interface. We got a new (and better) model for a lot of this. Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
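A rough sketch of the openat2() approach suggested above, under stated assumptions: it needs kernel headers that provide `<linux/openat2.h>` and `SYS_openat2`, it calls the syscall directly because a libc wrapper may not be available, and `lookup_crosses_mount` is a hypothetical helper name. Using O_PATH for the lookup is a choice made for this illustration, not something the message above prescribes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/openat2.h>

/*
 * Hypothetical helper: look up `name` relative to directory `dir` with
 * RESOLVE_NO_XDEV set. Returns 1 if the lookup was refused because it
 * would cross a mount (EXDEV), 0 if it stayed on the same mount, and -1
 * on any other error, including kernels without openat2().
 */
int lookup_crosses_mount(const char *dir, const char *name)
{
	struct open_how how;
	int dfd, fd, ret;

	assert(dir && name);

	dfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (dfd == -1)
		return -1;

	memset(&how, 0, sizeof(how));
	how.flags = O_PATH | O_CLOEXEC;
	how.resolve = RESOLVE_NO_XDEV;

	fd = (int)syscall(SYS_openat2, dfd, name, &how, sizeof(how));
	if (fd >= 0) {
		close(fd);
		ret = 0;
	} else {
		/* evaluate errno before close() can overwrite it */
		ret = (errno == EXDEV) ? 1 : -1;
	}

	close(dfd);
	return ret;
}
```

A directory walker could call this for each subdirectory and skip the ones for which it returns 1. Whether a lookup of this kind is cheap enough compared to a statx() flag is a separate question.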
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 16:34 ` Linus Torvalds @ 2020-04-06 18:46 ` J. Bruce Fields 2020-04-06 18:48 ` Lennart Poettering 1 sibling, 0 replies; 99+ messages in thread From: J. Bruce Fields @ 2020-04-06 18:46 UTC (permalink / raw) To: Linus Torvalds Cc: Karel Zak, Lennart Poettering, Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, Linux Kernel Mailing List, Aleksa Sarai On Mon, Apr 06, 2020 at 09:34:08AM -0700, Linus Torvalds wrote: > On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote: > > > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote: > > > > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > > > and ".." and returns true if they have different st_dev or the same > > > st_ino. Comparing mount ids sounds better. > > > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts. > > For mountpoint(1) we search the directory in /proc/self/mountinfo. > > These days you should probably use openat2() with RESOLVE_NO_XDEV. > > No need for any mountinfo or anything like that. Just look up the > pathname and say "don't cross mount-points", and you'll get an error > if it's a mount crossing lookup. OK, I can't see why that wouldn't work, thanks. --b. > > So this kind of thing is _not_ an argument for another kernel querying > interface. We got a new (and better) model for a lot of this. > > Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 16:34 ` Linus Torvalds 2020-04-06 18:46 ` J. Bruce Fields @ 2020-04-06 18:48 ` Lennart Poettering 2020-04-08 3:36 ` Linus Torvalds 1 sibling, 1 reply; 99+ messages in thread From: Lennart Poettering @ 2020-04-06 18:48 UTC (permalink / raw) To: Linus Torvalds Cc: Karel Zak, J. Bruce Fields, Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, Linux Kernel Mailing List, Aleksa Sarai On Mo, 06.04.20 09:34, Linus Torvalds (torvalds@linux-foundation.org) wrote: > On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote: > > > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote: > > > > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > > > and ".." and returns true if they have different st_dev or the same > > > st_ino. Comparing mount ids sounds better. > > > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts. > > For mountpoint(1) we search the directory in /proc/self/mountinfo. > > These days you should probably use openat2() with RESOLVE_NO_XDEV. Note that opening a file is relatively "heavy" i.e. typically triggers autofs and stuff, and results in security checks (which can fail and such, and show up in audit). statx() doesn't do that, and that's explicitly documented (i.e. AT_NO_AUTOMOUNT and stuff). Hence, unless openat2() has some mechanism of doing something like an "open() but not really" (O_PATH isn't really sufficient for this, no?) I don't think it could be a good replacement for a statx() type check if something is a mount point or not. I mean, think about usecases: a common usecase for "is this a mountpoint" checks are tools that traverse directory trees and want to stop at submounts. 
They generally try to minimize operations and hence stat stuff but don't open anything unless it's what they look for (or a subdir they identified as a non-submount). Doing an extra openat2() in between there doesn't sound so attractive, since you pay heavily... Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-06 18:48 ` Lennart Poettering @ 2020-04-08 3:36 ` Linus Torvalds 0 siblings, 0 replies; 99+ messages in thread From: Linus Torvalds @ 2020-04-08 3:36 UTC (permalink / raw) To: Lennart Poettering Cc: Karel Zak, J. Bruce Fields, Miklos Szeredi, Ian Kent, David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, Linux Kernel Mailing List, Aleksa Sarai On Mon, Apr 6, 2020 at 11:48 AM Lennart Poettering <mzxreary@0pointer.de> wrote: > > On Mo, 06.04.20 09:34, Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote: > > > > > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote: > > > > > > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file > > > > and ".." and returns true if they have different st_dev or the same > > > > st_ino. Comparing mount ids sounds better. > > > > > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts. > > > For mountpoint(1) we search the directory in /proc/self/mountinfo. > > > > These days you should probably use openat2() with RESOLVE_NO_XDEV. > > Note that opening a file is relatively "heavy" i.e. typically triggers > autofs and stuff, and results in security checks (which can fail and > such, and show up in audit). For the use that Bruce outlined, openat2() with RESOLVE_NO_XDEV is absolutely the right thing. He already did the stat() of the file (and ".."), RESOLVE_NO_XDEV is only an improvement. It's also a lot better than trying to parse mountinfo. Now, I don't disagree that a statx() flag to also indicate "that's a top-level mount" might be a good idea, and may be the right answer for other cases. 
I'm just saying that considering what Bruce does now, RESOLVE_NO_XDEV sounds like the no-brainer approach, and needs no new support outside of what we already had for other reasons. (And O_PATH _may_ or may not be part of what you want to do, it's an independent separate issue, but automount behavior wrt an O_PATH lookup is somewhat unclear - see Al's other emails on that subject) Linus ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 11:38 ` Miklos Szeredi 2020-04-03 12:05 ` Richard Weinberger 2020-04-03 15:12 ` Lennart Poettering @ 2020-04-03 15:36 ` David Howells 2020-04-03 15:41 ` Lennart Poettering 2 siblings, 1 reply; 99+ messages in thread From: David Howells @ 2020-04-03 15:36 UTC (permalink / raw) To: Lennart Poettering Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai Lennart Poettering <mzxreary@0pointer.de> wrote: > BTW, while we are at it: one more thing I'd love to see exposed by > statx() is a simple flag whether the inode is a mount point. Note that an inode or a dentry might be a mount point in one namespace, but not in another. Do you actually mean an inode - or do you actually mean the (mount,dentry) pair that you're looking at? (Ie. should it be namespace specific?) David ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: Upcoming: Notifications, FS notifications and fsinfo() 2020-04-03 15:36 ` David Howells @ 2020-04-03 15:41 ` Lennart Poettering 0 siblings, 0 replies; 99+ messages in thread From: Lennart Poettering @ 2020-04-03 15:41 UTC (permalink / raw) To: David Howells Cc: Miklos Szeredi, Ian Kent, Christian Brauner, Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai On Fr, 03.04.20 16:36, David Howells (dhowells@redhat.com) wrote: > Lennart Poettering <mzxreary@0pointer.de> wrote: > > > BTW, while we are at it: one more thing I'd love to see exposed by > > statx() is a simple flag whether the inode is a mount point. > > Note that an inode or a dentry might be a mount point in one namespace, but > not in another. Do you actually mean an inode - or do you actually mean the > (mount,dentry) pair that you're looking at? (Ie. should it be namespace > specific?) yes, it should be specific to the mount hierarchy in the current namespace. Lennart -- Lennart Poettering, Berlin ^ permalink raw reply [flat|nested] 99+ messages in thread
* [GIT PULL] General notification queue and key notifications
@ 2020-06-02 15:51 David Howells
  2020-06-02 15:54 ` David Howells
  0 siblings, 1 reply; 99+ messages in thread
From: David Howells @ 2020-06-02 15:51 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.

Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:

	https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

Without this, g-o-a has to constantly poll a keyring-based kerberos cache
to find out if kinit has changed anything.

[[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
   Ian Kent have been working on making libmount use them, preparatory to
   working on systemd:

	https://github.com/karelzak/util-linux/commits/topic/fsinfo
	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public

   Development has stalled briefly due to other commitments, so I'm not
   sure I can ask you to pull those parts of the series for now.  Christian
   Brauner would like to use them in lxc, but hasn't started. ]]

LSM hooks are included:

 (1) A set of hooks is provided that allow an LSM to rule on whether or not
     a watch may be set.  Each of these hooks takes a different "watched
     object" parameter, so they're not really shareable.  The LSM should
     use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop.  This facility will allow the desktop to avoid the need to
poll.


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured:

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in each case, fd indicates the queue and the number after is a
     tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyring, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David

^ permalink raw reply [flat|nested] 99+ messages in thread
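Design decision (5) above says that records are binary, typed and carry their own length, so a reader would walk whatever read() returns one record at a time. The following is a sketch under stated assumptions: the header layout used here (24-bit type, 8-bit subtype, record length in the low 7 bits of an info word, watch tag in bits 8-15) is assumed for illustration and is not copied from the kernel headers, and struct note_hdr, NOTE_LEN(), NOTE_TAG() and walk_notes() are hypothetical names. The real definitions should be taken from the watch queue UAPI header of the kernel in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative record header; layout assumed, NOT the kernel's ABI. */
struct note_hdr {
    uint32_t type_subtype;  /* low 24 bits: type, high 8 bits: subtype */
    uint32_t info;          /* low 7 bits: record length in bytes,
                               bits 8-15: tag given when the watch was set */
};

#define NOTE_LEN(info)  ((size_t)((info) & 0x7fu))
#define NOTE_TAG(info)  (((info) >> 8) & 0xffu)

typedef void (*note_cb)(uint32_t type, uint32_t subtype, uint32_t tag,
                        const unsigned char *payload, size_t payload_len,
                        void *ctx);

/*
 * Walk the variable-length records in buf[0..len).  Calls cb (if not
 * NULL) once per record.  Returns the number of records, or -1 if a
 * record is truncated or declares a length smaller than its header.
 */
static int walk_notes(const unsigned char *buf, size_t len,
                      note_cb cb, void *ctx)
{
    size_t off = 0;
    int count = 0;

    assert(buf != NULL || len == 0);

    while (off < len) {
        struct note_hdr h;
        size_t rec;

        if (len - off < sizeof(h))
            return -1;                      /* truncated header */
        memcpy(&h, buf + off, sizeof(h));   /* avoid unaligned access */

        rec = NOTE_LEN(h.info);
        if (rec < sizeof(h) || rec > len - off)
            return -1;                      /* bogus or truncated record */

        if (cb)
            cb(h.type_subtype & 0xffffffu, h.type_subtype >> 24,
               NOTE_TAG(h.info), buf + off + sizeof(h),
               rec - sizeof(h), ctx);

        off += rec;                         /* length includes the header */
        count++;
    }
    return count;
}
```

Because each record states its own length, a consumer that does not recognise a type can skip it and carry on with the next record, which is what lets heterogeneous sources share one queue. The tag lets the same consumer tell apart watches that deliver the same record type.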
* Re: [GIT PULL] General notification queue and key notifications 2020-06-02 15:51 [GIT PULL] General notification queue and key notifications David Howells @ 2020-06-02 15:54 ` David Howells 0 siblings, 0 replies; 99+ messages in thread From: David Howells @ 2020-06-02 15:54 UTC (permalink / raw) Cc: dhowells, torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven, andres, christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel Oops - I forgot to include the pull request. Will resend. David ^ permalink raw reply [flat|nested] 99+ messages in thread