keyrings.vger.kernel.org archive mirror
* Upcoming: Notifications, FS notifications and fsinfo()
@ 2020-03-30 13:58 David Howells
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
                   ` (6 more replies)
  0 siblings, 7 replies; 99+ messages in thread
From: David Howells @ 2020-03-30 13:58 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel



Hi Linus,

I have three sets of patches I'd like to push your way, if you (and Al) are
willing to consider them.

 (1) General notification queue plus key/keyring notifications.

     This adds the core of the notification queue built on pipes, and adds
     the ability to watch for changes to keys.

 (2) Mount and superblock notifications.

     This builds on (1) to provide notifications of mount topology changes
     and implements a framework for superblock events (configuration
     changes, I/O errors, quota/space overruns and network status changes).

 (3) Filesystem information retrieval.

     This provides an extensible way to retrieve informational attributes
     about mount objects and filesystems.  This includes providing
     information intended to make recovering from a notification queue
     overrun much easier.

We need (1) for Gnome to efficiently watch for changes in kerberos
keyrings.  Debarshi Ray has patches ready to go for gnome-online-accounts
so that it can make use of the facility.

Sets (2) and (3) can make libmount more efficient.  Karel Zak is working on
making use of this to avoid reading /proc/mountinfo.

We need something to make systemd's watching of the mount topology more
efficient, and (2) and (3) can help with this by making it faster to narrow
down what changed.  I think Karel has this in his sights, but hasn't yet
managed to work on it.

Set (2) should be able to make it easier to watch for mount options inside
a container, and set (3) should make it easier to examine the mounts inside
another mount namespace inside a container in a way that can't be done with
/proc/mounts.  This is requested by Christian Brauner.

Jeff Layton has a tentative addition to (3) to expose error state to
userspace, and Andres Freund would like this for Postgres.

Set (3) further allows the information returned by calls such as statx()
and ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits are and
aren't supported.

Further, for (3), I also allow filesystem-specific overrides/extensions to
fsinfo() and have a use for it in AFS to expose information about server
preference for a particular volume (something that is necessary for
implementing the toolset).  I've provided example code that does something
similar for NFS and some that exposes superblock info from Ext4.  At Vault,
Steve expressed an interest in this for CIFS and Ted Ts'o expressed a
possible interest for Ext4.

Notes:

 (*) These patches will conflict with apparently upcoming refactoring of
     the security core, but the fixup doesn't look too bad:

	https://lore.kernel.org/linux-next/20200330130636.0846e394@canb.auug.org.au/T/#u

 (*) Miklós Szeredi would much prefer to implement fsinfo() as a magic
     filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear
     as directories under there, each with a set of attribute files
     corresponding to the attributes that fsinfo() would otherwise provide.
     To examine something by filename, you'd have to open it O_PATH and
     then read the individual attribute files in the corresponding per-fd
     directory.  A readfile() system call has been mooted to elide the
     {open,read,close} sequence to make it more efficient.

 (*) James Bottomley would like to deprecate fsopen(), fspick(), fsconfig()
     and fsmount() in favour of a more generic configfs with dedicated
     open, set-config and action syscalls, with an additional get-config
     syscall that would be used instead of fsinfo() - though, as I
     understand it, you'd have to create a config (fspick-equivalent)
     before you could use get-config.

 (*) I don't think Al has particularly looked at fsinfo() or the fs
     notifications patches yet.

 (*) I'm not sure what *your* opinion of fsinfo() is yet.  If you don't
     dislike it too, um, fragrantly, would you be willing to entertain part
     of it for now and prefer the rest to stew a bit longer?  I can drop
     some of the pieces.

Anyway, I'm going to formulate a pull request for each of them.

Thanks,
David


* [GIT PULL] General notification queue and key notifications
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
@ 2020-03-30 14:31 ` David Howells
  2020-03-31  6:51   ` Stephen Rothwell
  2020-06-02 15:55   ` David Howells
  2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-03-30 14:31 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.  A subsequent pull request will add
mount and superblock event sources.

LSM hooks are included:

 (1) A set of hooks is provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop.

This source will allow the desktop to avoid the need to poll.  Here's a
pull request for usage by gnome-online-accounts:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
     	 O_* flag if I can avoid it - should I add a pipe3() system call
     	 instead?

     The pipe is then configured:

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in each case the fd argument (fds[1] or fd above) indicates the
     queue and the number after it is a tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.
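
As a rough illustration of points (5), (6) and (9) above, here is a minimal
sketch of walking a buffer of typed, length-prefixed records.  The header
layout used here (24-bit type, 8-bit subtype, a 7-bit length and an 8-bit tag
packed into a second word) and all the demo_* names are assumptions made for
the example, not the definitions from the uapi header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the second header word (illustration only). */
#define DEMO_INFO_LENGTH	0x0000007fu	/* whole record length in bytes */
#define DEMO_INFO_TAG_SHIFT	8		/* tag given when the watch was set */

/* Decoded view of one record header. */
struct demo_record {
	uint32_t type;		/* 24 bits: about 16 million types */
	uint32_t subtype;	/* 8 bits: up to 256 subtypes per type */
	uint32_t tag;		/* 0-255, distinguishes the sources */
	size_t len;		/* record length including the header */
};

/* Decode the record at buf[off]; return its length, or 0 if it is
 * truncated or malformed. */
static size_t demo_decode(const unsigned char *buf, size_t len, size_t off,
			  struct demo_record *out)
{
	uint32_t hdr[2];
	size_t rec_len;

	if (off > len || len - off < sizeof(hdr))
		return 0;
	memcpy(hdr, buf + off, sizeof(hdr));
	rec_len = hdr[1] & DEMO_INFO_LENGTH;
	if (rec_len < sizeof(hdr) || rec_len > len - off)
		return 0;
	out->type = hdr[0] & 0x00ffffffu;
	out->subtype = hdr[0] >> 24;
	out->tag = (hdr[1] >> DEMO_INFO_TAG_SHIFT) & 0xffu;
	out->len = rec_len;
	return rec_len;
}

/* Build a zero-filled record of rec_len bytes at buf (8 <= rec_len <= 127).
 * Only used to fabricate input for the decoder. */
static size_t demo_put(unsigned char *buf, uint32_t type, uint32_t subtype,
		       uint32_t tag, size_t rec_len)
{
	uint32_t hdr[2];

	assert(rec_len >= sizeof(hdr) && rec_len <= DEMO_INFO_LENGTH);
	hdr[0] = (type & 0x00ffffffu) | ((subtype & 0xffu) << 24);
	hdr[1] = (uint32_t)rec_len | ((tag & 0xffu) << DEMO_INFO_TAG_SHIFT);
	memset(buf, 0, rec_len);
	memcpy(buf, hdr, sizeof(hdr));
	return rec_len;
}
```

A reader would call demo_decode() in a loop over whatever read() returned,
advancing by the returned length each time and dispatching on the type,
subtype and tag.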


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David
---
The following changes since commit f8788d86ab28f61f7b46eb6be375f8a726783636:

  Linux 5.6-rc3 (2020-02-23 16:17:42 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200330

for you to fetch changes up to 694435dbde3d1da79aafaf4cd680802f9eb229b7:

  smack: Implement the watch_key and post_notification hooks (2020-03-19 17:31:09 +0000)

----------------------------------------------------------------
Notifications over pipes

----------------------------------------------------------------
David Howells (11):
      uapi: General notification queue definitions
      security: Add hooks to rule on setting a watch
      security: Add a hook for the point of notification insertion
      pipe: Add O_NOTIFICATION_PIPE
      pipe: Add general notification queue support
      watch_queue: Add a key/keyring notification facility
      Add sample notification program
      pipe: Allow buffers to be marked read-whole-or-error for notifications
      pipe: Add notification lossage handling
      selinux: Implement the watch_key security hook
      smack: Implement the watch_key and post_notification hooks

 Documentation/security/keys/core.rst               |  57 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/watch_queue.rst                      | 339 +++++++++++
 fs/pipe.c                                          | 242 +++++---
 fs/splice.c                                        |  12 +-
 include/linux/key.h                                |   3 +
 include/linux/lsm_audit.h                          |   1 +
 include/linux/lsm_hooks.h                          |  38 ++
 include/linux/pipe_fs_i.h                          |  27 +-
 include/linux/security.h                           |  31 +
 include/linux/watch_queue.h                        | 127 ++++
 include/uapi/linux/keyctl.h                        |   2 +
 include/uapi/linux/watch_queue.h                   | 104 ++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/watch_queue.c                               | 659 +++++++++++++++++++++
 samples/Kconfig                                    |   6 +
 samples/Makefile                                   |   1 +
 samples/watch_queue/Makefile                       |   7 +
 samples/watch_queue/watch_test.c                   | 186 ++++++
 security/keys/Kconfig                              |   9 +
 security/keys/compat.c                             |   3 +
 security/keys/gc.c                                 |   5 +
 security/keys/internal.h                           |  30 +-
 security/keys/key.c                                |  38 +-
 security/keys/keyctl.c                             |  99 +++-
 security/keys/keyring.c                            |  20 +-
 security/keys/request_key.c                        |   4 +-
 security/security.c                                |  23 +
 security/selinux/hooks.c                           |  14 +
 security/smack/smack_lsm.c                         |  83 ++-
 31 files changed, 2079 insertions(+), 105 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 kernel/watch_queue.c
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c


* [GIT PULL] Mount and superblock notifications
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
@ 2020-03-30 14:36 ` David Howells
  2020-04-04 21:13   ` Linus Torvalds
  2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 99+ messages in thread
From: David Howells @ 2020-03-30 14:36 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel

Hi Linus,

If you could consider pulling this - or would you prefer it to go through
Al?  It adds a couple of VFS-related event sources for the general
notification mechanism:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).


WHY
===

 (1) Mount notifications.

     This one is wanted to avoid repeated trawling of /proc/mounts or
     similar to work out changes to the mount object attributes and mount
     topology.  I'm told that the proc file holding the namespace_sem is a
     point of contention, especially as the process of generating the text
     descriptions of the mounts/superblocks can be quite involved.

     Whilst you can use poll() on /proc/mounts, it doesn't give you any
     clues as to what changed.  The notification generated here directly
     indicates the mounts involved in any particular event and gives an
     idea of what the change was.

     This is combined with a new fsinfo() system call that allows, amongst
     other things, the ability to retrieve in one go an { id,
     change_counter } tuple from all the children of a specified mount,
     allowing buffer overruns to be dealt with quickly.

     This can be used by systemd to improve efficiency:

	https://lore.kernel.org/linux-fsdevel/20200227151421.3u74ijhqt6ekbiss@ws.net.home/

     And it's not just Red Hat that's potentially interested in this:

	https://lore.kernel.org/linux-fsdevel/293c9bd3-f530-d75e-c353-ddeabac27cf6@6wind.com/

     Also, this can be used to improve management of containers by allowing
     watches to be set in foreign mount namespaces, such as are in a
     container.

 (2) Superblock notifications.

     This one is provided to allow systemd or the desktop to more easily
     detect events such as I/O errors and EDQUOT/ENOSPC.  This would be of
     interest to Postgres:

	https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/

     But could also be used to indicate to systemd when a superblock has
     had its configuration changed.
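
To illustrate how the { id, change_counter } tuples mentioned in (1) could be
used after a queue overrun, here is a minimal sketch that compares a saved
snapshot against a freshly retrieved one.  The structure and function names
are hypothetical, and both arrays are assumed to be sorted by ascending ID:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mount snapshot entry. */
struct demo_mount_state {
	uint64_t id;
	uint64_t change_counter;
};

/* Compare two snapshots (each sorted by ascending id).  Store the IDs of
 * mounts that were added, removed or whose counter moved into changed[]
 * (at most max entries) and return how many such mounts were found. */
static size_t demo_diff_mounts(const struct demo_mount_state *old, size_t n_old,
			       const struct demo_mount_state *cur, size_t n_cur,
			       uint64_t *changed, size_t max)
{
	size_t i = 0, j = 0, n = 0;

	assert(changed || max == 0);
	while (i < n_old || j < n_cur) {
		uint64_t id;

		if (j == n_cur || (i < n_old && old[i].id < cur[j].id)) {
			id = old[i++].id;		/* mount went away */
		} else if (i == n_old || cur[j].id < old[i].id) {
			id = cur[j++].id;		/* mount appeared */
		} else {
			int same = old[i].change_counter == cur[j].change_counter;

			id = old[i].id;
			i++;
			j++;
			if (same)
				continue;		/* nothing missed here */
		}
		if (n < max)
			changed[n] = id;
		n++;
	}
	return n;
}
```

Only the mounts reported here would then need to be re-queried in detail,
rather than rescanning the whole tree.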

Thanks,
David
---
The following changes since commit 694435dbde3d1da79aafaf4cd680802f9eb229b7:

  smack: Implement the watch_key and post_notification hooks (2020-03-19 17:31:09 +0000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-fs-20200330

for you to fetch changes up to 8dbf1aa122da5bbb4ede0f363a8a18dfc723be33:

  watch_queue: sample: Display superblock notifications (2020-03-19 17:31:09 +0000)

----------------------------------------------------------------
Filesystem notifications

----------------------------------------------------------------
David Howells (6):
      watch_queue: Add security hooks to rule on setting mount and sb watches
      watch_queue: Implement mount topology and attribute change notifications
      watch_queue: sample: Display mount tree change notifications
      watch_queue: Introduce a non-repeating system-unique superblock ID
      watch_queue: Add superblock notifications
      watch_queue: sample: Display superblock notifications

 Documentation/watch_queue.rst               |  24 ++-
 arch/alpha/kernel/syscalls/syscall.tbl      |   2 +
 arch/arm/tools/syscall.tbl                  |   2 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   4 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   2 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   2 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   2 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   2 +
 arch/s390/kernel/syscalls/syscall.tbl       |   2 +
 arch/sh/kernel/syscalls/syscall.tbl         |   2 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   2 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   2 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   2 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   2 +
 fs/Kconfig                                  |  21 +++
 fs/Makefile                                 |   1 +
 fs/internal.h                               |   1 +
 fs/mount.h                                  |  21 +++
 fs/mount_notify.c                           | 228 ++++++++++++++++++++++++++++
 fs/namespace.c                              |  22 +++
 fs/super.c                                  | 205 +++++++++++++++++++++++++
 include/linux/dcache.h                      |   1 +
 include/linux/fs.h                          |  62 ++++++++
 include/linux/lsm_hooks.h                   |  24 +++
 include/linux/security.h                    |  16 ++
 include/linux/syscalls.h                    |   4 +
 include/uapi/asm-generic/unistd.h           |   6 +-
 include/uapi/linux/watch_queue.h            |  65 +++++++-
 kernel/sys_ni.c                             |   6 +
 samples/watch_queue/watch_test.c            |  81 +++++++++-
 security/security.c                         |  14 ++
 36 files changed, 835 insertions(+), 5 deletions(-)
 create mode 100644 fs/mount_notify.c


* [GIT PULL] fsinfo: Filesystem information query
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
  2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
@ 2020-03-30 14:43 ` David Howells
  2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-03-30 14:43 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, keyrings, linux-fsdevel, linux-kernel

Hi Linus,

If you could consider pulling this - or would you prefer it to go through
Al?  It adds a system call, fsinfo(), that allows information about the
VFS, mount topology, superblock and files to be retrieved.

This is based on top of the mount and superblock notifications patchset and
allows event counters implemented in the latter to be retrieved to allow
overruns to be efficiently managed.

I've excluded the Ext4 and NFS sample code from this pull request that I've
previously included in my posted patchsets.


=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, then it may return to userspace
     briefly and then continue reading some way into the file.  But changes
     can occur in the interval that may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through
     /proc/mounts and searching by path or mount ID - but we might have an
     fd we want to query.

 (5) Mount topology is not explicit.  One must derive it manually by
     comparing entries.

 (6) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what or whether you can even see the
     change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     capabilities may be server-dependent.  Capabilities might even vary at file
     granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query state of a particular file.
     This could also have application for other filesystems, such as NFS,
     CIFS and ext4.

 [*] In a lot of cases these are probably fixed and can be memcpy'd from
     static data.
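
As an illustration of the 3-state idea in (2) above, a value mask can be
paired with a "supported" mask so that a clear bit can be told apart from an
unsupported one.  This is only a sketch; the enum and function names are made
up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* The three states a flag bit can be in. */
enum demo_tristate {
	DEMO_NOT_SUPPORTED,
	DEMO_UNSET,
	DEMO_SET,
};

/* Classify one bit given the reported values and the mask of bits that the
 * filesystem actually supports. */
static enum demo_tristate demo_bit_state(uint64_t values, uint64_t supported,
					 uint64_t bit)
{
	assert(bit != 0);
	if (!(supported & bit))
		return DEMO_NOT_SUPPORTED;
	return (values & bit) ? DEMO_SET : DEMO_UNSET;
}
```

Without the second mask, a caller cannot distinguish "this flag is clear"
from "this filesystem has no notion of this flag".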

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.


================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  The type is fixed for a particular attribute.

 (4) For structure types, the length is also a version.  New fields can be
     tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     None of this is seen by the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.


========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.

Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible: attribute IDs are
integers rather than strings, and the core does all the buffer allocation
and expansion and all the extensibility support work rather than leaving
that to the filesystems.  This means that userspace pointers are not
exposed to the filesystem.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  A helper function is provided
     to scan a list of attributes and a filesystem can have multiple such
     lists.

 (2) Structure attributes with version-dependent length can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.
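
The truncate-or-zero-pad rule in (2) can be sketched as below.  This is a
userspace illustration of the behaviour described, not the code from the
patchset, and demo_copy_versioned() and the two demo structs are hypothetical
names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two "versions" of one attribute struct; v2 only appends a field. */
struct demo_attr_v1 {
	unsigned int a;
};

struct demo_attr_v2 {
	unsigned int a;
	unsigned int b;
};

/* Copy the producer's struct (src, src_len) into the caller's buffer
 * (dst, dst_len).  A shorter destination gets a truncated copy; a longer
 * one has its tail cleared.  The size actually available is returned in
 * every case. */
static size_t demo_copy_versioned(void *dst, size_t dst_len,
				  const void *src, size_t src_len)
{
	size_t n = dst_len < src_len ? dst_len : src_len;

	assert(dst && src);
	memcpy(dst, src, n);
	if (dst_len > n)
		memset((char *)dst + n, 0, dst_len - n);
	return src_len;
}
```

A caller can compare the returned size with the size it asked for to work out
which fields the producer really filled in.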

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
		     server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.

Thanks,
David
---
The following changes since commit 8dbf1aa122da5bbb4ede0f363a8a18dfc723be33:

  watch_queue: sample: Display superblock notifications (2020-03-19 17:31:09 +0000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/fsinfo-20200330

for you to fetch changes up to 80f0ef902951fd3e7446767e7d37d543d936b72f:

  fsinfo: Add support for AFS (2020-03-19 17:31:34 +0000)

----------------------------------------------------------------
Filesystem information

----------------------------------------------------------------
David Howells (11):
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow mount topology and propagation info to be retrieved
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS

 Documentation/filesystems/fsinfo.rst        | 574 +++++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/Kconfig                                  |   7 +
 fs/Makefile                                 |   1 +
 fs/afs/internal.h                           |   1 +
 fs/afs/super.c                              | 218 +++++++-
 fs/d_path.c                                 |   2 +-
 fs/fsinfo.c                                 | 725 ++++++++++++++++++++++++
 fs/internal.h                               |  14 +
 fs/mount.h                                  |   3 +
 fs/mount_notify.c                           |   2 +
 fs/namespace.c                              | 389 ++++++++++++-
 include/linux/fs.h                          |   4 +
 include/linux/fsinfo.h                      | 111 ++++
 include/linux/syscalls.h                    |   4 +
 include/uapi/asm-generic/unistd.h           |   4 +-
 include/uapi/linux/fsinfo.h                 | 326 +++++++++++
 include/uapi/linux/mount.h                  |  10 +-
 kernel/sys_ni.c                             |   1 +
 samples/vfs/Makefile                        |   7 +
 samples/vfs/test-fsinfo.c                   | 818 ++++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  | 279 ++++++++++
 39 files changed, 3511 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
                   ` (2 preceding siblings ...)
  2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells
@ 2020-03-30 20:28 ` Miklos Szeredi
  2020-03-31  9:21   ` Karel Zak
  2020-03-30 21:17 ` Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-30 20:28 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, keyrings, linux-fsdevel, linux-kernel

On Mon, Mar 30, 2020 at 3:58 PM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Linus,
>
> I have three sets of patches I'd like to push your way, if you (and Al) are
> willing to consider them.

The basic problem, in my view, is that the performance requirement of a
"get filesystem information" type of API just does not warrant a
binary coded interface. I've said this a number of times, but it fell
on deaf ears.

Such binary ABIs (especially if not very carefully designed and
reviewed) usually go through several revisions as the structure fails
to account for future changes in the representation of those structure
fields.   There are too many examples of this to count.   Then there's
the problem of needing to update libc, utilities and language bindings
on each revision or extension of the interface.

All this could be solved with a string key/value representation of the
same data, with minimal performance loss on encoding/parsing.  The
proposed fs interface[1] is one example of that, but I could also
imagine a syscall based one too.
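
To make the parsing cost concrete, here is a minimal sketch of what a
consumer of such a representation might do.  The newline-separated
"key=value" layout, the kv_lookup() name and the example keys are all
assumptions made for illustration; neither proposal in this thread defines
this exact format.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical parser, for illustration only.  Look up `key` in a buffer of
 * newline-separated "key=value" records and copy the value into `out` as a
 * NUL-terminated string.  Returns the value length, or -1 if the key is
 * missing or the value does not fit in `outsz` bytes.
 */
int kv_lookup(const char *buf, const char *key, char *out, size_t outsz)
{
	size_t klen = strlen(key);
	const char *p = buf;

	while (*p) {
		const char *eol = strchr(p, '\n');
		size_t linelen = eol ? (size_t)(eol - p) : strlen(p);

		/* Match only a whole key followed directly by '='. */
		if (linelen > klen && strncmp(p, key, klen) == 0 &&
		    p[klen] == '=') {
			size_t vlen = linelen - klen - 1;

			if (vlen >= outsz)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return (int)vlen;
		}

		/* Skip to the start of the next record. */
		p += linelen;
		if (*p == '\n')
			p++;
	}
	return -1;
}
```

An unknown key simply fails the lookup, so a newer kernel can add records
without breaking an older parser of this kind.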

Thanks,
Miklos

[1] https://lore.kernel.org/linux-fsdevel/20200309200238.GB28467@miu.piliscsaba.redhat.com/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
                   ` (3 preceding siblings ...)
  2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi
@ 2020-03-30 21:17 ` Christian Brauner
  2020-03-31  5:11   ` Miklos Szeredi
  2020-03-31  7:22   ` Lennart Poettering
  2020-03-31 17:31 ` David Howells
  2020-03-31 21:52 ` David Howells
  6 siblings, 2 replies; 99+ messages in thread
From: Christian Brauner @ 2020-03-30 21:17 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, keyrings, linux-fsdevel, linux-kernel, lennart, cyphar

[Cc Lennart and Aleksa, both of whom maintain projects that would also
make use of this]

On Mon, Mar 30, 2020 at 02:58:22PM +0100, David Howells wrote:
> 
> Hi Linus,
> 
> I have three sets of patches I'd like to push your way, if you (and Al) are
> willing to consider them.
> 
>  (1) General notification queue plus key/keyring notifications.
> 
>      This adds the core of the notification queue built on pipes, and adds
>      the ability to watch for changes to keys.
> 
>  (2) Mount and superblock notifications.
> 
>      This builds on (1) to provide notifications of mount topology changes
>      and implements a framework for superblock events (configuration
>      changes, I/O errors, quota/space overruns and network status changes).
> 
>  (3) Filesystem information retrieval.
> 
>      This provides an extensible way to retrieve informational attributes
>      about mount objects and filesystems.  This includes providing
>      information intended to make recovering from a notification queue
>      overrun much easier.
> 
> We need (1) for Gnome to efficiently watch for changes in kerberos
> keyrings.  Debarshi Ray has patches ready to go for gnome-online-accounts
> so that it can make use of the facility.
> 
> Sets (2) and (3) can make libmount more efficient.  Karel Zak is working on
> making use of this to avoid reading /proc/mountinfo.
> 
> We need something to make systemd's watching of the mount topology more
> efficient, and (2) and (3) can help with this by making it faster to narrow
> down what changed.  I think Karel has this in his sights, but hasn't yet
> managed to work on it.
> 
> Set (2) should be able to make it easier to watch for mount options inside
> a container, and set (3) should make it easier to examine the mounts inside
> another mount namespace inside a container in a way that can't be done with
> /proc/mounts.  This is requested by Christian Brauner.
> 
> Jeff Layton has a tentative addition to (3) to expose error state to
> userspace, and Andres Freund would like this for Postgres.
> 
> Set (3) further allows the information returned by such as statx() and
> ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits are/aren't
> supported.
> 
> Further, for (3), I also allow filesystem-specific overrides/extensions to
> fsinfo() and have a use for it to AFS to expose information about server
> preference for a particular volume (something that is necessary for
> implementing the toolset).  I've provided example code that does similar
> for NFS and some that exposes superblock info from Ext4.  At Vault, Steve
> expressed an interest in this for CIFS and Ted Ts'o expressed a possible
> interest for Ext4.
> 
> Notes:
> 
>  (*) These patches will conflict with apparently upcoming refactoring of
>      the security core, but the fixup doesn't look too bad:
> 
> 	https://lore.kernel.org/linux-next/20200330130636.0846e394@canb.auug.org.au/T/#u
> 
>  (*) Miklós Szeredi would much prefer to implement fsinfo() as a magic
>      filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear
>      as directories under there, each with a set of attribute files
>      corresponding to the attributes that fsinfo() would otherwise provide.
>      To examine something by filename, you'd have to open it O_PATH and
>      then read the individual attribute files in the corresponding per-fd
>      directory.  A readfile() system call has been mooted to elide the
>      {open,read,close} sequence to make it more efficient.

Fwiw, putting down my kernel hat and speaking as someone who maintains
two container runtimes and various other low-level bits and pieces in
userspace who'd make heavy use of this stuff I would prefer the fd-based
fsinfo() approach especially in the light of across namespace
operations, querying all properties of a mount atomically all-at-once,
and safe delegation through fds. Another heavy user of this would be
systemd (Cced Lennart who I've discussed this with) which would prefer
the fd-based approach as well. I think pulling this into a filesystem
and making userspace parse around in a filesystem tree to query mount
information is the wrong approach and will get messy pretty quickly
especially in the face of mount and user namespace interactions and
various other pitfalls. fsinfo() fits quite nicely with the all-fd-based
approach of the whole mount api. So yes, definitely preferred from my
end.

Christian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 21:17 ` Christian Brauner
@ 2020-03-31  5:11   ` Miklos Szeredi
  2020-03-31  8:15     ` Christian Brauner
                       ` (2 more replies)
  2020-03-31  7:22   ` Lennart Poettering
  1 sibling, 3 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-31  5:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:

> Fwiw, putting down my kernel hat and speaking as someone who maintains
> two container runtimes and various other low-level bits and pieces in
> userspace who'd make heavy use of this stuff I would prefer the fd-based
> fsinfo() approach especially in the light of across namespace
> operations, querying all properties of a mount atomically all-at-once,

fsinfo(2) doesn't meet the atomically all-at-once requirement.  Sure,
it's possible to check the various change counters before and after a
batch of calls to check that the result is consistent.  Still, that's
not an atomic all-at-once query; if you'd really require that, then
fsinfo(2) as it currently stands would be inadequate.
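
The check-before-and-after pattern amounts to an optimistic retry loop,
roughly as sketched below.  This is only an illustration: the callbacks
stand in for whatever fsinfo() calls would read a change counter and fetch
the attributes, and struct fake_mount merely simulates a mount that changes
underneath the reader.  None of these names come from the patches.

```c
#include <stdint.h>

typedef uint64_t (*counter_fn)(void *ctx);
typedef int (*batch_fn)(void *ctx);

/*
 * Run a batch of queries and accept the result only if the change counter
 * was the same before and after.  Returns 0 on a consistent read, a
 * negative value on a query error or after max_tries inconsistent attempts.
 */
int query_consistent(counter_fn read_counter, batch_fn run_batch,
		     void *ctx, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		uint64_t before = read_counter(ctx);
		int ret = run_batch(ctx);

		if (ret < 0)
			return ret;
		if (read_counter(ctx) == before)
			return 0;	/* nothing changed mid-batch */
	}
	return -1;			/* kept racing with updates */
}

/* Simulated mount: each batch races with one update while any remain. */
struct fake_mount {
	uint64_t change_counter;
	int racing_updates;
};

uint64_t fake_read_counter(void *ctx)
{
	return ((struct fake_mount *)ctx)->change_counter;
}

int fake_batch(void *ctx)
{
	struct fake_mount *m = ctx;

	if (m->racing_updates > 0) {
		m->racing_updates--;
		m->change_counter++;
	}
	return 0;
}
```

The loop detects an inconsistent batch and retries; it does not make the
batch itself atomic, which is the distinction being drawn here.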

> and safe delegation through fds. Another heavy user of this would be
> systemd (Cced Lennart who I've discussed this with) which would prefer
> the fd-based approach as well. I think pulling this into a filesystem
> and making userspace parse around in a filesystem tree to query mount
> information is the wrong approach and will get messy pretty quickly
> especially in the face of mount and user namespace interactions and
> various other pitfalls.

Have you actually looked at my proposed patch?   Do you have concrete
issues or just vague bad feelings?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
@ 2020-03-31  6:51   ` Stephen Rothwell
  2020-06-02 15:55   ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Stephen Rothwell @ 2020-03-31  6:51 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2288 bytes --]

Hi David,

On Mon, 30 Mar 2020 15:31:04 +0100 David Howells <dhowells@redhat.com> wrote:
>
>       pipe: Add general notification queue support

This commit has a (reasonably simple) conflict against commit

  6551d5c56eb0 ("pipe: make sure to wake up everybody when the last reader/writer closes")

from Linus' tree.

Also a semantic conflict against commit

  52b31bc9aabc ("io_uring: add splice(2) support")

from the block tree needing this fix up (white space damaged)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fb8fe0bd5e18..8cdd3870cd4e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2470,7 +2470,7 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static bool io_splice_punt(struct file *file)
 {
-	if (get_pipe_info(file))
+	if (get_pipe_info(file, true))
 		return false;
 	if (!io_file_supports_async(file))
 		return true;
>	security: Add hooks to rule on setting a watch
>	security: Add a hook for the point of notification insertion

And these have a conflict in include/linux/lsm_hooks.h against commit

  98e828a0650f ("security: Refactor declaration of LSM hooks")

from the bpf-next tree (will be in the net-next tree pull).  That
requires taking the net-next version of include/linux/lsm_hooks.h and
then applying the following patch:

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 9cd4455528e5..4f8d63fd1327 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -252,6 +252,16 @@ LSM_HOOK(int, 0, inode_notifysecctx, struct inode *inode, void *ctx, u32 ctxlen)
 LSM_HOOK(int, 0, inode_setsecctx, struct dentry *dentry, void *ctx, u32 ctxlen)
 LSM_HOOK(int, 0, inode_getsecctx, struct inode *inode, void **ctx,
 	 u32 *ctxlen)
+#ifdef CONFIG_KEY_NOTIFICATIONS
+LSM_HOOK(int, 0, watch_key, struct key *key)
+#endif
+#ifdef CONFIG_DEVICE_NOTIFICATIONS
+LSM_HOOK(int, 0, watch_devices, void)
+#endif
+#ifdef CONFIG_WATCH_QUEUE
+LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
+	 const struct cred *cred, struct watch_notification *n)
+#endif
 
 #ifdef CONFIG_SECURITY_NETWORK
 LSM_HOOK(int, 0, unix_stream_connect, struct sock *sock, struct sock *other,

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 21:17 ` Christian Brauner
  2020-03-31  5:11   ` Miklos Szeredi
@ 2020-03-31  7:22   ` Lennart Poettering
  1 sibling, 0 replies; 99+ messages in thread
From: Lennart Poettering @ 2020-03-31  7:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, torvalds, viro, dray, kzak, mszeredi, swhiteho,
	jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel,
	cyphar

On Mo, 30.03.20 23:17, Christian Brauner (christian.brauner@ubuntu.com) wrote:

> Fwiw, putting down my kernel hat and speaking as someone who maintains
> two container runtimes and various other low-level bits and pieces in
> userspace who'd make heavy use of this stuff I would prefer the fd-based
> fsinfo() approach especially in the light of across namespace
> operations, querying all properties of a mount atomically all-at-once,
> and safe delegation through fds. Another heavy user of this would be
> systemd (Cced Lennart who I've discussed this with) which would prefer
> the fd-based approach as well. I think pulling this into a filesystem
> and making userspace parse around in a filesystem tree to query mount
> information is the wrong approach and will get messy pretty quickly
> especially in the face of mount and user namespace interactions and
> various other pitfalls. fsinfo() fits quite nicely with the all-fd-based
> approach of the whole mount api. So yes, definitely preferred from my
> end.

Christian is right. I think it's very important to have an API that
allows querying fs attributes in a consistent state,
i.e. so that the attributes userspace is interested in can be queried
in a single call, so they all describe the very same point in
time. Distributing attributes onto multiple individual files just
sucks, because it's then guaranteed that you never can read them in a
way they actually fit together, some attributes you read will be
older, others newer. It's a big design flaw of sysfs (which is
structured like this) if you ask me.

I don't really care if the kernel API for this is binary or
textual. Slight preference for binary, but I don't care too much.

I think it would be wise to bind such APIs to fds, simply because it
always works. Doing path based stuff sucks, because you always need to
mount stuff and have a path tree set up, which is less ideal in a
world where namespacing is common, and namespaces are a shared concept
(at least with your other threads, if not with other processes). As a
maintainer of an init system I really dislike APIs that I can only use
after a mount structure has been set up, too often we want to do stuff
before that. Moreover, philosophically I find it questionable to use
path based APIs to interface with the path object hierarchy itself. It
feels "too recursive". Just keep this separate: build stuff on top of
the fs that fits on top of the fs, but don't build fs APIs on top of
fs APIs that stem from the same layer.

Summary: atomic APIs rock, fd-based APIs rock. APIs built on
individual files one can only read individually suck. APIs of the path
layer exposed in the path layer suck.

Hope this makes some sense?

Lennart

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  5:11   ` Miklos Szeredi
@ 2020-03-31  8:15     ` Christian Brauner
  2020-03-31  8:34       ` Miklos Szeredi
  2020-03-31  8:34     ` Karel Zak
  2020-03-31 21:54     ` David Howells
  2 siblings, 1 reply; 99+ messages in thread
From: Christian Brauner @ 2020-03-31  8:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> 
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
> 
> fsinfo(2) doesn't meet the atomically all-at-once requirement.  Sure,
> it's possible to check the various change counters before and after a
> batch of calls to check that the result is consistent.  Still, that's
> not an atomic all-at-once query; if you'd really require that, then
> fsinfo(2) as it currently stands would be inadequate.

If at all, that's only true for batch requests.

> 
> > and safe delegation through fds. Another heavy user of this would be
> > systemd (Cced Lennart who I've discussed this with) which would prefer
> > the fd-based approach as well. I think pulling this into a filesystem
> > and making userspace parse around in a filesystem tree to query mount
> > information is the wrong approach and will get messy pretty quickly
> > especially in the face of mount and user namespace interactions and
> > various other pitfalls.
> 
> Have you actually looked at my proposed patch?   Do you have concrete

Yes. So have others, Al actively disliked and nacked it and no-one got
excited about it.

> issues or just vague bad feelings?

We have had that discussion on-list where I made my "vague bad feelings"
clear where you responded with the same dismissive style so I don't see
the point in repeating this experience.

Again, I want to make it clear that here I'm stating my preference as a
user of this api and as such I don't want to have to parse through a
filesystem to get complex information about filesystems. We've had
fruitful discussions [1] around how fsinfo() ties in with supervised
mounts and the rest of the mount api and it's clear and simple especially
in the face of namespaces and implements a nice delegation model. So +1
from me.

Christian

[1]: https://youtu.be/LN2CUgp8deo?th40

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  5:11   ` Miklos Szeredi
  2020-03-31  8:15     ` Christian Brauner
@ 2020-03-31  8:34     ` Karel Zak
  2020-03-31  8:56       ` Miklos Szeredi
  2020-03-31 21:54     ` David Howells
  2 siblings, 1 reply; 99+ messages in thread
From: Karel Zak @ 2020-03-31  8:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> 
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
> 
> fsinfo(2) doesn't meet the atomically all-at-once requirement.

I guess your /proc based idea has exactly the same problem...

I see two possible ways:

- after open("/mnt", O_PATH) create a copy-on-write object in the kernel to
  represent the mount node -- the kernel will be able to modify it, but
  userspace will get unchanged data from the FD until close()

- improve fsinfo() to provide set (list) of the attributes by one call

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  8:15     ` Christian Brauner
@ 2020-03-31  8:34       ` Miklos Szeredi
  0 siblings, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-31  8:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 10:15 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> >
> > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > two container runtimes and various other low-level bits and pieces in
> > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > fsinfo() approach especially in the light of across namespace
> > > operations, querying all properties of a mount atomically all-at-once,
> >
> > fsinfo(2) doesn't meet the atomically all-at-once requirement.  Sure,
> > it's possible to check the various change counters before and after a
> > batch of calls to check that the result is consistent.  Still, that's
> > not an atomic all-at-once query; if you'd really require that, then
> > fsinfo(2) as it currently stands would be inadequate.
>
> If at all, that's only true for batch requests.

For example, there's no way to atomically query mount flags, parent,
and list of children with a single fsinfo() call, you actually need
three calls and they can reflect different states of the same mount.
Not saying this is a problem, just that there's no list of
requirements on what is needed and why.

> > > and safe delegation through fds. Another heavy user of this would be
> > > systemd (Cced Lennart who I've discussed this with) which would prefer
> > > the fd-based approach as well. I think pulling this into a filesystem
> > > and making userspace parse around in a filesystem tree to query mount
> > > information is the wrong approach and will get messy pretty quickly
> > > especially in the face of mount and user namespace interactions and
> > > various other pitfalls.
> >
> > Have you actually looked at my proposed patch?   Do you have concrete
>
> Yes. So have others, Al actively disliked and nacked it and no-one got
> excited about it.

Al, as far as I remember, nacked several things the patch was doing.
I fixed those.

> > issues or just vague bad feelings?
>
> We have had that discussion on-list where I made my "vague bad feelings"
> clear where you responded with the same dismissive style so I don't see
> the point in repeating this experience.
>
> Again, I want to make it clear that here I'm stating my preference as a
> user of this api and as such I don't want to have to parse through a
> filesystem to get complex information about filesystems. We've had
> fruitful discussions [1] around how fsinfo() ties in with supervised
> mounts and the rest of the mount api and it's clear and simple especially
> in the face of namespaces and implements a nice delegation model. So +1
> from me.

And you keep ignoring the fact that my patch implements that exact
same delegation model.  That's why I'm asking if you have looked at it
or not.

The "I don't want to have to parse through a filesystem to get complex
information about filesystems" is not a set of requirements that an
API can be designed from.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  8:34     ` Karel Zak
@ 2020-03-31  8:56       ` Miklos Szeredi
  2020-03-31  9:49         ` Karel Zak
                           ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-31  8:56 UTC (permalink / raw)
  To: Karel Zak
  Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote:
>
> On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> >
> > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > two container runtimes and various other low-level bits and pieces in
> > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > fsinfo() approach especially in the light of across namespace
> > > operations, querying all properties of a mount atomically all-at-once,
> >
> > fsinfo(2) doesn't meet the atomically all-at-once requirement.
>
> I guess your /proc based idea has exactly the same problem...

Yes, that's exactly what I wanted to demonstrate: there's no
fundamental difference between the two API's in this respect.

> I see two possible ways:
>
> - after open("/mnt", O_PATH) create a copy-on-write object in the kernel to
>   represent the mount node -- the kernel will be able to modify it, but
>   userspace will get unchanged data from the FD until close()
>
> - improve fsinfo() to provide set (list) of the attributes by one call

I think we are approaching this from the wrong end.   Let's just
ignore all of the proposed interfaces for now and only concentrate on
what this will be used for.

Start with a set of use cases by all interested parties.  E.g.

 - systemd wants to keep track of attached mounts in a namespace, as well
as new detached mounts created by fsmount()

 - systemd needs to keep information (such as parent, children, mount
flags, fs options, etc) up to date on any change of topology or
attributes.

 - util-linux needs to display the topology and state of mounts in the
system in a way that corresponds to a consistent state of that set of mounts

 - etc...

From that we can derive a set of requirements for the API.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi
@ 2020-03-31  9:21   ` Karel Zak
  0 siblings, 0 replies; 99+ messages in thread
From: Karel Zak @ 2020-03-31  9:21 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Linus Torvalds, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, keyrings, linux-fsdevel, linux-kernel

On Mon, Mar 30, 2020 at 10:28:56PM +0200, Miklos Szeredi wrote:
> All this could be solved with a string key/value representation of the
> same data, with minimal performance loss on encoding/parsing.  The
> proposed fs interface[1] is one example of that, but I could also
> imagine a syscall based one too.

Yes, key/value is a possible solution. The question is whether we really
need to add an extra /sys-like filesystem to get key/value ;-) I can
imagine key/value from an FD-based interface without open/read/close for
each attribute:

    fd = open("/mnt", O_PATH);
    fsinfo(fd, "propagation", buf, sizeof(buf));
    fsinfo(fd, "fstype", buf, sizeof(buf));
    close(fd);

Why would I need /mountfs/<id>/propagation and /mountfs/<id>/fstype to get
the same? It sounds like over-engineering without any extra bonus.

Anyway, if we have FD based interfaces like fsopen(), fsmount(),
open_tree() and move_mount() then it sounds strange that you cannot
use the FD to ask the kernel for the mount node attributes and instead
need to open and read yet more /sys-like files.

IMHO it would be nice if, after open("/mnt", O_PATH), I could do anything
with the mount point (umount, move, reconfigure, query, etc.). Please,
try to keep it simple and consistent ;-)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  8:56       ` Miklos Szeredi
@ 2020-03-31  9:49         ` Karel Zak
  2020-03-31 12:25         ` Lennart Poettering
  2020-03-31 21:56         ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: Karel Zak @ 2020-03-31  9:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, David Howells, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 10:56:35AM +0200, Miklos Szeredi wrote:
> I think we are approaching this from the wrong end.   Let's just
> ignore all of the proposed interfaces for now and only concentrate on
> what this will be used for.
> 
> Start with a set of use cases by all interested parties.  E.g.
> 
>  - systemd wants to keep track attached mounts in a namespace, as well
> as new detached mounts created by fsmount()
> 
>  - systemd need to keep information (such as parent, children, mount
> flags, fs options, etc) up to date on any change of topology or
> attributes.
> 
>  - util linux needs to display the topology and state of mounts in the
> system that corresponds to a consistent state that set of mounts

  - like systemd, mount/umount also need to query a single mountpoint
  rather than parse all of /proc/self/mountinfo

 Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  8:56       ` Miklos Szeredi
  2020-03-31  9:49         ` Karel Zak
@ 2020-03-31 12:25         ` Lennart Poettering
  2020-03-31 15:10           ` Miklos Szeredi
  2020-03-31 21:56         ` David Howells
  2 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-03-31 12:25 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds,
	Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Tue, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote:
> >
> > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > >
> > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > two container runtimes and various other low-level bits and pieces in
> > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > fsinfo() approach especially in the light of across namespace
> > > > operations, querying all properties of a mount atomically all-at-once,
> > >
> > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> >
> > I guess your /proc based idea have exactly the same problem...
>
> Yes, that's exactly what I wanted to demonstrate: there's no
> fundamental difference between the two API's in this respect.
>
> > I see two possible ways:
> >
> > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> >   represent mount node -- kernel will able to modify it, but userspace
> >   will get unchanged data from the FD until to close()
> >
> > - improve fsinfo() to provide set (list) of the attributes by one call
>
> I think we are approaching this from the wrong end.   Let's just
> ignore all of the proposed interfaces for now and only concentrate on
> what this will be used for.
>
> Start with a set of use cases by all interested parties.  E.g.
>
>  - systemd wants to keep track attached mounts in a namespace, as well
> as new detached mounts created by fsmount()
>
>  - systemd need to keep information (such as parent, children, mount
> flags, fs options, etc) up to date on any change of topology or
> attributes.

- We also have code that recursively remounts r/o or unmounts some
  directory tree (with filters), which is currently nasty to do since
  the relationships between dirs are not always clear from
  /proc/self/mountinfo alone, in particular not in an even remotely
  atomic fashion, or when stuff is overmounted.

- We also have code that needs to check if /dev/ is plain tmpfs or
  devtmpfs. We cannot use statfs for that, since in both cases
  TMPFS_MAGIC is reported, hence we currently parse
  /proc/self/mountinfo for that to find the fstype string there, which
  is different for both cases.
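
A minimal sketch of that mountinfo lookup, for illustration only. It is
not systemd's actual code; the names mountinfo_fstype() and
path_fstype() are made up here, and escaped characters (such as \040)
in mount points are not decoded:

```c
#include <stdio.h>
#include <string.h>

/*
 * Check one /proc/self/mountinfo line.  If its mount point (field 5)
 * equals `mountpoint`, copy the fstype (first field after " - ") into
 * `fstype` and return 1; otherwise return 0.
 */
int mountinfo_fstype(const char *line, const char *mountpoint,
		     char *fstype, size_t len)
{
	char mp[4096];
	const char *sep;
	size_t n;

	/* Fields: ID, parent ID, major:minor, root, mount point, ... */
	if (sscanf(line, "%*d %*d %*s %*s %4095s", mp) != 1)
		return 0;
	if (strcmp(mp, mountpoint) != 0)
		return 0;

	/* The optional fields end at " - "; the fstype comes next. */
	sep = strstr(line, " - ");
	if (!sep)
		return 0;
	sep += 3;
	n = strcspn(sep, " \n");
	if (n == 0 || n >= len)
		return 0;
	memcpy(fstype, sep, n);
	fstype[n] = '\0';
	return 1;
}

/* Scan the whole file; the last matching line is the one kept. */
int path_fstype(const char *mountpoint, char *fstype, size_t len)
{
	char line[8192];
	int found = 0;
	FILE *f = fopen("/proc/self/mountinfo", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (mountinfo_fstype(line, mountpoint, fstype, len))
			found = 1;
	fclose(f);
	return found;
}
```

Calling path_fstype("/dev", buf, sizeof(buf)) would then leave either
"tmpfs" or "devtmpfs" in buf, which is the distinction statfs cannot
make.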

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 12:25         ` Lennart Poettering
@ 2020-03-31 15:10           ` Miklos Szeredi
  2020-03-31 15:24             ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-31 15:10 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds,
	Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Di, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote:
> > >
> > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > > two container runtimes and various other low-level bits and pieces in
> > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > > fsinfo() approach especially in the light of across namespace
> > > > > operations, querying all properties of a mount atomically all-at-once,
> > > >
> > > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> > >
> > > I guess your /proc based idea have exactly the same problem...
> >
> > Yes, that's exactly what I wanted to demonstrate: there's no
> > fundamental difference between the two API's in this respect.
> >
> > > I see two possible ways:
> > >
> > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> > >   represent mount node -- kernel will able to modify it, but userspace
> > >   will get unchanged data from the FD until to close()
> > >
> > > - improve fsinfo() to provide set (list) of the attributes by one call
> >
> > I think we are approaching this from the wrong end.   Let's just
> > ignore all of the proposed interfaces for now and only concentrate on
> > what this will be used for.
> >
> > Start with a set of use cases by all interested parties.  E.g.
> >
> >  - systemd wants to keep track attached mounts in a namespace, as well
> > as new detached mounts created by fsmount()
> >
> >  - systemd need to keep information (such as parent, children, mount
> > flags, fs options, etc) up to date on any change of topology or
> > attributes.
>
> - We also have code that recursively remounts r/o or unmounts some
>   directory tree (with filters),

Recursive remount-ro is clear.  What is not clear is whether you need
to do this for hidden mounts (not possible from userspace without a
way to disable mount following on path lookup).  Would it make sense
to add a kernel API for recursive setting of mount flags?

What exactly is this unmount with filters?  Can you give examples?

> - We also have code that needs to check if /dev/ is plain tmpfs or
>   devtmpfs. We cannot use statfs for that, since in both cases
>   TMPFS_MAGIC is reported, hence we currently parse
>   /proc/self/mountinfo for that to find the fstype string there, which
>   is different for both cases.

Okay.

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 15:10           ` Miklos Szeredi
@ 2020-03-31 15:24             ` Lennart Poettering
  0 siblings, 0 replies; 99+ messages in thread
From: Lennart Poettering @ 2020-03-31 15:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Karel Zak, Christian Brauner, David Howells, Linus Torvalds,
	Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Tue, 31.03.20 17:10, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Di, 31.03.20 10:56, Miklos Szeredi (miklos@szeredi.hu) wrote:
> >
> > > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <kzak@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > > > <christian.brauner@ubuntu.com> wrote:
> > > > >
> > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > > > two container runtimes and various other low-level bits and pieces in
> > > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > > > fsinfo() approach especially in the light of across namespace
> > > > > > operations, querying all properties of a mount atomically all-at-once,
> > > > >
> > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> > > >
> > > > I guess your /proc based idea have exactly the same problem...
> > >
> > > Yes, that's exactly what I wanted to demonstrate: there's no
> > > fundamental difference between the two API's in this respect.
> > >
> > > > I see two possible ways:
> > > >
> > > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> > > >   represent mount node -- kernel will able to modify it, but userspace
> > > >   will get unchanged data from the FD until to close()
> > > >
> > > > - improve fsinfo() to provide set (list) of the attributes by one call
> > >
> > > I think we are approaching this from the wrong end.   Let's just
> > > ignore all of the proposed interfaces for now and only concentrate on
> > > what this will be used for.
> > >
> > > Start with a set of use cases by all interested parties.  E.g.
> > >
> > >  - systemd wants to keep track attached mounts in a namespace, as well
> > > as new detached mounts created by fsmount()
> > >
> > >  - systemd need to keep information (such as parent, children, mount
> > > flags, fs options, etc) up to date on any change of topology or
> > > attributes.
> >
> > - We also have code that recursively remounts r/o or unmounts some
> >   directory tree (with filters),
>
> Recursive remount-ro is clear.  What is not clear is whether you need
> to do this for hidden mounts (not possible from userspace without a
> way to disable mount following on path lookup).  Would it make sense
> to add a kernel API for recursive setting of mount flags?

I would be very happy about an explicit kernel API for recursively
toggling MS_RDONLY. But for many use cases in systemd we need the
ability to filter some subdirs and leave them as is, so while helpful,
we'd have to keep the userspace code we currently have anyway.

> What exactly is this unmount with filters?  Can you give examples?

Hmm, actually it's only the r/o remount that has filters, not the
unmount. Sorry for the confusion. And the r/o remount with filters
just means: "remount everything below X read-only except for X/Y and
X/Z/A"...
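
To make the filter semantics concrete, here is a small sketch of the
per-mount decision. It is an illustration, not the systemd
implementation: the function names are hypothetical, and it assumes
that everything below an excluded path is left alone as well.

```c
#include <stddef.h>
#include <string.h>

/* True if `path` equals `prefix` or lies below it on a component boundary. */
int path_is_at_or_below(const char *path, const char *prefix)
{
	size_t n = strlen(prefix);

	if (n == 1 && prefix[0] == '/')
		return path[0] == '/';
	if (strncmp(path, prefix, n) != 0)
		return 0;
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Decide whether the mount at `mnt` should be remounted read-only:
 * it must be at or below `top` and not at or below any path in `keep`.
 */
int should_remount_ro(const char *mnt, const char *top,
		      const char *const *keep, size_t nkeep)
{
	size_t i;

	if (!path_is_at_or_below(mnt, top))
		return 0;
	for (i = 0; i < nkeep; i++)
		if (path_is_at_or_below(mnt, keep[i]))
			return 0;
	return 1;
}
```

With top = "/X" and keep = { "/X/Y", "/X/Z/A" }, a mount on /X/Z is
selected while mounts on /X/Y and /X/Z/A are skipped. The caller still
has to enumerate the mounts below X, which is the part that currently
requires parsing /proc/self/mountinfo.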

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
                   ` (4 preceding siblings ...)
  2020-03-30 21:17 ` Christian Brauner
@ 2020-03-31 17:31 ` David Howells
  2020-03-31 19:42   ` Miklos Szeredi
                     ` (3 more replies)
  2020-03-31 21:52 ` David Howells
  6 siblings, 4 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 17:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel,
	linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> The basic problem in my view, is that the performance requirement of a
> "get filesystem information" type of API just does not warrant a
> binary coded interface. I've said this a number of times, but it fell
> on deaf ears.

It hasn't so fallen, but I don't necessarily agree with you.  Let's pin some
numbers on this.

I'm using what I think is your latest patch, in which you look up
/proc/<pid>/fdinfo/<fd> to find a file that gives a summary of some
information in "key: val" format, including a mount ID.  You then have to look
in a mounted mountfs magic filesystem for the directory corresponding to that
mount ID, which has a bunch of attribute files in it, most with a single
attribute value.

What I can do with it is, say, look up the mount ID of the object attached to
a path - but that's about all because it doesn't implement anything like
look-up-by-mount-ID or list-children.

Attached is a kernel patch, supplementary to the fsinfo patchset, that adds
your implementation, fixed for coexistence with the mount notifications code,
plus a sample program that creates N mounts and then sees how long it takes to
query each of those mounts for its mnt_id by four different methods:

 (1) "f" - Use fsinfo, looking up each mount root directly by path.

 (2) "f2" - Use fsinfo, firstly using fsinfo() to look up the base mount by
     path, then use fsinfo() to get a list of all the children of that mount
     (which in fact gives me the mnt_id, but ignoring that), and then call
     fsinfo() by mount ID for each child to get its information, including its
     mnt_id.

 (3) "p" - Open the root of each mount with O_PATH and then open and read the
     procfile to retrieve information, then parse the received text to find
     the line with the mount ID key, then parse the line to get the number,
     allowing for the possibility that the line might acquire extra numbers.

 (4) "p2" - Open the root of the base mount with O_PATH, then read the
     appropriate file in /proc/<pid>/fdinfo to find the base mount ID.  Open
     "/mnt" with O_PATH to use as a base.  Then read <mntid>/children and
     parse the list to find each child.  Each child's <mntid>/id file is then
     read.
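
For readers who want to see what method (3) involves, here is a rough
sketch.  It is not the code from the attached test program: the function
names are invented, and it reads the "mnt_id:" line that mainline kernels
already expose in /proc/self/fdinfo/<fd> rather than the file added by the
patch under test.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_PATH */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Find the "mnt_id:" line in fdinfo-style text and return the first
 * number on it, ignoring anything that follows.  Returns -1 if absent.
 */
int parse_mnt_id(const char *text)
{
	const char *p = text;
	int id;

	while (p && *p) {
		if (strncmp(p, "mnt_id:", 7) == 0 &&
		    sscanf(p + 7, "%d", &id) == 1)
			return id;
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}

/* Open `path` with O_PATH, read its fdinfo file and extract the mount ID. */
int path_mnt_id(const char *path)
{
	char name[64], buf[1024];
	ssize_t n;
	int fd, infofd, id = -1;

	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd < 0)
		return -1;
	snprintf(name, sizeof(name), "/proc/self/fdinfo/%d", fd);
	infofd = open(name, O_RDONLY | O_CLOEXEC);
	if (infofd >= 0) {
		n = read(infofd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			id = parse_mnt_id(buf);
		}
		close(infofd);
	}
	close(fd);
	return id;
}
```

Each query costs two open() calls, a read(), two close() calls and a text
parse, against a single system call for the fsinfo() variants.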

Run the program like:

	mount -t mountfs none /mnt
	mkdir /tmp/a
	./test-fsinfo-perf /tmp/a 20000

Note that it detaches its base mount afterwards and lets it get cleaned up,
and systemd goes crazy for a bit.  Note also that it prints the sum of all the
mount IDs as a consistency check for each test.

Okay, the results:

  For  1000 mounts, f= 1514us f2= 1102us p=  6014us p2=  6935us; p=4.0*f p=5.5*f2 p=0.9*p2
  For  2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
  For  3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
  For  4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
  For  5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
  For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
  For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
  For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2
      
The number of mounts doesn't have an effect on the ratios - not surprising
with direct pathwalk-based approaches ("f" and "p") since the pathwalk part is
the same in both cases, though in one fsinfo() does it and in the other,
open(O_PATH).

As you can see, your procfs-based approach takes consistently about 4.4 times
as long as fsinfo(QUERY_PATH) and 5.5 times as long as fsinfo(QUERY_MOUNT).

Going through mountfs ("p2") is even slower than going through procfs, though
this really ought to be comparable to fsinfo-by-mount-ID ("f2"), but the
latter is something like 6.5x faster.
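
The ratios can be rechecked from the table with a few lines of C.  This is
just arithmetic on the figures above; the helper names are made up for the
example.

```c
#include <stdio.h>

/* How many times longer the slow method took than the fast one. */
double slowdown(unsigned long slow_us, unsigned long fast_us)
{
	return (double)slow_us / (double)fast_us;
}

/* Print the three ratios discussed above for one row of the table. */
void print_ratios(unsigned long f, unsigned long f2,
		  unsigned long p, unsigned long p2)
{
	printf("p=%.1f*f p=%.1f*f2 p2=%.1f*f2\n",
	       slowdown(p, f), slowdown(p, f2), slowdown(p2, f2));
}
```

For the 20000-mount row, print_ratios(45811, 37211, 203640, 237377) gives
roughly 4.4, 5.5 and 6.4 respectively.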

I suspect the procfs-based and mountfs-based approaches suffer from creating
lots of inodes, dentries and file structs as you access the files.  This also
means that they use more live state memory - and I think it lingers - if you
start using them, whereas fsinfo() uses none at all, beyond whatever is used
by the pathwalk to find the object to query (if you go that route).

mountfs is also going to be worse if you want more than one value and you
persist in putting one attribute in each file.

David
---
commit ed109ef4351d44a3e881e6518a207431113c17c0
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 14:39:07 2020 +0100

    Performance test Miklós's patch vs fsinfo

diff --git a/fs/Makefile b/fs/Makefile
index b6bf2424c7f7..ac0627176db1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
diff --git a/fs/mount.h b/fs/mount.h
index 063f41bc2e93..89b091fc482f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -82,6 +82,7 @@ struct mount {
 	atomic_t mnt_subtree_notifications;	/* Number of notifications in subtree */
 	struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
 #endif
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered,
 {
 }
 #endif
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
+
diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile
new file mode 100644
index 000000000000..35a65e9a966f
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
new file mode 100644
index 000000000000..82c01eb6154d
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt;
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int tmp, err = -ENODEV;
+
+	mnt_namespace_lock_read();
+
+	mnt = entry->mnt;
+	if (!mnt || !mnt->mnt_ns)
+		goto out;
+
+	err = 0;
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+		path_put(&root);
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		seq_printf(sf, "%i\n", tmp);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		struct mount *child;
+		bool first = true;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			if (!first)
+				seq_putc(sf, ',');
+			else
+				first = false;
+			seq_printf(sf, "%i", child->mnt_id);
+		}
+		if (!first)
+			seq_putc(sf, '\n');
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			seq_printf(sf, "%i\n", tmp);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			tmp = get_dominating_id(mnt, &root);
+			if (tmp)
+				seq_printf(sf, "%i\n", tmp);
+		}
+	} else {
+		WARN_ON(1);
+		err = -EIO;
+	}
+out:
+	mnt_namespace_unlock_read();
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+	spin_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	spin_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	snprintf(buf, sizeof(buf), "%lu", mnt_id);
+	if (strcmp(buf, name) != 0)
+		return NULL;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	spin_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOENT);
+		i++;
+		kref_get(&entry->kref);
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len, pos, id;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	pos = ctx->pos - 2;
+	do {
+		spin_lock(&mountfs_lock);
+		mountfs_find_node(pos, &node);
+		pos = 1U + INT_MAX;
+		do {
+			if (!node) {
+				spin_unlock(&mountfs_lock);
+				goto out;
+			}
+			entry = mountfs_node_to_entry(node);
+			node = rb_next(node);
+		} while (!mountfs_entry_visible(entry));
+		if (node)
+			pos = mountfs_node_to_entry(node)->id;
+		id = entry->id;
+		spin_unlock(&mountfs_lock);
+
+		len = snprintf(name, sizeof(name), "%i", id);
+		ctx->pos = id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+			return 0;
+	} while (pos <= INT_MAX);
+out:
+	ctx->pos =  pos + 2;
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	dentry = d_hash_and_lookup(root, &this);
+	if (dentry && dentry->d_inode->i_private != entry) {
+		d_invalidate(dentry);
+		dput(dentry);
+		dentry = NULL;
+	}
+	if (!dentry) {
+		dentry = d_alloc(root, &this);
+		if (!dentry)
+			return -ENOMEM;
+
+		kref_get(&entry->kref);
+		old = mountfs_lookup_entry(dentry, entry, 0);
+		if (old) {
+			dput(dentry);
+			if (IS_ERR(old))
+				return PTR_ERR(old);
+			dentry = old;
+		}
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
diff --git a/fs/namespace.c b/fs/namespace.c
index 5427e732c1bf..a05a2885a90e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
 
 	if (fc->sb_flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
+	else
+		mountfs_create(mnt);
 
 	atomic_inc(&fc->root->d_sb->s_active);
 	mnt->mnt.mnt_sb		= fc->root->d_sb;
@@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt.mnt_sb;
@@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+			       int flag)
+{
+	struct mount *mnt = clone_mnt_common(old, root, flag);
+
+	if (!IS_ERR(mnt))
+		mountfs_create(mnt);
+
+	return mnt;
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	struct hlist_node *p;
@@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt)
 	unlock_mount_hash();
 	shrink_dentry_list(&list);
 
+	mountfs_remove(mnt);
+
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
 		if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
 struct vfsmount *mnt_clone_internal(const struct path *path)
 {
 	struct mount *p;
-	p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+	p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
 	if (IS_ERR(p))
 		return ERR_CAST(p);
 	p->mnt.mnt_flags |= MNT_INTERNAL;
 	return &p->mnt;
 }
 
+
 #ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
 static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1411,6 +1428,16 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void mnt_namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..0477f8b51182 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..94a57e178801 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = {
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
diff --git a/fs/proc/fd.h b/fs/proc/fd.h
index f371a602bf58..9e087c833e65 100644
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations;
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..e634faa9160e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1600034a929b..9726baba1732 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 770c2bf3aa43..9dd7812eb777 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *);
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 19be60ab950e..78deb8483d27 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,6 +4,7 @@
 hostprogs := \
 	test-fsinfo \
 	test-fsmount \
+	test-fsinfo-perf \
 	test-mntinfo \
 	test-statx
 
@@ -12,6 +13,7 @@ always-y := $(hostprogs)
 HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-fsinfo += -static -lm
 HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include
 
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 17:31 ` David Howells
@ 2020-03-31 19:42   ` Miklos Szeredi
  2020-03-31 19:47   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-03-31 19:42 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel,
	linux-kernel

On Tue, Mar 31, 2020 at 7:31 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > The basic problem in my view, is that the performance requirement of a
> > "get filesystem information" type of API just does not warrant a
> > binary coded interface. I've said this a number of times, but it fell
> > on deaf ears.
>
> It hasn't so fallen, but I don't necessarily agree with you.  Let's pin some
> numbers on this.

Cool, thanks for testing.  Unfortunately the test-fsinfo-perf.c file
didn't make it into the patch.   Can you please refresh and resend?

> Okay, the results:
>
>   For  1000 mounts, f= 1514us f2= 1102us p=  6014us p2=  6935us; p=4.0*f p=5.5*f2 p=0.9*p2
>   For  2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
>   For  3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
>   For  4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
>   For  5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
>   For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
>   For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
>   For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2

So even the p2 method will give at least 80k queries/s, which is quite
good, considering that the need to rescan the complete mount tree
should be exceedingly rare (and in case it mattered, could be
optimized by priming from /proc/self/mountinfo).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 17:31 ` David Howells
  2020-03-31 19:42   ` Miklos Szeredi
@ 2020-03-31 19:47   ` David Howells
  2020-03-31 21:14   ` David Howells
  2020-03-31 21:23   ` David Howells
  3 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 19:47 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel,
	linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Cool, thanks for testing.  Unfortunately the test-fsinfo-perf.c file
> didn't make it into the patch.   Can you please refresh and resend?

Oops - I forgot to add it.  See attached.

David
---
commit b7239021cb7660bf328bb7fcce05e3a35ce5842b
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 14:39:07 2020 +0100

    Performance test Miklós's patch vs fsinfo

diff --git a/fs/Makefile b/fs/Makefile
index b6bf2424c7f7..ac0627176db1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
diff --git a/fs/mount.h b/fs/mount.h
index 063f41bc2e93..89b091fc482f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -82,6 +82,7 @@ struct mount {
 	atomic_t mnt_subtree_notifications;	/* Number of notifications in subtree */
 	struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
 #endif
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered,
 {
 }
 #endif
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
+
diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile
new file mode 100644
index 000000000000..35a65e9a966f
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
new file mode 100644
index 000000000000..82c01eb6154d
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt;
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int tmp, err = -ENODEV;
+
+	mnt_namespace_lock_read();
+
+	mnt = entry->mnt;
+	if (!mnt || !mnt->mnt_ns)
+		goto out;
+
+	err = 0;
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+		path_put(&root);
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		seq_printf(sf, "%i\n", tmp);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		struct mount *child;
+		bool first = true;
+
+		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+			if (!first)
+				seq_putc(sf, ',');
+			else
+				first = false;
+			seq_printf(sf, "%i", child->mnt_id);
+		}
+		if (!first)
+			seq_putc(sf, '\n');
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			seq_printf(sf, "%i\n", tmp);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			tmp = get_dominating_id(mnt, &root);
+			if (tmp)
+				seq_printf(sf, "%i\n", tmp);
+		}
+	} else {
+		WARN_ON(1);
+		err = -EIO;
+	}
+out:
+	mnt_namespace_unlock_read();
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+	spin_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	spin_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	snprintf(buf, sizeof(buf), "%lu", mnt_id);
+	if (strcmp(buf, name) != 0)
+		return NULL;
+
+	spin_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	spin_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOENT);
+		i++;
+		kref_get(&entry->kref);
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len, pos, id;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	pos = ctx->pos - 2;
+	do {
+		spin_lock(&mountfs_lock);
+		mountfs_find_node(pos, &node);
+		pos = 1U + INT_MAX;
+		do {
+			if (!node) {
+				spin_unlock(&mountfs_lock);
+				goto out;
+			}
+			entry = mountfs_node_to_entry(node);
+			node = rb_next(node);
+		} while (!mountfs_entry_visible(entry));
+		if (node)
+			pos = mountfs_node_to_entry(node)->id;
+		id = entry->id;
+		spin_unlock(&mountfs_lock);
+
+		len = snprintf(name, sizeof(name), "%i", id);
+		ctx->pos = id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+			return 0;
+	} while (pos <= INT_MAX);
+out:
+	ctx->pos =  pos + 2;
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	dentry = d_hash_and_lookup(root, &this);
+	if (dentry && dentry->d_inode->i_private != entry) {
+		d_invalidate(dentry);
+		dput(dentry);
+		dentry = NULL;
+	}
+	if (!dentry) {
+		dentry = d_alloc(root, &this);
+		if (!dentry)
+			return -ENOMEM;
+
+		kref_get(&entry->kref);
+		old = mountfs_lookup_entry(dentry, entry, 0);
+		if (old) {
+			dput(dentry);
+			if (IS_ERR(old))
+				return PTR_ERR(old);
+			dentry = old;
+		}
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
diff --git a/fs/namespace.c b/fs/namespace.c
index 5427e732c1bf..a05a2885a90e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
 
 	if (fc->sb_flags & SB_KERNMOUNT)
 		mnt->mnt.mnt_flags = MNT_INTERNAL;
+	else
+		mountfs_create(mnt);
 
 	atomic_inc(&fc->root->d_sb->s_active);
 	mnt->mnt.mnt_sb		= fc->root->d_sb;
@@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
 					int flag)
 {
 	struct super_block *sb = old->mnt.mnt_sb;
@@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	return ERR_PTR(err);
 }
 
+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+			       int flag)
+{
+	struct mount *mnt = clone_mnt_common(old, root, flag);
+
+	if (!IS_ERR(mnt))
+		mountfs_create(mnt);
+
+	return mnt;
+}
+
 static void cleanup_mnt(struct mount *mnt)
 {
 	struct hlist_node *p;
@@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt)
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt)
 	unlock_mount_hash();
 	shrink_dentry_list(&list);
 
+	mountfs_remove(mnt);
+
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
 		if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
 struct vfsmount *mnt_clone_internal(const struct path *path)
 {
 	struct mount *p;
-	p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+	p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
 	if (IS_ERR(p))
 		return ERR_CAST(p);
 	p->mnt.mnt_flags |= MNT_INTERNAL;
 	return &p->mnt;
 }
 
+
 #ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
 static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1411,6 +1428,16 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void mnt_namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..0477f8b51182 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..94a57e178801 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = {
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
diff --git a/fs/proc/fd.h b/fs/proc/fd.h
index f371a602bf58..9e087c833e65 100644
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations;
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..e634faa9160e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1600034a929b..9726baba1732 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 770c2bf3aa43..9dd7812eb777 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *);
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 19be60ab950e..78deb8483d27 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,6 +4,7 @@
 hostprogs := \
 	test-fsinfo \
 	test-fsmount \
+	test-fsinfo-perf \
 	test-mntinfo \
 	test-statx
 
@@ -12,6 +13,7 @@ always-y := $(hostprogs)
 HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-fsinfo += -static -lm
 HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include
 
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
new file mode 100644
index 000000000000..fba40737f768
--- /dev/null
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -0,0 +1,361 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <sys/time.h>
+#include <linux/fsinfo.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+#define ERR(ret, what) do { if ((long)(ret) == -1) { perror(what); exit(1); } } while(0)
+#define OOM(ret) do { if (!(ret)) { perror(NULL); exit(1); } } while(0)
+
+static int nr_mounts = 3;
+static const char *base_path;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+	       struct fsinfo_params *params, size_t params_size,
+	       void *result_buffer, size_t result_buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename,
+		       params, params_size,
+		       result_buffer, result_buf_size);
+}
+
+static void iterate(void (*func)(int i, const char *))
+{
+	char name[4096];
+	int i;
+
+	for (i = 0; i < nr_mounts; i++) {
+		sprintf(name, "%s/%d", base_path, i);
+		func(i, name);
+	}
+}
+
+static void make_mount(int ix, const char *path)
+{
+	ERR(mkdir(path, 0755), "mkdir");
+	ERR(mount("none", path, "tmpfs", 0, NULL), "mount");
+	ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount");
+}
+
+static void do_umount(void)
+{
+	printf("--- umount ---\n");
+	if (umount2(base_path, MNT_DETACH) == -1)
+		perror("umount");
+}
+
+static unsigned long sum_mnt_id;
+
+static void get_mntid_by_fsinfo(int ix, const char *path)
+{
+	struct fsinfo_mount_info r;
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+		.request	= FSINFO_ATTR_MOUNT_INFO,
+	};
+
+	ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)),
+	    "fsinfo");
+	//printf("[%u] %u\n", ix, r.mnt_id);
+	sum_mnt_id += r.mnt_id;
+}
+
+static void get_mntid_by_proc(int ix, const char *path)
+{
+	unsigned int mnt_id;
+	ssize_t len;
+	char procfile[100], buffer[4096], *p, *nl;
+	int fd, fd2;
+
+	fd = open(path, O_PATH);
+	ERR(fd, "open/path");
+	sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+	fd2 = open(procfile, O_RDONLY);
+	ERR(fd2, "open/proc");
+	len = read(fd2, buffer, sizeof(buffer) - 1);
+	ERR(len, "read");
+	buffer[len] = 0;
+	close(fd2);
+	close(fd);
+
+	p = buffer;
+	do {
+		nl = strchr(p, '\n');
+		if (nl)
+			*nl++ = '\0';
+		else
+			nl = NULL;
+
+		if (strncmp(p, "mnt_id:", 7) != 0)
+			continue;
+		p += 7;
+		while (isblank(*p))
+			p++;
+		/* Have to allow for extra numbers being added to the line */
+		if (sscanf(p, "%u", &mnt_id) != 1) {
+			fprintf(stderr, "Bad format %s\n", procfile);
+			exit(3);
+		}
+		break;
+
+	} while ((p = nl));
+
+	if (!p) {
+		fprintf(stderr, "Missing field %s\n", procfile);
+		exit(3);
+	}
+
+	sum_mnt_id += mnt_id;
+	//printf("[%u] %u\n", ix, mnt_id);
+}
+
+static void get_mntid_by_fsinfo_2(void)
+{
+	struct fsinfo_mount_child *children, *c, *end;
+	struct fsinfo_mount_info r;
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+		.request	= FSINFO_ATTR_MOUNT_INFO,
+	};
+	unsigned int base_mnt_id;
+	size_t s_children, n_children;
+	char name[32];
+	int i;
+
+	/* Convert path to mount ID */
+	ERR(fsinfo(AT_FDCWD, base_path, &params, sizeof(params), &r, sizeof(r)),
+	    "fsinfo/base");
+	base_mnt_id = r.mnt_id;
+	//printf("[B] %u\n", base_mnt_id);
+
+	/* Get a list of all the children of this mount ID */
+	s_children = (nr_mounts + 1) * sizeof(*children);
+	children = malloc(s_children);
+	OOM(children);
+
+	params.flags	= FSINFO_FLAGS_QUERY_MOUNT;
+	params.request	= FSINFO_ATTR_MOUNT_CHILDREN;
+	sprintf(name, "%u", base_mnt_id);
+	s_children = fsinfo(AT_FDCWD, name, &params, sizeof(params), children, s_children);
+	ERR(s_children, "fsinfo/children");
+
+	/* Query each child */
+	n_children = s_children / sizeof(*c) - 1; // Parent is added at end
+	c = children;
+	end = c + n_children;
+	for (i = 0; c < end; c++, i++) {
+		//printf("[%u] %u\n", i, c->mnt_id);
+		params.flags	= FSINFO_FLAGS_QUERY_MOUNT;
+		params.request	= FSINFO_ATTR_MOUNT_INFO;
+		sprintf(name, "%u", c->mnt_id);
+		ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &r, sizeof(r)),
+		    "fsinfo/child");
+		sum_mnt_id += r.mnt_id;
+	}
+}
+
+static void get_mntid_by_mountfs(void)
+{
+	unsigned int base_mnt_id, mnt_id, x;
+	ssize_t len, s_children;
+	char procfile[100], buffer[100], *children, *p, *q, *nl, *comma;
+	int fd, fd2, mntfd, i;
+
+	/* Start off by reading the mount ID from the base path */
+	fd = open(base_path, O_PATH);
+	ERR(fd, "open/path");
+	sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+	fd2 = open(procfile, O_RDONLY);
+	ERR(fd2, "open/proc");
+	len = read(fd2, buffer, sizeof(buffer) - 1);
+	ERR(len, "read");
+	buffer[len] = 0;
+	close(fd2);
+	close(fd);
+
+	p = buffer;
+	do {
+		nl = strchr(p, '\n');
+		if (nl)
+			*nl++ = '\0';
+		else
+			nl = NULL;
+
+		if (strncmp(p, "mnt_id:", 7) != 0)
+			continue;
+		p += 7;
+		while (isblank(*p))
+			p++;
+		/* Have to allow for extra numbers being added to the line */
+		if (sscanf(p, "%u", &base_mnt_id) != 1) {
+			fprintf(stderr, "Bad format %s\n", procfile);
+			exit(3);
+		}
+		break;
+
+	} while ((p = nl));
+
+	if (!p) {
+		fprintf(stderr, "Missing field %s\n", procfile);
+		exit(3);
+	}
+
+	if (0) printf("[B] %u\n", base_mnt_id);
+
+	mntfd = open("/mnt", O_PATH);
+	ERR(mntfd, "open/mountfs");
+
+	/* Get a list of all the children of this mount ID */
+	s_children = (nr_mounts) * 12;
+	children = malloc(s_children);
+	OOM(children);
+
+	sprintf(procfile, "%u/children", base_mnt_id);
+	fd = openat(mntfd, procfile, O_RDONLY);
+	ERR(fd, "open/children");
+	s_children = read(fd, children, s_children - 1);
+	ERR(s_children, "read/children");
+	close(fd);
+	if (s_children > 0 && children[s_children - 1] == '\n')
+		s_children--;
+	children[s_children] = 0;
+
+	/* Query each child */
+	p = children;
+	if (!*p)
+		return;
+	i = 0;
+	do {
+		mnt_id = strtoul(p, &comma, 10);
+		if (*comma) {
+			if (*comma != ',') {
+				fprintf(stderr, "Bad format in mountfs-children\n");
+				exit(3);
+			}
+			comma++;
+		}
+
+		sprintf(procfile, "%u/id", mnt_id);
+		fd = openat(mntfd, procfile, O_RDONLY);
+		ERR(fd, procfile);
+		len = read(fd, buffer, sizeof(buffer) - 1);
+		ERR(len, "read/id");
+		close(fd);
+		if (len > 0 && buffer[len - 1] == '\n')
+			len--;
+		buffer[len] = 0;
+
+		x = strtoul(buffer, &q, 10);
+
+		if (*q) {
+			fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+			exit(3);
+		}
+
+		if (0) printf("[%u] %u\n", i++, x);
+		sum_mnt_id += x;
+	} while (p = comma, *comma);
+}
+
+static unsigned long duration(struct timeval *before, struct timeval *after)
+{
+	unsigned long a, b;
+
+	a = after->tv_sec * 1000000 + after->tv_usec;
+	b = before->tv_sec * 1000000 + before->tv_usec;
+	return a - b;
+}
+
+int main(int argc, char **argv)
+{
+	struct timeval f_before, f_after;
+	struct timeval f2_before, f2_after;
+	struct timeval p_before, p_after;
+	struct timeval p2_before, p2_after;
+	const char *path;
+	unsigned long f_dur, f2_dur, p_dur, p2_dur;
+
+	if (argc < 2) {
+		fprintf(stderr, "Format: %s <path> [nr_mounts]\n", argv[0]);
+		exit(2);
+	}
+
+	if (argc == 3)
+		nr_mounts = atoi(argv[2]);
+
+	path = argv[1];
+	ERR(mount("none", path, "tmpfs", 0, NULL), "mount");
+	ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount");
+	base_path = path;
+	atexit(do_umount);
+
+	printf("--- make mounts ---\n");
+	iterate(make_mount);
+
+	printf("--- test fsinfo by path ---\n");
+	sum_mnt_id = 0;
+	ERR(gettimeofday(&f_before, NULL), "gettimeofday");
+	iterate(get_mntid_by_fsinfo);
+	ERR(gettimeofday(&f_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+	printf("--- test fsinfo by mnt_id ---\n");
+	sum_mnt_id = 0;
+	ERR(gettimeofday(&f2_before, NULL), "gettimeofday");
+	get_mntid_by_fsinfo_2();
+	ERR(gettimeofday(&f2_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+	printf("--- test /proc/fdinfo ---\n");
+	sum_mnt_id = 0;
+	ERR(gettimeofday(&p_before, NULL), "gettimeofday");
+	iterate(get_mntid_by_proc);
+	ERR(gettimeofday(&p_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+	printf("--- test mountfs ---\n");
+	sum_mnt_id = 0;
+	ERR(gettimeofday(&p2_before, NULL), "gettimeofday");
+	get_mntid_by_mountfs();
+	ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+	f_dur  = duration(&f_before,  &f_after);
+	f2_dur = duration(&f2_before, &f2_after);
+	p_dur  = duration(&p_before,  &p_after);
+	p2_dur = duration(&p2_before, &p2_after);
+	//printf("fsinfo duration %10luus for %d mounts\n", f_dur, nr_mounts);
+	//printf("procfd duration %10luus for %d mounts\n", p_dur, nr_mounts);
+
+	printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2\n",
+	       nr_mounts, f_dur, f2_dur, p_dur, p2_dur,
+	       (double)p_dur / (double)f_dur,
+	       (double)p_dur / (double)f2_dur,
+	       (double)p_dur / (double)p2_dur);
+	return 0;
+}


* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 17:31 ` David Howells
  2020-03-31 19:42   ` Miklos Szeredi
  2020-03-31 19:47   ` David Howells
@ 2020-03-31 21:14   ` David Howells
  2020-03-31 21:23   ` David Howells
  3 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 21:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel,
	linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> So even the p2 method will give at least 80k queries/s, which is quite
> good, considering that the need to rescan the complete mount tree
> should be exceedingly rare (and in case it mattered, could be
> optimized by priming from /proc/self/mountinfo).

One thing to note is that the test is actually a little biased in favour of
the "p" test, where the mnt_id is looked up by path from /proc/fdinfo.  That's
not all that useful, except as an index into mountfs.  I'm not sure how much
use it is as a check on whether the mount is the same mount or not since mount
IDs can get reused.

If I instead use the parent_id all round as the desired target value, I then
see:

For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id

Some observations:

 (1) fsinfo() gets a bit slower, reflecting the extra locking that must be
     done to access the topology information (it's using a different
     attribute).

 (2) Going via /proc/fdinfo now includes a further access into mountfs - and
     this makes the access ~2.3x slower than it was before.

 (3) Going via mount ID directly into mountfs (the "p2" test) appears faster
     than it did (when it shouldn't have changed), though it's still slower
     than fsinfo.  This I ascribe to the caching of the inode and dentry from
     the "p" test.
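As a rough check on the ratios behind these observations, here is a minimal
sketch that divides the parent_id timings by the previous ones.  The timings
are the ones quoted in this thread; the struct and function names are made up
for illustration and are not part of the test program.

```c
#include <assert.h>
#include <stdio.h>

/* Timings in microseconds for 10000 mounts, as quoted in this thread. */
struct run {
	const char *label;
	unsigned long f, f2, p, p2;
};

const struct run prev_run   = { "prev email", 22899, 18240, 101054, 117273 };
const struct run parent_run = { "parent_id",  24853, 20453, 235581,  59798 };

/* Ratio of the new timing to the old one; above 1.0 means slower. */
double slowdown(unsigned long old_us, unsigned long new_us)
{
	assert(old_us != 0);
	return (double)new_us / (double)old_us;
}

/* Print how each of the four tests changed between two runs. */
void print_slowdowns(const struct run *before, const struct run *after)
{
	printf("%s -> %s\n", before->label, after->label);
	printf("f   x%.2f\n", slowdown(before->f,  after->f));
	printf("f2  x%.2f\n", slowdown(before->f2, after->f2));
	printf("p   x%.2f\n", slowdown(before->p,  after->p));
	printf("p2  x%.2f\n", slowdown(before->p2, after->p2));
}
```

Calling print_slowdowns(&prev_run, &parent_run) prints one ratio per test; the
"p" ratio comes out a little over 2.3, and the "p2" ratio is below 1.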

The attached patch adjusts the test program.

David
---
commit e9844e27f3061e4ef2d1511786b5ea60338dc610
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 21:14:58 2020 +0100

    Get parent ID instead

diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index fba40737f768..2bcde06ee78b 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -69,27 +69,27 @@ static void do_umount(void)
 		perror("umount");
 }
 
-static unsigned long sum_mnt_id;
+static unsigned long sum_check, sum_check_2;
 
-static void get_mntid_by_fsinfo(int ix, const char *path)
+static void get_id_by_fsinfo(int ix, const char *path)
 {
-	struct fsinfo_mount_info r;
+	struct fsinfo_mount_topology r;
 	struct fsinfo_params params = {
 		.flags		= FSINFO_FLAGS_QUERY_PATH,
-		.request	= FSINFO_ATTR_MOUNT_INFO,
+		.request	= FSINFO_ATTR_MOUNT_TOPOLOGY,
 	};
 
 	ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)),
 	    "fsinfo");
-	//printf("[%u] %u\n", ix, r.mnt_id);
-	sum_mnt_id += r.mnt_id;
+	sum_check += r.parent_id;
+	sum_check_2 += r.mnt_topology_changes;
 }
 
-static void get_mntid_by_proc(int ix, const char *path)
+static void get_id_by_proc(int ix, const char *path)
 {
-	unsigned int mnt_id;
+	unsigned int mnt_id, x;
 	ssize_t len;
-	char procfile[100], buffer[4096], *p, *nl;
+	char procfile[100], buffer[4096], *p, *q, *nl;
 	int fd, fd2;
 
 	fd = open(path, O_PATH);
@@ -130,12 +130,31 @@ static void get_mntid_by_proc(int ix, const char *path)
 		exit(3);
 	}
 
-	sum_mnt_id += mnt_id;
-	//printf("[%u] %u\n", ix, mnt_id);
+	/* Now look the ID up on mountfs */
+	sprintf(procfile, "/mnt/%u/parent", mnt_id);
+	fd = open(procfile, O_RDONLY);
+	ERR(fd, procfile);
+	len = read(fd, buffer, sizeof(buffer) - 1);
+	ERR(len, "read/parent");
+	close(fd);
+	if (len > 0 && buffer[len - 1] == '\n')
+		len--;
+	buffer[len] = 0;
+
+	x = strtoul(buffer, &q, 10);
+
+	if (*q) {
+		fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+		exit(3);
+	}
+
+	sum_check += x;
+	//printf("[%u] %u\n", ix, x);
 }
 
-static void get_mntid_by_fsinfo_2(void)
+static void get_id_by_fsinfo_2(void)
 {
+	struct fsinfo_mount_topology t;
 	struct fsinfo_mount_child *children, *c, *end;
 	struct fsinfo_mount_info r;
 	struct fsinfo_params params = {
@@ -171,15 +190,16 @@ static void get_mntid_by_fsinfo_2(void)
 	for (i = 0; c < end; c++, i++) {
 		//printf("[%u] %u\n", i, c->mnt_id);
 		params.flags	= FSINFO_FLAGS_QUERY_MOUNT;
-		params.request	= FSINFO_ATTR_MOUNT_INFO;
+		params.request	= FSINFO_ATTR_MOUNT_TOPOLOGY;
 		sprintf(name, "%u", c->mnt_id);
-		ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &r, sizeof(r)),
+		ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &t, sizeof(t)),
 		    "fsinfo/child");
-		sum_mnt_id += r.mnt_id;
+		sum_check += t.parent_id;
+		sum_check_2 += t.mnt_topology_changes;
 	}
 }
 
-static void get_mntid_by_mountfs(void)
+static void get_id_by_mountfs(void)
 {
 	unsigned int base_mnt_id, mnt_id, x;
 	ssize_t len, s_children;
@@ -260,11 +280,11 @@ static void get_mntid_by_mountfs(void)
 			comma++;
 		}
 
-		sprintf(procfile, "%u/id", mnt_id);
+		sprintf(procfile, "%u/parent", mnt_id);
 		fd = openat(mntfd, procfile, O_RDONLY);
 		ERR(fd, procfile);
 		len = read(fd, buffer, sizeof(buffer) - 1);
-		ERR(len, "read/id");
+		ERR(len, "read/parent");
 		close(fd);
 		if (len > 0 && buffer[len - 1] == '\n')
 			len--;
@@ -278,7 +298,7 @@ static void get_mntid_by_mountfs(void)
 		}
 
 		if (0) printf("[%u] %u\n", i++, x);
-		sum_mnt_id += x;
+		sum_check += x;
 	} while (p = comma, *comma);
 }
 
@@ -318,32 +338,32 @@ int main(int argc, char **argv)
 	iterate(make_mount);
 
 	printf("--- test fsinfo by path ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&f_before, NULL), "gettimeofday");
-	iterate(get_mntid_by_fsinfo);
+	iterate(get_id_by_fsinfo);
 	ERR(gettimeofday(&f_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test fsinfo by mnt_id ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&f2_before, NULL), "gettimeofday");
-	get_mntid_by_fsinfo_2();
+	get_id_by_fsinfo_2();
 	ERR(gettimeofday(&f2_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test /proc/fdinfo ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&p_before, NULL), "gettimeofday");
-	iterate(get_mntid_by_proc);
+	iterate(get_id_by_proc);
 	ERR(gettimeofday(&p_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	printf("--- test mountfs ---\n");
-	sum_mnt_id = 0;
+	sum_check = 0;
 	ERR(gettimeofday(&p2_before, NULL), "gettimeofday");
-	get_mntid_by_mountfs();
+	get_id_by_mountfs();
 	ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
-	printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+	printf("sum(mnt_id) = %lu\n", sum_check);
 
 	f_dur  = duration(&f_before,  &f_after);
 	f2_dur = duration(&f2_before, &f2_after);


* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 17:31 ` David Howells
                     ` (2 preceding siblings ...)
  2020-03-31 21:14   ` David Howells
@ 2020-03-31 21:23   ` David Howells
  3 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 21:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Linus Torvalds, Al Viro, dray, Karel Zak,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Lennart Poettering, keyrings, linux-fsdevel,
	linux-kernel

David Howells <dhowells@redhat.com> wrote:

> > So even the p2 method will give at least 80k queries/s, which is quite
> > good, considering that the need to rescan the complete mount tree
> > should be exceedingly rare (and in case it mattered, could be
> > optimized by priming from /proc/self/mountinfo).
> 
> One thing to note is that the test is actually a little biased in favour of
> the "p" test, where the mnt_id is looked up by path from /proc/fdinfo.  That's
> not all that useful, except as an index into mountfs.  I'm not sure how much
> use it is as a check on whether the mount is the same mount or not since mount
> IDs can get reused.

However, to deal with an overrun, you're going to have to read multiple
attributes.  So I've added an attribute file to expose the topology change
counter and it now reads that as well.

For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id
For 10000 mounts, f=24621us f2=20528us p=320164us p2=111416us <-- counter

Probably unsurprisingly, this doesn't affect fsinfo() significantly since I've
tried to expose the change counters in relevant places.  It does, however,
significantly affect mountfs because you seem to want every value to be
exposed through its own file.

Now this can be worked around by having files that bundle up several values
that are of interest to a particular operation (e.g. rescanning after a
notification queue overrun).
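To illustrate what such a bundled file might save, here is a minimal sketch
that parses two values out of a single read.  The "<parent_id> <counter>\n"
line format, the struct and the function name are all assumptions made for
this example; no such file exists in the patches in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pair of values a bundled mountfs file might carry. */
struct mount_bundle {
	unsigned int parent_id;
	unsigned int topology_changes;
};

/*
 * Parse "<parent_id> <counter>" with an optional trailing newline.
 * Returns 0 on success and -1 if the text is malformed.
 * The format is an assumption made for illustration only.
 */
int parse_mount_bundle(const char *text, struct mount_bundle *out)
{
	unsigned long parent, counter;
	char *end;

	assert(text && out);

	parent = strtoul(text, &end, 10);
	if (end == text || *end != ' ')
		return -1;		/* no number, or no separator */

	text = end + 1;
	counter = strtoul(text, &end, 10);
	if (end == text)
		return -1;		/* second value missing */

	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -1;		/* unexpected trailing data */

	out->parent_id = parent;
	out->topology_changes = counter;
	return 0;
}
```

With something like this, the test program would do one open/read/close per
mount instead of one per attribute.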

See the attached additional patch.  Note that the

	sum_check_2 += r.mnt_topology_changes;

bits in the fsinfo() tests accidentally got left in the preceding patch and so
aren't in this one.

David
---
commit 6c62787aec41f67c1d5a55a0d59578854bcef6f8
Author: David Howells <dhowells@redhat.com>
Date:   Tue Mar 31 21:53:11 2020 +0100

    Add a mountfs file to export the topology counter

diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
index 82c01eb6154d..58c05feb4fdd 100644
--- a/fs/mountfs/super.c
+++ b/fs/mountfs/super.c
@@ -22,7 +22,7 @@ struct mountfs_entry {
 
 static const char *mountfs_attrs[] = {
 	"root", "mountpoint", "id", "parent", "options", "children",
-	"group", "master", "propagate_from"
+	"group", "master", "propagate_from", "counter"
 };
 
 #define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
@@ -128,6 +128,8 @@ static int mountfs_attr_show(struct seq_file *sf, void *v)
 			if (tmp)
 				seq_printf(sf, "%i\n", tmp);
 		}
+	} else if (strcmp(name, "counter") == 0) {
+		seq_printf(sf, "%u\n", atomic_read(&mnt->mnt_topology_changes));
 	} else {
 		WARN_ON(1);
 		err = -EIO;
diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index 2bcde06ee78b..2b7606a53c2d 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -149,6 +149,26 @@ static void get_id_by_proc(int ix, const char *path)
 	}
 
 	sum_check += x;
+
+	/* And now the topology change counter */
+	sprintf(procfile, "/mnt/%u/counter", mnt_id);
+	fd = open(procfile, O_RDONLY);
+	ERR(fd, procfile);
+	len = read(fd, buffer, sizeof(buffer) - 1);
+	ERR(len, "read/counter");
+	close(fd);
+	if (len > 0 && buffer[len - 1] == '\n')
+		len--;
+	buffer[len] = 0;
+
+	x = strtoul(buffer, &q, 10);
+
+	if (*q) {
+		fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+		exit(3);
+	}
+
+	sum_check_2 += x;
 	//printf("[%u] %u\n", ix, x);
 }
 
@@ -204,7 +224,7 @@ static void get_id_by_mountfs(void)
 	unsigned int base_mnt_id, mnt_id, x;
 	ssize_t len, s_children;
 	char procfile[100], buffer[100], *children, *p, *q, *nl, *comma;
-	int fd, fd2, mntfd, i;
+	int fd, fd2, mntfd;
 
 	/* Start off by reading the mount ID from the base path */
 	fd = open(base_path, O_PATH);
@@ -269,7 +289,6 @@ static void get_id_by_mountfs(void)
 	p = children;
 	if (!*p)
 		return;
-	i = 0;
 	do {
 		mnt_id = strtoul(p, &comma, 10);
 		if (*comma) {
@@ -297,8 +316,26 @@ static void get_id_by_mountfs(void)
 			exit(3);
 		}
 
-		if (0) printf("[%u] %u\n", i++, x);
 		sum_check += x;
+
+		sprintf(procfile, "%u/counter", mnt_id);
+		fd = openat(mntfd, procfile, O_RDONLY);
+		ERR(fd, procfile);
+		len = read(fd, buffer, sizeof(buffer) - 1);
+		ERR(len, "read/counter");
+		close(fd);
+		if (len > 0 && buffer[len - 1] == '\n')
+			len--;
+		buffer[len] = 0;
+
+		x = strtoul(buffer, &q, 10);
+
+		if (*q) {
+			fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+			exit(3);
+		}
+
+		sum_check_2 += x;
 	} while (p = comma, *comma);
 }
 


* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
                   ` (5 preceding siblings ...)
  2020-03-31 17:31 ` David Howells
@ 2020-03-31 21:52 ` David Howells
  2020-04-01  9:04   ` Karel Zak
  2020-04-01 14:41   ` Lennart Poettering
  6 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 21:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, torvalds, viro, dray, kzak, mszeredi, swhiteho,
	jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel,
	lennart, cyphar

Christian Brauner <christian.brauner@ubuntu.com> wrote:

> querying all properties of a mount atomically all-at-once,

I don't actually offer that, per se.

Having an atomic all-at-once query for a single mount is actually quite a
burden on the system.  There's potentially a lot of state involved, much of
which you don't necessarily need.

I've tried to avoid the need to do that by adding change counters that can be
queried cheaply.  You read the counters, then you check mounts and superblocks
for which the counters have changed, and then you re-read the counters.  I've
added multiple counters, assigned to different purposes, to make it easier to
pin down what has changed - and so reduce the amount of checking required.
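The read, check, re-read cycle described above can be sketched roughly as
follows.  This is only an illustration of the loop shape: the callbacks, the
simulated counter table and all the names are hypothetical, and a real user
would fetch the counters with fsinfo() rather than from an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks: fetch a change counter, re-examine one object. */
typedef unsigned int (*read_counter_fn)(size_t i, void *ctx);
typedef void (*recheck_fn)(size_t i, void *ctx);

/*
 * Bring cached[] up to date with the live counters.  Objects whose counter
 * is unchanged are skipped.  The counter is recorded before the recheck, so
 * a change that lands during the recheck is seen on the next pass.  Loops
 * until one full pass sees no change; returns the number of rechecks done.
 */
size_t rescan_changed(unsigned int *cached, size_t n,
		      read_counter_fn read_counter, recheck_fn recheck,
		      void *ctx)
{
	size_t i, rechecks = 0;
	int changed;

	assert(cached && read_counter && recheck);

	do {
		changed = 0;
		for (i = 0; i < n; i++) {
			unsigned int now = read_counter(i, ctx);

			if (now == cached[i])
				continue;	/* nothing to do here */
			cached[i] = now;
			recheck(i, ctx);
			rechecks++;
			changed = 1;
		}
	} while (changed);

	return rechecks;
}

/* Simulated state standing in for the kernel, for demonstration only. */
struct sim {
	unsigned int live[4];		/* current counters */
	unsigned int rechecked[4];	/* how often each slot was rechecked */
};

unsigned int sim_read(size_t i, void *ctx)
{
	return ((struct sim *)ctx)->live[i];
}

void sim_recheck(size_t i, void *ctx)
{
	((struct sim *)ctx)->rechecked[i]++;
}
```

If only one counter differs from the cached copy, only that one object is
rechecked, which is the saving the counters are meant to provide.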

What I have added to fsinfo() is a way to atomically retrieve a list of all
the children of a mount, including, for each mount, the mount ID (which may
have been reused), a uniquifier (which shouldn't wrap over the kernel
lifetime) and the sum of the mount object and superblock change counters.
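As a sketch of how a rescanner might use those three values, the following
compares a cached child record with a freshly listed one.  The record layout
here is hypothetical and is not the real struct fsinfo_mount_child; the enum
and function names are likewise made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-child record; not the real fsinfo_mount_child layout. */
struct child_rec {
	uint32_t mnt_id;	/* may be reused by a later mount */
	uint64_t uniquifier;	/* not expected to repeat */
	uint32_t change_sum;	/* sum of mount and superblock counters */
};

enum child_state {
	CHILD_UNCHANGED,	/* same mount, nothing to re-read */
	CHILD_CHANGED,		/* same mount, attributes need re-reading */
	CHILD_REPLACED,		/* mount ID reused by a different mount */
};

/* Classify a child by comparing the cached record with the new listing. */
enum child_state classify_child(const struct child_rec *cached,
				const struct child_rec *now)
{
	assert(cached && now && cached->mnt_id == now->mnt_id);

	if (cached->uniquifier != now->uniquifier)
		return CHILD_REPLACED;
	if (cached->change_sum != now->change_sum)
		return CHILD_CHANGED;
	return CHILD_UNCHANGED;
}
```

The uniquifier is checked first because a reused mount ID means that nothing
cached about the old mount can be trusted, whatever the counters say.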

This should allow you to quickly rescan the mount tree as fsinfo() can look up
mounts by mount ID instead of by path or fd.

Below is a sample file from the kernel that scans by this method, displaying
an ascii art tree of all the mounts under a path or mount.

David
---
// SPDX-License-Identifier: GPL-2.0-or-later
/* Test the fsinfo() system call
 *
 * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 */

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <errno.h>
#include <time.h>
#include <math.h>
#include <sys/syscall.h>
#include <linux/fsinfo.h>
#include <linux/socket.h>
#include <linux/fcntl.h>
#include <sys/stat.h>
#include <arpa/inet.h>

#ifndef __NR_fsinfo
#define __NR_fsinfo -1
#endif

static __attribute__((unused))
ssize_t fsinfo(int dfd, const char *filename,
	       struct fsinfo_params *params, size_t params_size,
	       void *result_buffer, size_t result_buf_size)
{
	return syscall(__NR_fsinfo, dfd, filename,
		       params, params_size,
		       result_buffer, result_buf_size);
}

static char tree_buf[4096];
static char bar_buf[4096];
static unsigned int children_list_interval;

/*
 * Get an fsinfo attribute in a statically allocated buffer.
 */
static void get_attr(unsigned int mnt_id, unsigned int attr, unsigned int Nth,
		     void *buf, size_t buf_size)
{
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
		.request	= attr,
		.Nth		= Nth,
	};
	char file[32];
	long ret;

	sprintf(file, "%u", mnt_id);

	memset(buf, 0xbd, buf_size);

	ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), buf, buf_size);
	if (ret == -1) {
		fprintf(stderr, "mount-%s: %m\n", file);
		exit(1);
	}
}

/*
 * Get an fsinfo attribute in a dynamically allocated buffer.
 */
static void *get_attr_alloc(unsigned int mnt_id, unsigned int attr,
			    unsigned int Nth, size_t *_size)
{
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
		.request	= attr,
		.Nth		= Nth,
	};
	size_t buf_size = 4096;
	char file[32];
	void *r;
	long ret;

	sprintf(file, "%u", mnt_id);

	for (;;) {
		r = malloc(buf_size);
		if (!r) {
			perror("malloc");
			exit(1);
		}
		memset(r, 0xbd, buf_size);

		ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), r, buf_size);
		if (ret == -1) {
			fprintf(stderr, "mount-%s: %x,%x,%x %m\n",
				file, params.request, params.Nth, params.Mth);
			exit(1);
		}

		if (ret <= buf_size) {
			*_size = ret;
			break;
		}
		/* Too small: release this buffer before retrying with a larger one */
		free(r);
		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
	}

	return r;
}

/*
 * Display a mount and then recurse through its children.
 */
static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
{
	struct fsinfo_mount_topology top;
	struct fsinfo_mount_child child;
	struct fsinfo_mount_info info;
	struct fsinfo_ids ids;
	void *children;
	unsigned int d;
	size_t ch_size, p_size;
	char dev[64];
	int i, n, s;

	get_attr(mnt_id, FSINFO_ATTR_MOUNT_TOPOLOGY, 0, &top, sizeof(top));
	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, 0, &info, sizeof(info));
	get_attr(mnt_id, FSINFO_ATTR_IDS, 0, &ids, sizeof(ids));
	if (depth > 0)
		printf("%s", tree_buf);

	s = strlen(path);
	printf("%s", !s ? "\"\"" : path);
	if (!s)
		s += 2;
	s += depth;
	if (s < 38)
		s = 38 - s;
	else
		s = 1;
	printf("%*.*s", s, s, "");

	sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
	printf("%10u %8x %2x %x %5s %s",
	       info.mnt_id,
	       (info.sb_changes +
		info.sb_notifications +
		info.mnt_attr_changes +
		info.mnt_topology_changes +
		info.mnt_subtree_notifications),
	       info.attr, top.propagation,
	       dev, ids.f_fs_name);
	putchar('\n');

	children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
	n = ch_size / children_list_interval - 1;

	bar_buf[depth + 1] = '|';
	if (depth > 0) {
		tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
		tree_buf[depth - 4 + 2] = ' ';
	}

	tree_buf[depth + 0] = ' ';
	tree_buf[depth + 1] = '\\';
	tree_buf[depth + 2] = '_';
	tree_buf[depth + 3] = ' ';
	tree_buf[depth + 4] = 0;
	d = depth + 4;

	memset(&child, 0, sizeof(child));
	for (i = 0; i < n; i++) {
		void *p = children + i * children_list_interval;

		if (sizeof(child) >= children_list_interval)
			memcpy(&child, p, children_list_interval);
		else
			memcpy(&child, p, sizeof(child));

		if (i == n - 1)
			bar_buf[depth + 1] = ' ';
		path = get_attr_alloc(child.mnt_id, FSINFO_ATTR_MOUNT_POINT,
				      0, &p_size);
		display_mount(child.mnt_id, d, path + 1);
		free(path);
	}

	free(children);
	if (depth > 0) {
		tree_buf[depth - 4 + 1] = '\\';
		tree_buf[depth - 4 + 2] = '_';
	}
	tree_buf[depth] = 0;
}

/*
 * Find the ID of whatever is at the nominated path.
 */
static unsigned int lookup_mnt_by_path(const char *path)
{
	struct fsinfo_mount_info mnt;
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_MOUNT_INFO,
	};

	if (fsinfo(AT_FDCWD, path, &params, sizeof(params), &mnt, sizeof(mnt)) == -1) {
		perror(path);
		exit(1);
	}

	return mnt.mnt_id;
}

/*
 * Determine the element size for the mount child list.
 */
static unsigned int query_list_element_size(int mnt_id, unsigned int attr)
{
	struct fsinfo_attribute_info attr_info;

	get_attr(mnt_id, FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, attr,
		 &attr_info, sizeof(attr_info));
	return attr_info.size;
}

/*
 * Display the mount tree below a path or a mount ID.
 */
int main(int argc, char **argv)
{
	unsigned int mnt_id;
	char *path;
	bool use_mnt_id = false;
	int opt;

	while ((opt = getopt(argc, argv, "m")) != -1) {
		switch (opt) {
		case 'm':
			use_mnt_id = true;
			continue;
		}
		break;
	}

	argc -= optind;
	argv += optind;

	switch (argc) {
	case 0:
		mnt_id = lookup_mnt_by_path("/");
		path = "ROOT";
		break;
	case 1:
		path = argv[0];
		if (use_mnt_id) {
			mnt_id = strtoul(argv[0], NULL, 0);
			break;
		}

		mnt_id = lookup_mnt_by_path(argv[0]);
		break;
	default:
		printf("Format: test-mntinfo\n");
		printf("Format: test-mntinfo <path>\n");
		printf("Format: test-mntinfo -m <mnt_id>\n");
		exit(2);
	}

	children_list_interval =
		query_list_element_size(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN);

	printf("MOUNT                                 MOUNT ID   CHANGE#  AT P DEV   TYPE\n");
	printf("------------------------------------- ---------- -------- -- - ----- --------\n");
	display_mount(mnt_id, 0, path);
	return 0;
}

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  5:11   ` Miklos Szeredi
  2020-03-31  8:15     ` Christian Brauner
  2020-03-31  8:34     ` Karel Zak
@ 2020-03-31 21:54     ` David Howells
  2020-04-01  8:43       ` Karel Zak
  2 siblings, 1 reply; 99+ messages in thread
From: David Howells @ 2020-03-31 21:54 UTC (permalink / raw)
  To: Karel Zak
  Cc: dhowells, Miklos Szeredi, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Lennart Poettering, Aleksa Sarai

Karel Zak <kzak@redhat.com> wrote:

> - improve fsinfo() to provide set (list) of the attributes by one call

That would be my preferred way.  I wouldn't want to let the user pin copies of
state, and I wouldn't want to make open(O_PATH) do it automatically.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31  8:56       ` Miklos Szeredi
  2020-03-31  9:49         ` Karel Zak
  2020-03-31 12:25         ` Lennart Poettering
@ 2020-03-31 21:56         ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-03-31 21:56 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

Lennart Poettering <mzxreary@0pointer.de> wrote:

> - We also have code that needs to check if /dev/ is plain tmpfs or
>   devtmpfs. We cannot use statfs for that, since in both cases
>   TMPFS_MAGIC is reported, hence we currently parse
>   /proc/self/mountinfo for that to find the fstype string there, which
>   is different for both cases.

btw, fsinfo(FSINFO_ATTR_IDS) gets you the name of the filesystem type in
addition to the magic number.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 21:54     ` David Howells
@ 2020-04-01  8:43       ` Karel Zak
  0 siblings, 0 replies; 99+ messages in thread
From: Karel Zak @ 2020-04-01  8:43 UTC (permalink / raw)
  To: David Howells
  Cc: Miklos Szeredi, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Tue, Mar 31, 2020 at 10:54:23PM +0100, David Howells wrote:
> Karel Zak <kzak@redhat.com> wrote:
> 
> > - improve fsinfo() to provide set (list) of the attributes by one call
> 
> That would be my preferred way.  I wouldn't want to let the user pin copies of
> state, and I wouldn't want to make open(O_PATH) do it automatically.

You could create a COW object on the first fsinfo() call, ideally with
some flags to control this behavior -- but you're right, this way is
complicated to implement and possibly dangerous.

I guess returning a vector of attributes from one fsinfo() call will be
good enough.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 21:52 ` David Howells
@ 2020-04-01  9:04   ` Karel Zak
  2020-04-01 13:34     ` Miklos Szeredi
                       ` (2 more replies)
  2020-04-01 14:41   ` Lennart Poettering
  1 sibling, 3 replies; 99+ messages in thread
From: Karel Zak @ 2020-04-01  9:04 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, torvalds, viro, dray, mszeredi, swhiteho,
	jlayton, raven, andres, keyrings, linux-fsdevel, linux-kernel,
	lennart, cyphar

On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> Christian Brauner <christian.brauner@ubuntu.com> wrote:
> 
> > querying all properties of a mount atomically all-at-once,
> 
> I don't actually offer that, per se.
> 
> Having an atomic all-at-once query for a single mount is actually quite a
> burden on the system.  There's potentially a lot of state involved, much of
> which you don't necessarily need.

If all means "all possible attributes" then it is unnecessary; for
example, ext4 timestamps or volume uuid/label are rarely necessary.
We usually need the following together (as a consistent set):

    source
    mountpoint
    FS type
    FS root (FSINFO_ATTR_MOUNT_PATH)
    FS options (FSINFO_ATTR_CONFIGURATION)
    VFS attributes
    VFS propagation flags
    mount ID
    parent ID
    devno (or maj:min)

 Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01  9:04   ` Karel Zak
@ 2020-04-01 13:34     ` Miklos Szeredi
  2020-04-01 13:55     ` David Howells
  2020-04-01 13:58     ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-01 13:34 UTC (permalink / raw)
  To: Karel Zak
  Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]

On Wed, Apr 1, 2020 at 11:05 AM Karel Zak <kzak@redhat.com> wrote:
>
> On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> wrote:
> >
> > > querying all properties of a mount atomically all-at-once,
> >
> > I don't actually offer that, per se.
> >
> > Having an atomic all-at-once query for a single mount is actually quite a
> > burden on the system.  There's potentially a lot of state involved, much of
> > which you don't necessarily need.
>
> If all means "all possible attributes" then it is unnecessary; for
> example, ext4 timestamps or volume uuid/label are rarely necessary.
> We usually need the following together (as a consistent set):
>
>     source
>     mountpoint
>     FS type
>     FS root (FSINFO_ATTR_MOUNT_PATH)
>     FS options (FSINFO_ATTR_CONFIGURATION)
>     VFS attributes
>     VFS propagation flags
>     mount ID
>     parent ID
>     devno (or maj:min)

This is trivial with mountfs (reuse format of /proc/PID/mountinfo):

# cat /mnt/30/info
30 20 0:14 / /mnt rw,relatime - mountfs none rw
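
A line in this format can be split into the fields from Karel's list with a
few lines of C.  The following is only a sketch, assuming the field layout
documented for /proc/PID/mountinfo; the struct and function names are
hypothetical and do not come from the attached patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one mountinfo-format line. */
struct mountinfo_entry {
	unsigned int mnt_id, parent_id;
	unsigned int dev_major, dev_minor;
	char root[256], mountpoint[256], mnt_opts[256];
	char fstype[64], source[256], sb_opts[256];
};

/*
 * Parse one mountinfo-format line.  Returns 0 on success, -1 if the line
 * is malformed.  Paths are left as-is (octal escapes such as \040 are not
 * decoded), and the optional tagged fields are skipped.
 */
static int parse_mountinfo_line(const char *line, struct mountinfo_entry *e)
{
	const char *sep;

	assert(line != NULL && e != NULL);

	/* Fixed fields that precede the optional tagged fields. */
	if (sscanf(line, "%u %u %u:%u %255s %255s %255s",
		   &e->mnt_id, &e->parent_id, &e->dev_major, &e->dev_minor,
		   e->root, e->mountpoint, e->mnt_opts) != 7)
		return -1;

	/* The optional fields are terminated by a lone "-" separator. */
	sep = strstr(line, " - ");
	if (!sep)
		return -1;

	/* After the separator: fs type, mount source, superblock options. */
	if (sscanf(sep + 3, "%63s %255s %255s",
		   e->fstype, e->source, e->sb_opts) != 3)
		return -1;
	return 0;
}
```

For the example line above this yields mount ID 30, parent ID 20, devno 0:14,
root "/", mountpoint "/mnt", FS type "mountfs" and source "none".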

Attached patch applies against readfile patch.

We might want something more generic, and it won't get any simpler:

 mount.h          |    1 +
 mountfs/super.c  |   12 +++++++++++-
 proc_namespace.c |    2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

Thanks,
Miklos

[-- Attachment #2: mountfs-info.patch --]
[-- Type: text/x-patch, Size: 1769 bytes --]

---
 fs/mount.h          |    1 +
 fs/mountfs/super.c  |   12 +++++++++++-
 fs/proc_namespace.c |    2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

--- a/fs/mount.h
+++ b/fs/mount.h
@@ -186,3 +186,4 @@ void mountfs_create(struct mount *mnt);
 extern void mountfs_remove(struct mount *mnt);
 int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
 
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt);
--- a/fs/mountfs/super.c
+++ b/fs/mountfs/super.c
@@ -22,7 +22,7 @@ struct mountfs_entry {
 
 static const char *mountfs_attrs[] = {
 	"root", "mountpoint", "id", "parent", "options", "children",
-	"group", "master", "propagate_from", "counter"
+	"group", "master", "propagate_from", "counter", "info"
 };
 
 #define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
@@ -126,11 +126,21 @@ static int mountfs_attr_show(struct seq_
 		if (IS_MNT_SLAVE(mnt)) {
 			get_fs_root(current->fs, &root);
 			tmp = get_dominating_id(mnt, &root);
+			path_put(&root);
 			if (tmp)
 				seq_printf(sf, "%i\n", tmp);
 		}
 	} else if (strcmp(name, "counter") == 0) {
 		seq_printf(sf, "%u\n", atomic_read(&mnt->mnt_topology_changes));
+	} else if (strcmp(name, "info") == 0) {
+		struct proc_mounts p = {};
+
+		WARN_ON(sf->private);
+		sf->private = &p;
+		get_fs_root(current->fs, &p.root);
+		err = show_mountinfo(sf, m);
+		path_put(&p.root);
+		sf->private = NULL;
 	} else {
 		WARN_ON(1);
 		err = -EIO;
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -110,7 +110,7 @@ static int show_vfsmnt(struct seq_file *
 	return err;
 }
 
-static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 {
 	struct proc_mounts *p = m->private;
 	struct mount *r = real_mount(mnt);

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01  9:04   ` Karel Zak
  2020-04-01 13:34     ` Miklos Szeredi
@ 2020-04-01 13:55     ` David Howells
  2020-04-01 13:58     ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-01 13:55 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent,
	andres, keyrings, linux-fsdevel, linux-kernel,
	Lennart Poettering, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Attached patch applies against readfile patch.

But doesn't actually do what Karel asked for.  show_mountinfo() itself does
not give you what Karel asked for.  Plus there's more information you need to
add to it.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01  9:04   ` Karel Zak
  2020-04-01 13:34     ` Miklos Szeredi
  2020-04-01 13:55     ` David Howells
@ 2020-04-01 13:58     ` David Howells
  2020-04-01 15:25       ` Miklos Szeredi
  2020-04-01 16:01       ` David Howells
  2 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-04-01 13:58 UTC (permalink / raw)
  Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel,
	linux-kernel, Lennart Poettering, Aleksa Sarai

David Howells <dhowells@redhat.com> wrote:

> > Attached patch applies against readfile patch.
> 
> But doesn't actually do what Karel asked for.  show_mountinfo() itself does
> not give you what Karel asked for.  Plus there's more information you need to
> add to it.

And arguably, it's worse than just reading /proc/mounts.  If you get a
notification that something changed (i.e. you poll /proc/mounts or mount
notifications give you an overrun) you now have to read *every*
/mountfs/*/info file.  That is way more expensive.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-03-31 21:52 ` David Howells
  2020-04-01  9:04   ` Karel Zak
@ 2020-04-01 14:41   ` Lennart Poettering
  2020-04-01 15:33     ` Miklos Szeredi
  2020-04-01 16:06     ` David Howells
  1 sibling, 2 replies; 99+ messages in thread
From: Lennart Poettering @ 2020-04-01 14:41 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, torvalds, viro, dray, kzak, mszeredi,
	swhiteho, jlayton, raven, andres, keyrings, linux-fsdevel,
	linux-kernel, cyphar

On Tue, 31.03.20 22:52, David Howells (dhowells@redhat.com) wrote:

> Christian Brauner <christian.brauner@ubuntu.com> wrote:
>
> > querying all properties of a mount atomically all-at-once,
>
> I don't actually offer that, per se.
>
> Having an atomic all-at-once query for a single mount is actually quite a
> burden on the system.  There's potentially a lot of state involved, much of
> which you don't necessarily need.

Hmm, do it like with statx() and specify a mask for the fields userspace
wants? Then it would be as lightweight as it possibly could be?

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 13:58     ` David Howells
@ 2020-04-01 15:25       ` Miklos Szeredi
  2020-04-03  9:11         ` Karel Zak
  2020-04-01 16:01       ` David Howells
  1 sibling, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-01 15:25 UTC (permalink / raw)
  To: David Howells
  Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 1653 bytes --]

On Wed, Apr 1, 2020 at 3:58 PM David Howells <dhowells@redhat.com> wrote:
>
> David Howells <dhowells@redhat.com> wrote:
>
> > > Attached patch applies against readfile patch.
> >
> > But doesn't actually do what Karel asked for.  show_mountinfo() itself does
> > not give you what Karel asked for.

Not sure what you mean.  I think it shows precisely the information
Karel asked for.

>  Plus there's more information you need to
> > add to it.

The mountinfo format is extensible (see
Documentation/filesystems/proc.txt) so for example adding the change
counters would be simple.

> And arguably, it's worse than just reading /proc/mounts.  If you get a
> notification that something changed (i.e. you poll /proc/mounts or mount
> notifications give you an overrun) you now have to read *every*
> /mountfs/*/info file.  That is way more expensive.

fsinfo(2) will never be substantially cheaper than reading and parsing
/mnt/MNT_ID/info.  In fact reading a large part of the mount table
using fsinfo(2) will be substantially slower than parsing
/proc/self/mountinfo (this doesn't actually do the parsing but that
would add a very small amount of overhead):

root@kvm:~# ./test-fsinfo-perf /tmp/a 30000
--- make mounts ---
--- test fsinfo by path ---
sum(mnt_id) = 960000
--- test fsinfo by mnt_id ---
sum(mnt_id) = 960000
--- test /proc/fdinfo ---
sum(mnt_id) = 960000
--- test mountfs ---
sum(mnt_id) = 960000
--- test mountinfo ---
sum(mnt_id) = 960000
For   30000 mounts, f=    154963us f2=    148337us p=   1803699us p2=    257019us; m=     53996us; p=11.6*f p=12.2*f2 p=7.0*p2 p=33.4*m
--- umount ---

Yes, that's 33 times faster!
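
The ratios in that summary line can be re-derived from the raw durations.
The sketch below assumes that f, f2, p, p2 and m correspond, in order, to the
five tests listed in the output (fsinfo by path, fsinfo by mnt_id,
/proc/fdinfo, mountfs, mountinfo); the helper name speedup() is made up for
illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: ratio of two durations given in microseconds,
 * computed the same way as the "p=...*m" figures printed by the test.
 */
static double speedup(unsigned long slower_us, unsigned long faster_us)
{
	assert(faster_us != 0);
	return (double)slower_us / (double)faster_us;
}
```

With the numbers above, speedup(1803699, 53996) is about 33.4, which is the
printed p/m ratio, while speedup(154963, 53996) is about 2.9 and
speedup(148337, 53996) is about 2.7 for the two fsinfo variants.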

Thanks,
Miklos

[-- Attachment #2: test-fsinfo-perf-mountinfo.patch --]
[-- Type: text/x-patch, Size: 3709 bytes --]

---
 samples/vfs/test-fsinfo-perf.c |   91 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 4 deletions(-)

--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -339,6 +339,79 @@ static void get_id_by_mountfs(void)
 	} while (p = comma, *comma);
 }
 
+static void get_id_by_mountinfo(void)
+{
+	unsigned int base_mnt_id, mnt_id, x;
+	ssize_t len;
+	char procfile[100], buffer[4096], *p, *nl;
+	int fd, fd2, mntfd;
+
+	/* Start off by reading the mount ID from the base path */
+	fd = open(base_path, O_PATH);
+	ERR(fd, "open/path");
+	sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+	fd2 = open(procfile, O_RDONLY);
+	ERR(fd2, "open/proc");
+	len = read(fd2, buffer, sizeof(buffer) - 1);
+	ERR(len, "read");
+	buffer[len] = 0;
+	close(fd2);
+	close(fd);
+
+	p = buffer;
+	do {
+		nl = strchr(p, '\n');
+		if (nl)
+			*nl++ = '\0';
+		else
+			nl = NULL;
+
+		if (strncmp(p, "mnt_id:", 7) != 0)
+			continue;
+		p += 7;
+		while (isblank(*p))
+			p++;
+		/* Have to allow for extra numbers being added to the line */
+		if (sscanf(p, "%u", &base_mnt_id) != 1) {
+			fprintf(stderr, "Bad format %s\n", procfile);
+			exit(3);
+		}
+		break;
+
+	} while ((p = nl));
+
+	if (!p) {
+		fprintf(stderr, "Missing field %s\n", procfile);
+		exit(3);
+	}
+
+	if (0) printf("[B] %u\n", base_mnt_id);
+
+	mntfd = open("/proc/self/mountinfo", O_RDONLY);
+	ERR(mntfd, "/proc/self/mountinfo");
+
+	while ((len = read(mntfd, buffer, sizeof(buffer)))) {
+		ERR(len, "read/mountinfo");
+
+		for (p = buffer; p < buffer + len; p = nl + 1) {
+			nl = strchr(p, '\n');
+			if (!nl) {
+				fprintf(stderr, "error parsing mountinfo\n");
+				exit(3);
+			}
+			*nl = '\0';
+			if (sscanf(p, "%i %i", &mnt_id, &x) != 2) {
+				fprintf(stderr, "error parsing mountinfo\n");
+				exit(3);
+			}
+			if (x == base_mnt_id)
+				sum_check += x;
+		}
+	}
+
+	close(mntfd);
+}
+
 static unsigned long duration(struct timeval *before, struct timeval *after)
 {
 	unsigned long a, b;
@@ -354,8 +427,9 @@ int main(int argc, char **argv)
 	struct timeval f2_before, f2_after;
 	struct timeval p_before, p_after;
 	struct timeval p2_before, p2_after;
+	struct timeval m_before, m_after;
 	const char *path;
-	unsigned long f_dur, f2_dur, p_dur, p2_dur;
+	unsigned long f_dur, f2_dur, p_dur, p2_dur, m_dur;
 
 	if (argc < 2) {
 		fprintf(stderr, "Format: %s <path> [nr_mounts]\n", argv[0]);
@@ -402,17 +476,26 @@ int main(int argc, char **argv)
 	ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
 	printf("sum(mnt_id) = %lu\n", sum_check);
 
+	printf("--- test mountinfo ---\n");
+	sum_check = 0;
+	ERR(gettimeofday(&m_before, NULL), "gettimeofday");
+	get_id_by_mountinfo();
+	ERR(gettimeofday(&m_after, NULL), "gettimeofday");
+	printf("sum(mnt_id) = %lu\n", sum_check);
+
 	f_dur  = duration(&f_before,  &f_after);
 	f2_dur = duration(&f2_before, &f2_after);
 	p_dur  = duration(&p_before,  &p_after);
 	p2_dur = duration(&p2_before, &p2_after);
+	m_dur  = duration(&m_before,  &m_after);
 	//printf("fsinfo duration %10luus for %d mounts\n", f_dur, nr_mounts);
 	//printf("procfd duration %10luus for %d mounts\n", p_dur, nr_mounts);
 
-	printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2\n",
-	       nr_mounts, f_dur, f2_dur, p_dur, p2_dur,
+	printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; m=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2 p=%.1f*m\n",
+	       nr_mounts, f_dur, f2_dur, p_dur, p2_dur, m_dur,
 	       (double)p_dur / (double)f_dur,
 	       (double)p_dur / (double)f2_dur,
-	       (double)p_dur / (double)p2_dur);
+	       (double)p_dur / (double)p2_dur,
+	       (double)p_dur / (double)m_dur);
 	return 0;
 }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 14:41   ` Lennart Poettering
@ 2020-04-01 15:33     ` Miklos Szeredi
  2020-04-01 16:06     ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-01 15:33 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Wed, Apr 1, 2020 at 4:41 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Tue, 31.03.20 22:52, David Howells (dhowells@redhat.com) wrote:
>
> > Christian Brauner <christian.brauner@ubuntu.com> wrote:
> >
> > > querying all properties of a mount atomically all-at-once,
> >
> > I don't actually offer that, per se.
> >
> > Having an atomic all-at-once query for a single mount is actually quite a
> > burden on the system.  There's potentially a lot of state involved, much of
> > which you don't necessarily need.
>
> Hmm, do it like with statx() and specify a mask for the fields userspace
> wants? Then it would be as lightweight as it possibly could be?

Yes, however binary structures mixed with variable length fields are
not going to be pretty.

Again, if we want something even halfway sane for a syscall interface,
go with a string key/value vector.

If that's really needed.  I've still not heard a convincing argument
in favor of a syscall.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 13:58     ` David Howells
  2020-04-01 15:25       ` Miklos Szeredi
@ 2020-04-01 16:01       ` David Howells
  2020-04-01 16:30         ` Miklos Szeredi
                           ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: David Howells @ 2020-04-01 16:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent,
	andres, keyrings, linux-fsdevel, linux-kernel,
	Lennart Poettering, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > But doesn't actually do what Karel asked for.  show_mountinfo() itself does
> > > not give you what Karel asked for.
> 
> Not sure what you mean.  I think it shows precisely the information
> Karel asked for.

It's not atomic.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 14:41   ` Lennart Poettering
  2020-04-01 15:33     ` Miklos Szeredi
@ 2020-04-01 16:06     ` David Howells
  2020-04-01 16:40       ` Miklos Szeredi
  1 sibling, 1 reply; 99+ messages in thread
From: David Howells @ 2020-04-01 16:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Lennart Poettering, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> I've still not heard a convincing argument in favor of a syscall.

From your own results, scanning 10000 mounts through mountfs and reading just
two values from each is an order of magnitude slower without the effect of the
dentry/inode caches.  It gets faster on the second run because the mountfs
dentries and inodes are cached - but at a cost of >205MiB of RAM.  And it's
*still* slower than fsinfo().

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 16:01       ` David Howells
@ 2020-04-01 16:30         ` Miklos Szeredi
  2020-04-02 15:22         ` David Howells
  2020-04-02 15:24         ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-01 16:30 UTC (permalink / raw)
  To: David Howells
  Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Wed, Apr 1, 2020 at 6:02 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > > > But doesn't actually do what Karel asked for.  show_mountinfo() itself does
> > > > not give you what Karel asked for.
> >
> > Not sure what you mean.  I think it shows precisely the information
> > Karel asked for.
>
> It's not atomic.

Yes it is.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 16:06     ` David Howells
@ 2020-04-01 16:40       ` Miklos Szeredi
  2020-04-02  2:52         ` Ian Kent
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-01 16:40 UTC (permalink / raw)
  To: David Howells
  Cc: Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	Ian Kent, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > I've still not heard a convincing argument in favor of a syscall.
>
> From your own results, scanning 10000 mounts through mountfs and reading just
> two values from each is an order of magnitude slower without the effect of the
> dentry/inode caches.  It gets faster on the second run because the mountfs
> dentries and inodes are cached - but at a cost of >205MiB of RAM.  And it's
> *still* slower than fsinfo().

Already told you that we can just delete the dentry on dput_final, so
the memory argument is immaterial.

And the speed argument also, because there's no use case where that
would make a difference.  You keep bringing up the notification queue
overrun when watching a subtree, but that's going to be painful with
fsinfo(2) as well.   If that's a relevant use case (not saying it's
true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
that contains all information for the subtree.  Have fun implementing
that with fsinfo(2).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 16:40       ` Miklos Szeredi
@ 2020-04-02  2:52         ` Ian Kent
  2020-04-02 13:52           ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Ian Kent @ 2020-04-02  2:52 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Lennart Poettering, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai

On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com>
> wrote:
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > I've still not heard a convincing argument in favor of a syscall.
> > 
> > From your own results, scanning 10000 mounts through mountfs and
> > reading just
> > two values from each is an order of magnitude slower without the
> > effect of the
> > dentry/inode caches.  It gets faster on the second run because the
> > mountfs
> > dentries and inodes are cached - but at a cost of >205MiB of
> > RAM.  And it's
> > *still* slower than fsinfo().
> 
> Already told you that we can just delete the dentry on dput_final, so
> the memory argument is immaterial.
> 
> And the speed argument also, because there's no use case where that
> would make a difference.  You keep bringing up the notification queue
> overrun when watching a subtree, but that's going to be painful with
> fsinfo(2) as well.   If that's a relevant use case (not saying it's
> true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
> that contains all information for the subtree.  Have fun implementing
> that with fsinfo(2).

Forgive me for not trawling through your patch to work this out,
but how does a poll on a path get what's needed to get mount info?

Or, more specifically, how does one get what's needed to go directly
to the place to get mount info when something in the tree under the
polled path changes (mount/umount)? IIUC poll alone won't do subtree
change monitoring?

Don't get me wrong, neither the proc nor the fsinfo implementations
deal with the notification storms that cause much of the problem we
see now.

IMHO that's a separate and very difficult problem in itself that
can't even be considered until getting the information efficiently
is resolved.

Ian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02  2:52         ` Ian Kent
@ 2020-04-02 13:52           ` Miklos Szeredi
  2020-04-02 14:36             ` Lennart Poettering
  2020-04-03  1:44             ` Ian Kent
  0 siblings, 2 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-02 13:52 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, Lennart Poettering, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <raven@themaw.net> wrote:
>
> On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> > On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com>
> > wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > I've still not heard a convincing argument in favor of a syscall.
> > >
> > > From your own results, scanning 10000 mounts through mountfs and
> > > reading just
> > > two values from each is an order of magnitude slower without the
> > > effect of the
> > > dentry/inode caches.  It gets faster on the second run because the
> > > mountfs
> > > dentries and inodes are cached - but at a cost of >205MiB of
> > > RAM.  And it's
> > > *still* slower than fsinfo().
> >
> > Already told you that we can just delete the dentry on dput_final, so
> > the memory argument is immaterial.
> >
> > And the speed argument also, because there's no use case where that
> > would make a difference.  You keep bringing up the notification queue
> > overrun when watching a subtree, but that's going to be painful with
> > fsinfo(2) as well.   If that's a relevant use case (not saying it's
> > true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
> > that contains all information for the subtree.  Have fun implementing
> > that with fsinfo(2).
>
> Forgive me for not trawling through your patch to work this out,
> but how does a poll on a path get what's needed to get mount info?
>
> Or, more specifically, how does one get what's needed to go directly
> to the place to get mount info when something in the tree under the
> polled path changes (mount/umount)? IIUC poll alone won't do subtree
> change monitoring?

The mechanisms are basically the same as with fsinfo(2).   You can get
to the mountfs entry through the mount ID or through a proc/fd/ type
symlink.  So if you have a path, there are two options:

 - find out the mount ID belonging to that path and go to /mountfs/$mntid/
 - open the path with fd = open(path, O_PATH) and then go to
   /proc/self/fdmount/$fd/

Currently the only way to find the mount id from a path is by parsing
/proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
return it directly from a path.   Also the mount notification queue
that David implemented contains the mount ID of the changed mount.
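
The fdinfo route mentioned above can be sketched like this.  It is only an
illustration, assuming /proc is mounted and that the fdinfo file contains a
"mnt_id:" line; the helper name mnt_id_from_path() is hypothetical.

```c
#define _GNU_SOURCE		/* for O_PATH */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the mount ID of whatever is at path by
 * parsing /proc/self/fdinfo/<fd>, or -1 on failure.
 */
static int mnt_id_from_path(const char *path)
{
	char procfile[64], line[256];
	int fd, mnt_id = -1;
	FILE *f;

	assert(path != NULL);

	/* O_PATH gives a handle on the location without opening the file. */
	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd == -1)
		return -1;

	snprintf(procfile, sizeof(procfile), "/proc/self/fdinfo/%d", fd);
	f = fopen(procfile, "r");
	if (f) {
		/* Scan for the "mnt_id:" line; other lines are ignored. */
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "mnt_id: %d", &mnt_id) == 1)
				break;
		}
		fclose(f);
	}
	close(fd);
	return mnt_id;
}
```

The returned ID is what would then be substituted for $mntid in
/mountfs/$mntid/.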

> Don't get me wrong, neither the proc nor the fsinfo implementations
> deal with the notification storms that cause much of the problem we
> see now.
>
> IMHO that's a separate and very difficult problem in itself that
> can't even be considered until getting the information efficiently
> is resolved.

This mount notification storm issue got me thinking.   If I understand
correctly, systemd wants mount notifications so that it can do the
desktop pop-up thing.   Is that correct?

But that doesn't apply to automounts at all.  A new mount performed by
automount is uninteresting to desktops, since it's triggered by
crossing the automount point (i.e. a normal path lookup), not an
external event like inserting a usb stick, etc...

Am I missing something?

Maybe the solution is to just allow filtering out such notifications
at the source, so automount triggers don't generate events for
systemd.

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 13:52           ` Miklos Szeredi
@ 2020-04-02 14:36             ` Lennart Poettering
  2020-04-02 15:22               ` Miklos Szeredi
  2020-04-03  1:44             ` Ian Kent
  1 sibling, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-02 14:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, 02.04.20 15:52, Miklos Szeredi (miklos@szeredi.hu) wrote:

> > Don't get me wrong, neither the proc nor the fsinfo implementations
> > deal with the notification storms that cause much of the problem we
> > see now.
> >
> > IMHO that's a separate and very difficult problem in itself that
> > can't even be considered until getting the information efficiently
> > is resolved.
>
> This mount notification storm issue got me thinking.   If I understand
> correctly, systemd wants mount notifications so that it can do the
> desktop pop-up thing.   Is that correct?

This has little to do with the desktop. Startup scheduling is mostly
about figuring out when we can do the next step of startup. To a large
extent this means issuing a mount command of some form, waiting until
the mount is established, then invoking the next one, and so on; once
the right mounts are established we start the services that require
them. With today's system complexity (storage daemons and so on) this
all becomes a complex network of concurrent dependencies.

Most mounts are established on behalf of PID 1 itself; for those we
could just wait until the mount syscall/command completes (and we
do). But there are plenty of cases where that's not the case, hence we
need to make sure we follow the system mount table state as a whole,
regardless of whether it's systemd itself that triggers some mount or
something else (for example some shell script, udisks, …).

> But that doesn't apply to automounts at all.  A new mount performed by
> automount is uninteresting to desktops, since it's triggered by
> crossing the automount point (i.e. a normal path lookup), not an
> external event like inserting a usb stick, etc...

systemd does not propagate mount events to desktops.

You appear to be thinking about the "udisks" project or so?

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 14:36             ` Lennart Poettering
@ 2020-04-02 15:22               ` Miklos Szeredi
  2020-04-02 15:28                 ` Lennart Poettering
                                   ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-02 15:22 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote:

> You appear to be thinking about the "udisks" project or so?

Probably.

The real question is: is there a sane way to filter mount
notifications so that systemd receives only those which it is
interested in, rather than the tens of thousands that for example
autofs is managing and has nothing to do with systemd?

Is there a specific mountpoint or mountpoints that systemd is waiting
for?   How exactly does this work?

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 16:01       ` David Howells
  2020-04-01 16:30         ` Miklos Szeredi
@ 2020-04-02 15:22         ` David Howells
  2020-04-02 15:24           ` Miklos Szeredi
  2020-04-02 15:42           ` David Howells
  2020-04-02 15:24         ` David Howells
  2 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:22 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent,
	andres, keyrings, linux-fsdevel, linux-kernel,
	Lennart Poettering, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > Not sure what you mean.  I think it shows precisely the information
> > > Karel asked for.
> >
> > It's not atomic.
> 
> Yes it is.

No, it really isn't - though it could be made so.

ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
the configuration can be changing whilst it's being rendered to text.

David

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 16:01       ` David Howells
  2020-04-01 16:30         ` Miklos Szeredi
  2020-04-02 15:22         ` David Howells
@ 2020-04-02 15:24         ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:24 UTC (permalink / raw)
  Cc: dhowells, Miklos Szeredi, Karel Zak, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, keyrings, linux-fsdevel,
	linux-kernel, Lennart Poettering, Aleksa Sarai

David Howells <dhowells@redhat.com> wrote:

> ext4_show_mount()

ext4_show_options(), sorry.

David

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22         ` David Howells
@ 2020-04-02 15:24           ` Miklos Szeredi
  2020-04-02 15:42           ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-02 15:24 UTC (permalink / raw)
  To: David Howells
  Cc: Karel Zak, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Thu, Apr 2, 2020 at 5:23 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > > > Not sure what you mean.  I think it shows precisely the information
> > > > Karel asked for.
> > >
> > > It's not atomic.
> >
> > Yes it is.
>
> No, it really isn't - though it could be made so.
>
> ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
> the configuration can be changing whilst it's being rendered to text.

Does s_umount nest inside namespace_sem?  I really don't see the
relation of those locks.

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22               ` Miklos Szeredi
@ 2020-04-02 15:28                 ` Lennart Poettering
  2020-04-02 15:35                   ` Miklos Szeredi
  2020-04-02 15:51                 ` David Howells
  2020-04-02 15:56                 ` David Howells
  2 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-02 15:28 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, 02.04.20 17:22, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> > You appear to be thinking about the "udisks" project or so?
>
> Probably.
>
> The real question is: is there a sane way to filter mount
> notifications so that systemd receives only those which it is
> interested in, rather than the tens of thousands that for example
> autofs is managing and has nothing to do with systemd?

systemd cares about all mount points in PID1's mount namespace.

The fact that mount tables can grow large is why we want something
better than constantly reparsing the whole /proc/self/mountinfo. But
filtering subsets of that is something we don't really care about.

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:28                 ` Lennart Poettering
@ 2020-04-02 15:35                   ` Miklos Szeredi
  2020-04-02 15:50                     ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-02 15:35 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, Apr 2, 2020 at 5:28 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Thu, 02.04.20 17:22, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > > You appear to be thinking about the "udisks" project or so?
> >
> > Probably.
> >
> > The real question is: is there a sane way to filter mount
> > notifications so that systemd receives only those which it is
> > interested in, rather than the tens of thousands that for example
> > autofs is managing and has nothing to do with systemd?
>
> systemd cares about all mount points in PID1's mount namespace.
>
> The fact that mount tables can grow large is why we want something
> better than constantly reparsing the whole /proc/self/mountinfo. But
> filtering subsets of that is something we don't really care about.

I can accept that, but you haven't given a reason why that's so.

What does it do with the fact that an automount point was crossed, for
example?  How does that affect the operation of systemd?

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22         ` David Howells
  2020-04-02 15:24           ` Miklos Szeredi
@ 2020-04-02 15:42           ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Karel Zak, Christian Brauner, Linus Torvalds, Al Viro,
	dray, Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent,
	andres, keyrings, linux-fsdevel, linux-kernel,
	Lennart Poettering, Aleksa Sarai

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
> > the configuration can be changing whilst it's being rendered to text.
> 
> Does s_umount nest inside namespace_sem?  I really don't see the
> relation of those locks.

If I understand aright what Al has told me, it's a bad idea to do any blocking
operation inside of namespace_sem apart from kmalloc(GFP_KERNEL).

David

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:35                   ` Miklos Szeredi
@ 2020-04-02 15:50                     ` Lennart Poettering
  2020-04-02 17:20                       ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-02 15:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote:

> > systemd cares about all mount points in PID1's mount namespace.
> >
> > The fact that mount tables can grow large is why we want something
> > better than constantly reparsing the whole /proc/self/mountinfo. But
> > filtering subsets of that is something we don't really care about.
>
> I can accept that, but you haven't given a reason why that's so.
>
> What does it do with the fact that an automount point was crossed, for
> example?  How does that affect the operation of systemd?

We don't care how a mount point came to be. If it's autofs or
something else, we don't care. We don't access these mount points
ourselves ever, we just watch their existence.

I mean, it's not just about startup, it's also about shutdown. At
shutdown we need to unmount everything from the leaves towards the
root so that all file systems are in a clean state. And that means
*all* mounts, even autofs ones, even udisks ones, or whatever else
established them, we don't care. I mean, the autofs daemon can die at
any time, and we still must be able to shut down sensibly, and thus
unmount all mounts inside some autofs hierarchy at the right time,
before unmounting the autofs top-level dir and then whatever might be
further up the tree.

systemd needs to know the whole tree, to figure out deps properly for
things like that, hence we aren't interested in filtering, we are
interested in minimizing what we do when something changes.
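
The leaves-to-root ordering can be illustrated with a small sketch.
This is not systemd's implementation, only one simple way to obtain
such an order: sort the mount point paths so that deeper paths come
first.  It assumes the caller already has the full list of mount
points and ignores mounts stacked on the same path; all function
names are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Number of path components: "/" -> 0, "/a" -> 1, "/a/b" -> 2. */
static int mount_depth(const char *path)
{
	int depth = 0;

	assert(path != NULL);
	for (const char *p = path; *p; p++)
		if (*p == '/' && p[1] != '\0' && p[1] != '/')
			depth++;
	return depth;
}

/* qsort() comparator: deeper mount points sort first. */
static int cmp_deepest_first(const void *a, const void *b)
{
	const char *pa = *(const char *const *)a;
	const char *pb = *(const char *const *)b;
	int da = mount_depth(pa), db = mount_depth(pb);

	if (da != db)
		return db - da;
	return strcmp(pa, pb);	/* deterministic tie-break */
}

/* Reorder 'mounts' in place so that children precede their parents. */
static void order_for_unmount(const char **mounts, size_t n)
{
	qsort(mounts, n, sizeof(*mounts), cmp_deepest_first);
}
```

A child mount point always has more path components than its parent,
so walking the sorted array front to back unmounts every mount inside
an autofs hierarchy before the autofs top-level directory itself.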

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22               ` Miklos Szeredi
  2020-04-02 15:28                 ` Lennart Poettering
@ 2020-04-02 15:51                 ` David Howells
  2020-04-02 15:56                 ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:51 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

Lennart Poettering <mzxreary@0pointer.de> wrote:

> systemd cares about all mount points in PID1's mount namespace.
> 
> The fact that mount tables can grow large is why we want something
> better than constantly reparsing the whole /proc/self/mountinfo. But
> filtering subsets of that is something we don't really care about.

With the notifications stuff I've done, you can do, for example:

	pipe2(pipefd, O_NOTIFICATION_PIPE);
	ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_SIZE, 256);
	watch_mount(AT_FDCWD, "/", 0, pipefd[0], 0x02);

And that will catch all mount object changes in the subtree rooted at the
given path, in this case "/".

If you want to limit it to just the notifications on that mount, you would
need to install a filter:

	struct watch_notification_filter filter = {
		.nr_filters	= 1,
		.filters = {
			[0]	= {
				.type		= WATCH_TYPE_MOUNT_NOTIFY,
				.subtype_filter[0]= UINT_MAX,
				.info_mask	= NOTIFY_MOUNT_IS_RECURSIVE,
				.info_filter	= 0,
			},
		},
	};
	ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_FILTER, &filter);

Note that this doesn't monitor for superblock changes and events.  They must
be watched individually with something like:

	watch_sb(AT_FDCWD, "/afs", AT_NO_AUTOMOUNT, pipefd[0], 0x27);

David

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:22               ` Miklos Szeredi
  2020-04-02 15:28                 ` Lennart Poettering
  2020-04-02 15:51                 ` David Howells
@ 2020-04-02 15:56                 ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-04-02 15:56 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

David Howells <dhowells@redhat.com> wrote:

> 				.info_mask	= NOTIFY_MOUNT_IS_RECURSIVE,

Sorry, I meant NOTIFY_MOUNT_IN_SUBTREE; NOTIFY_MOUNT_IS_RECURSIVE indicates
that the operation was recursive in nature.

David

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 15:50                     ` Lennart Poettering
@ 2020-04-02 17:20                       ` Miklos Szeredi
  2020-04-03 11:08                         ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-02 17:20 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Thu, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > > systemd cares about all mount points in PID1's mount namespace.
> > >
> > > The fact that mount tables can grow large is why we want something
> > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > filtering subsets of that is something we don't really care about.
> >
> > I can accept that, but you haven't given a reason why that's so.
> >
> > What does it do with the fact that an automount point was crossed, for
> > example?  How does that affect the operation of systemd?
>
> We don't care how a mount point came to be. If it's autofs or
> something else, we don't care. We don't access these mount points
> ourselves ever, we just watch their existence.
>
> I mean, it's not just about startup, it's also about shutdown. At
> shutdown we need to unmount everything from the leaves towards the
> root so that all file systems are in a clean state.

Unfortunately that's not guaranteed by umounting all filesystems from
the init namespace.  A filesystem is shut down when all references to
it are gone.  Perhaps you instead want to lazy unmount root (yeah,
that may not actually be allowed, but anyway, lazy unmounting the top
level ones should do) and watch for super block shutdown events
instead.

Does that make any sense?

Thanks,
Miklos

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 13:52           ` Miklos Szeredi
  2020-04-02 14:36             ` Lennart Poettering
@ 2020-04-03  1:44             ` Ian Kent
  2020-04-03 11:11               ` Lennart Poettering
  1 sibling, 1 reply; 99+ messages in thread
From: Ian Kent @ 2020-04-03  1:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Lennart Poettering, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

On Thu, 2020-04-02 at 15:52 +0200, Miklos Szeredi wrote:
> On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <raven@themaw.net> wrote:
> > On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> > > On Wed, Apr 1, 2020 at 6:07 PM David Howells <dhowells@redhat.com> wrote:
> > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > I've still not heard a convincing argument in favor of a syscall.
> > > >
> > > > From your own results, scanning 10000 mounts through mountfs and
> > > > reading just two values from each is an order of magnitude slower
> > > > without the effect of the dentry/inode caches.  It gets faster on the
> > > > second run because the mountfs dentries and inodes are cached - but at
> > > > a cost of >205MiB of RAM.  And it's *still* slower than fsinfo().
> > >
> > > Already told you that we can just delete the dentry on dput_final, so
> > > the memory argument is immaterial.
> > >
> > > And the speed argument also, because there's no use case where that
> > > would make a difference.  You keep bringing up the notification queue
> > > overrun when watching a subtree, but that's going to be painful with
> > > fsinfo(2) as well.   If that's a relevant use case (not saying it's
> > > true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
> > > that contains all information for the subtree.  Have fun implementing
> > > that with fsinfo(2).
> >
> > Forgive me for not trawling through your patch to work this out,
> > but how does a poll on a path get what's needed to get mount info?
> >
> > Or, more specifically, how does one get what's needed to go directly
> > to the place to get mount info when something in the tree under the
> > polled path changes (mount/umount)?  IIUC poll alone won't do subtree
> > change monitoring.
>
> The mechanisms are basically the same as with fsinfo(2).   You can get
> to the mountfs entry through the mount ID or through a proc/fd/ type
> symlink.  So if you have a path, there are two options:
>
>  - find out the mount ID belonging to that path and go to /mountfs/$mntid/
>  - open the path with fd = open(path, O_PATH) and then go to
>    /proc/self/fdmount/$fd/
> 
> Currently the only way to find the mount id from a path is by parsing
> /proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
> return it directly from a path.   Also the mount notification queue
> that David implemented contains the mount ID of the changed mount.

I'm aware the mount id comes through David's notifications, I was
wondering how to get that via your recommendation, thanks.

In your scheme it sounds like the mount id doesn't hold the
importance it deserves, it's central to the whole idea of getting
information about these mounts. But it sounds like you need to
open fds to paths you might not know to find it ...

Your explanation also wasn't clear on how one gets notifications of
events within the tree under a mount that you've opened an fd on.

> 
> > Don't get me wrong, neither the proc nor the fsinfo implementations
> > deal with the notification storms that cause much of the problem we
> > see now.
> > 
> > IMHO that's a separate and very difficult problem in itself that
> > can't even be considered until getting the information efficiently
> > is resolved.
> 
> This mount notification storm issue got me thinking.   If I understand
> correctly, systemd wants mount notifications so that it can do the
> desktop pop-up thing.   Is that correct?
>
> But that doesn't apply to automounts at all.  A new mount performed by
> automount is uninteresting to desktops, since it's triggered by
> crossing the automount point (i.e. a normal path lookup), not an
> external event like inserting a usb stick, etc...
> 
> Am I missing something?

Yeah, you're not missing anything.

Unfortunately, in a recent discussion on the autofs mailing list,
an investigation showed that systemd does want/get events for
autofs mounts and proceeds to issue around 100 or so events on
D-Bus for every one.

> 
> Maybe the solution is to just allow filtering out such notifications
> at the source, so automount triggers don't generate events for
> systemd.

Except that autofs automounts might be expected to be seen on a
desktop, that's not out of the question I guess.

Ian

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-01 15:25       ` Miklos Szeredi
@ 2020-04-03  9:11         ` Karel Zak
  0 siblings, 0 replies; 99+ messages in thread
From: Karel Zak @ 2020-04-03  9:11 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Miklos Szeredi, Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	keyrings, linux-fsdevel, linux-kernel, Lennart Poettering,
	Aleksa Sarai

On Wed, Apr 01, 2020 at 05:25:54PM +0200, Miklos Szeredi wrote:
> fsinfo(2) will never be substantially cheaper than reading and parsing
> /mnt/MNT_ID/info.  In fact reading a large part of the mount table
> using fsinfo(2) will be substantially slower than parsing
> /proc/self/mountinfo (this doesn't actually do the parsing but that
> would add a very small amount of overhead):

I think nobody wants to use fsinfo() or mountfs as a replacement for
the whole /proc/self/mountinfo. That does not make sense. We need a
per-mountpoint API; for whole-mount-table use cases (like findmnt or
lsblk) the current mountinfo is good enough.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-02 17:20                       ` Miklos Szeredi
@ 2020-04-03 11:08                         ` Lennart Poettering
  2020-04-03 11:48                           ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-03 11:08 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Thu, 02.04.20 19:20, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Thu, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote:
> >
> > > > systemd cares about all mount points in PID1's mount namespace.
> > > >
> > > > The fact that mount tables can grow large is why we want something
> > > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > > filtering subsets of that is something we don't really care about.
> > >
> > > I can accept that, but you haven't given a reason why that's so.
> > >
> > > What does it do with the fact that an automount point was crossed, for
> > > example?  How does that affect the operation of systemd?
> >
> > We don't care how a mount point came to be. If it's autofs or
> > something else, we don't care. We don't access these mount points
> > > ourselves ever, we just watch their existence.
> > >
> > > I mean, it's not just about startup, it's also about shutdown. At
> > shutdown we need to unmount everything from the leaves towards the
> > root so that all file systems are in a clean state.
>
> Unfortunately that's not guaranteed by umounting all filesystems from
> the init namespace.  A filesystem is shut down when all references to
> it are gone.  Perhaps you instead want to lazy unmount root (yeah,
> that may not actually be allowed, but anyway, lazy unmounting the top
> level ones should do) and watch for super block shutdown events
> instead.
>
> Does that make any sense?

When all mounts in the init mount namespace are unmounted and all
remaining processes killed, we switch root back to the initrd, so that
even the root fs can be unmounted, and then we disassemble any complex
backing storage there is, i.e. lvm, luks, raid, …

Because the initrd is its own little root fs independent of the actual
root we can fully disassemble everything this way, as we do not retain
any references to it anymore in any way.

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03  1:44             ` Ian Kent
@ 2020-04-03 11:11               ` Lennart Poettering
  2020-04-03 11:38                 ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-03 11:11 UTC (permalink / raw)
  To: Ian Kent
  Cc: Miklos Szeredi, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fri, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote:

> > Currently the only way to find the mount id from a path is by parsing
> > /proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
> > return it directly from a path.   Also the mount notification queue
> > that David implemented contains the mount ID of the changed mount.

I would love to have the mount ID exposed via statx().

In systemd we generally try name_to_handle_at() to query the mount ID
first. It returns both the actual fhandle and the mount ID after all,
and we then throw the fhandle away. It's not available on all fs
though, but it has the benefit that it works without procfs and on
a number of older kernels that didn't expose the mnt id in fdinfo.
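
The name_to_handle_at() approach described above can be sketched
roughly as follows.  This is a minimal illustration, not systemd's
actual code; the helper name path_mount_id_handle() is made up for
this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/*
 * Hypothetical helper: ask name_to_handle_at() for the mount ID of
 * 'path' and discard the file handle itself.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EOPNOTSUPP on filesystems that cannot export file handles).
 */
static int path_mount_id_handle(const char *path, int *mnt_id)
{
	struct file_handle *fh;
	int ret, saved_errno;

	assert(path != NULL && mnt_id != NULL);

	/* Reserve the maximum handle size so the buffer is large enough. */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	ret = name_to_handle_at(AT_FDCWD, path, fh, mnt_id, 0);

	/* Only the mount ID is wanted; throw the handle away. */
	saved_errno = errno;
	free(fh);
	errno = saved_errno;
	return ret;
}
```

Because the call can fail on filesystems without file handle support,
a caller would normally fall back to another method (such as fdinfo)
when this returns -1.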

Lennart

--
Lennart Poettering, Berlin

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:11               ` Lennart Poettering
@ 2020-04-03 11:38                 ` Miklos Szeredi
  2020-04-03 12:05                   ` Richard Weinberger
                                     ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-03 11:38 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Fri, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote:
>
> > > Currently the only way to find the mount id from a path is by parsing
> > > /proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
> > > return it directly from a path.   Also the mount notification queue
> > > that David implemented contains the mount ID of the changed mount.
>
> I would love to have the mount ID exposed via statx().

Here's a patch.

Thanks,
Miklos

[-- Attachment #2: statx-add-mount-id.patch --]
[-- Type: text/x-patch, Size: 2253 bytes --]

From: Miklos Szeredi <mszeredi@redhat.com>
Subject: statx: add mount ID

Systemd is hacking around to get it and it's trivial to add to statx, so...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/stat.c                 |    6 ++++++
 include/linux/stat.h      |    1 +
 include/uapi/linux/stat.h |    6 +++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

--- a/fs/stat.c
+++ b/fs/stat.c
@@ -22,6 +22,7 @@
 #include <asm/unistd.h>
 
 #include "internal.h"
+#include "mount.h"
 
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
@@ -199,6 +200,10 @@ int vfs_statx(int dfd, const char __user
 		goto out;
 
 	error = vfs_getattr(&path, stat, request_mask, flags);
+	if (!error) {
+		stat->mnt_id = real_mount(path.mnt)->mnt_id;
+		stat->result_mask |= STATX_MNT_ID;
+	}
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -563,6 +568,7 @@ cp_statx(const struct kstat *stat, struc
 	tmp.stx_rdev_minor = MINOR(stat->rdev);
 	tmp.stx_dev_major = MAJOR(stat->dev);
 	tmp.stx_dev_minor = MINOR(stat->dev);
+	tmp.stx_mnt_id = stat->mnt_id;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -47,6 +47,7 @@ struct kstat {
 	struct timespec64 ctime;
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
+	u64		mnt_id;
 };
 
 #endif
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -123,7 +123,10 @@ struct statx {
 	__u32	stx_dev_major;	/* ID of device containing file [uncond] */
 	__u32	stx_dev_minor;
 	/* 0x90 */
-	__u64	__spare2[14];	/* Spare space for future expansion */
+	__u64	stx_mnt_id;
+	__u64	__spare2;
+	/* 0xa0 */
+	__u64	__spare3[12];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -148,6 +151,7 @@ struct statx {
 #define STATX_BLOCKS		0x00000400U	/* Want/got stx_blocks */
 #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
+#define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_ALL		0x00000fffU	/* All currently supported flags */
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:08                         ` Lennart Poettering
@ 2020-04-03 11:48                           ` Miklos Szeredi
  2020-04-03 15:01                             ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-03 11:48 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fri, Apr 3, 2020 at 1:08 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Do, 02.04.20 19:20, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> > >
> > > On Do, 02.04.20 17:35, Miklos Szeredi (miklos@szeredi.hu) wrote:
> > >
> > > > > systemd cares about all mount points in PID1's mount namespace.
> > > > >
> > > > > The fact that mount tables can grow large is why we want something
> > > > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > > > filtering subsets of that is something we don't really care about.
> > > >
> > > > I can accept that, but you haven't given a reason why that's so.
> > > >
> > > > What does it do with the fact that an automount point was crossed, for
> > > > example?  How does that affect the operation of systemd?
> > >
> > > We don't care how a mount point came to be. If it's autofs or
> > > something else, we don't care. We don't access these mount points
> > > ourselves ever, we just watch their existence.
> > >
> > > I mean, it's not just about startup it's also about shutdown. At
> > > shutdown we need to unmount everything from the leaves towards the
> > > root so that all file systems are in a clean state.
> >
> > Unfortunately that's not guaranteed by umounting all filesystems from
> > the init namespace.  A filesystem is shut down when all references to
> > it are gone.  Perhaps you instead want to lazy unmount root (yeah,
> > that may not actually be allowed, but anyway, lazy unmounting the top
> > level ones should do) and watch for super block shutdown events
> > instead.
> >
> > Does that make any sense?
>
> When all mounts in the init mount namespace are unmounted and all
> remaining processes killed we switch root back to the initrd, so that
> even the root fs can be unmounted, and then we disassemble any backing
> complex storage if there is, i.e. lvm, luks, raid, …

I think it could be done the other way round, much simpler:

 - switch back to initrd
 - umount root, keeping the tree intact (UMOUNT_DETACHED)
 - kill all remaining processes, wait for all to exit

I think that should guarantee that all super blocks have been shut down.  Al?

The advantage would be that there's no need to walk the mount tree
unmounting individual leafs, since it's all done automagically.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:38                 ` Miklos Szeredi
@ 2020-04-03 12:05                   ` Richard Weinberger
  2020-04-03 15:12                   ` Lennart Poettering
  2020-04-03 15:36                   ` David Howells
  2 siblings, 0 replies; 99+ messages in thread
From: Richard Weinberger @ 2020-04-03 12:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	LKML, Aleksa Sarai

On Fri, Apr 3, 2020 at 1:40 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote:
> >
> > > > Currently the only way to find the mount id from a path is by parsing
> > > > /proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
> > > > return it directly from a path.   Also the mount notification queue
> > > > that David implemented contains the mount ID of the changed mount.
> >
> > I would love to have the mount ID exposed via statx().
>
> Here's a patch.

I was looking more than once for a nice way to get the mount id.
Having it exposed via statx() would be great!
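
For comparison, the existing workaround described in the quoted text
(open the path, then parse /proc/self/fdinfo/$fd) can be sketched
roughly as below.  mount_id_from_fdinfo() is a hypothetical helper
name, and the sketch assumes /proc is mounted and that fdinfo carries
a "mnt_id:" line:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the mount ID of `path` by reading the
 * "mnt_id:" line of /proc/self/fdinfo/<fd>, or -1 on failure.
 */
static int mount_id_from_fdinfo(const char *path)
{
	char name[64], line[256];
	int fd, id = -1;
	FILE *f;

	assert(path != NULL);
	/* O_PATH: we only need a reference to the location, not its data. */
	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd < 0)
		return -1;

	snprintf(name, sizeof(name), "/proc/self/fdinfo/%d", fd);
	f = fopen(name, "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "mnt_id: %d", &id) == 1)
				break;
		fclose(f);
	}
	close(fd);
	return id;
}
```

That is one open(), one fopen() and a text parse per lookup, which is
the overhead a statx() field would remove.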

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:48                           ` Miklos Szeredi
@ 2020-04-03 15:01                             ` Lennart Poettering
  2020-04-06  9:22                               ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-03 15:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fr, 03.04.20 13:48, Miklos Szeredi (miklos@szeredi.hu) wrote:

> > > Does that make any sense?
> >
> > When all mounts in the init mount namespace are unmounted and all
> > remaining processes killed we switch root back to the initrd, so that
> > even the root fs can be unmounted, and then we disassemble any backing
> > complex storage if there is, i.e. lvm, luks, raid, …
>
> I think it could be done the other way round, much simpler:
>
>  - switch back to initrd
>  - umount root, keeping the tree intact (UMOUNT_DETACHED)
>  - kill all remaining processes, wait for all to exit

Nah. What I wrote above is drastically simplified. It's IRL more
complex. Specific services need to be killed between certain mounts
are unmounted, since they are a backend for another mount. NFS, or
FUSE or stuff like that usually has some processes backing them
around, and we need to stop the mounts they provide before these
services, and then the mounts these services reside on after that, and
so on. It's a complex dependency tree of stuff that needs to be done
in order, so that we can deal with arbitrarily nested mounts, storage
subsystems, and backing services.

Anyway, this all works fine in systemd, the dependency logic is
there. We want a more efficient way to watch mounts, that's
all. Subscribing and constantly reparsing /proc/self/mountinfo is
awful, that's all.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:38                 ` Miklos Szeredi
  2020-04-03 12:05                   ` Richard Weinberger
@ 2020-04-03 15:12                   ` Lennart Poettering
  2020-04-03 20:30                     ` J. Bruce Fields
  2020-04-03 15:36                   ` David Howells
  2 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-03 15:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fr, 03.04.20 13:38, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Fr, 03.04.20 09:44, Ian Kent (raven@themaw.net) wrote:
> >
> > > > Currently the only way to find the mount id from a path is by parsing
> > > > /proc/self/fdinfo/$fd.  It is trivial, however, to extend statx(2) to
> > > > return it directly from a path.   Also the mount notification queue
> > > > that David implemented contains the mount ID of the changed mount.
> >
> > I would love to have the mount ID exposed via statx().
>
> Here's a patch.

Oh, this is excellent. I love it, thanks!

BTW, while we are at it: one more thing I'd love to see exposed by
statx() is a simple flag whether the inode is a mount point. There's
plenty code that implements a test like this all over the place, and
it usually isn't very safe. There's one implementation in util-linux
for example (in the /usr/bin/mountpoint binary), and another one in
systemd. Would be awesome to just have a statx() return flag for that,
that would make things *so* much easier and more robust. because in
fact most code isn't very good that implements this, as much of it
just compares st_dev of the specified file and its parent. Better code
compares the mount ID, but as mentioned that's not as pretty as it
could be so far...

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 11:38                 ` Miklos Szeredi
  2020-04-03 12:05                   ` Richard Weinberger
  2020-04-03 15:12                   ` Lennart Poettering
@ 2020-04-03 15:36                   ` David Howells
  2020-04-03 15:41                     ` Lennart Poettering
  2 siblings, 1 reply; 99+ messages in thread
From: David Howells @ 2020-04-03 15:36 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: dhowells, Miklos Szeredi, Ian Kent, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

Lennart Poettering <mzxreary@0pointer.de> wrote:

> BTW, while we are at it: one more thing I'd love to see exposed by
> statx() is a simple flag whether the inode is a mount point.

Note that an inode or a dentry might be a mount point in one namespace, but
not in another.  Do you actually mean an inode - or do you actually mean the
(mount,dentry) pair that you're looking at?  (Ie. should it be namespace
specific?)

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 15:36                   ` David Howells
@ 2020-04-03 15:41                     ` Lennart Poettering
  0 siblings, 0 replies; 99+ messages in thread
From: Lennart Poettering @ 2020-04-03 15:41 UTC (permalink / raw)
  To: David Howells
  Cc: Miklos Szeredi, Ian Kent, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fr, 03.04.20 16:36, David Howells (dhowells@redhat.com) wrote:

> Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point.
>
> Note that an inode or a dentry might be a mount point in one namespace, but
> not in another.  Do you actually mean an inode - or do you actually mean the
> (mount,dentry) pair that you're looking at?  (Ie. should it be namespace
> specific?)

yes, it should be specific to the mount hierarchy in the current namespace.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 15:12                   ` Lennart Poettering
@ 2020-04-03 20:30                     ` J. Bruce Fields
  2020-04-06  8:35                       ` Miklos Szeredi
  2020-04-06  9:17                       ` Karel Zak
  0 siblings, 2 replies; 99+ messages in thread
From: J. Bruce Fields @ 2020-04-03 20:30 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Miklos Szeredi, Ian Kent, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> BTW, while we are at it: one more thing I'd love to see exposed by
> statx() is a simple flag whether the inode is a mount point. There's
> plenty code that implements a test like this all over the place, and
> it usually isn't very safe. There's one implementation in util-linux
> for example (in the /usr/bin/mountpoint binary), and another one in
> systemd. Would be awesome to just have a statx() return flag for that,
> that would make things *so* much easier and more robust. because in
> fact most code isn't very good that implements this, as much of it
> just compares st_dev of the specified file and its parent. Better code
> compares the mount ID, but as mentioned that's not as pretty as it
> could be so far...

nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
and ".." and returns true if they have different st_dev or the same
st_ino.  Comparing mount ids sounds better.

So anyway, yes, everybody reinvents the wheel here, and this would be
useful.  (And, yes, we want to know for the vfsmount, we don't care
whether the same inode is used as a mountpoint someplace else.)

--b.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] Mount and superblock notifications
  2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
@ 2020-04-04 21:13   ` Linus Torvalds
  2020-04-05 22:52     ` Andres Freund
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2020-04-04 21:13 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	jlayton, Ian Kent, andres, Christian Brauner, keyrings,
	linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 30, 2020 at 7:37 AM David Howells <dhowells@redhat.com> wrote:
>
> If you could consider pulling this - or would you prefer it to go through
> Al?  It adds a couple of VFS-related event sources for the general
> notification mechanism:

My issue with these remains the same as it was last time, so I'll just
quote what I said back then:

 "So I no longer hate the implementation, but I do want to see the
  actual user space users come out of the woodwork and try this out for
  their use cases.

  I'd hate to see a new event queue interface that people then can't
  really use due to it not fulfilling their needs, or can't use for some
  other reason."

I want to see somebody step up enough to say "yes, I actually use
this, and have the patches for the user space side, and it helps my
load by 3000%, and here are the numbers, and the event overflow case
isn't an issue because Y"

Or whatever. It doesn't have to be performance, but the separate
discussion I've seen has been about that being the reason for it.

I just don't want it to be a _hypothetical_ reason. I want it to be a
tested reason where people said "yeah, this is easy to use and
actually fixes the problems".

Because if what happens is that when the events overflow, and maybe
people fall back on the old model (or whatever) then that probably
just means that you do better up until a point where you start doing
_worse_ than we used to.

Or people find out that they needed more information anyway, and the
event model doesn't work when you restart your special server because
you've lost the original state. Or any other number of "cool feature,
but I can't really use it".

IOW, I really want to know that yes, the design is what people will
then use and it actually fixes real-world issues.

And it needs to be interesting and pressing enough that those people
actually at least do a working prototype on top of a patch-set that
hasn't made it into the kernel yet.

Now, I realize that other projects won't _upstream_ their support
before the kernel has the infrastructure, so I'm not looking for
_that_ kind of "yeah, look, project XYZ already does this and Red Hat
ships it". No, I'm looking for those outside developers who say more
than "this is a pet peeve of mine with the existing interface". I want
to see some actual use - even if it's just in a development
environment - that shows that it's (a) sufficient and (b) actually
fixes problems.

             Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] Mount and superblock notifications
  2020-04-04 21:13   ` Linus Torvalds
@ 2020-04-05 22:52     ` Andres Freund
  0 siblings, 0 replies; 99+ messages in thread
From: Andres Freund @ 2020-04-05 22:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, jlayton, Ian Kent, Christian Brauner,
	keyrings, linux-fsdevel, Linux Kernel Mailing List

Hi,

On 2020-04-04 14:13:03 -0700, Linus Torvalds wrote:
> And it needs to be interesting and pressing enough that those people
> actually at least do a working prototype on top of a patch-set that
> hasn't made it into the kernel yet.
> 
> Now, I realize that other projects won't _upstream_ their support
> before the kernel has the infrastructure, so I'm not looking for
> _that_ kind of "yeah, look, project XYZ already does this and Red Hat
> ships it". No, I'm looking for those outside developers who say more
> than "this is a pet peeve of mine with the existing interface". I want
> to see some actual use - even if it's just in a development
> environment - that shows that it's (a) sufficient and (b) actually
> fixes problems.

FWIW, postgres remains interested in using the per-superblock events.

On 2020-03-30 15:36:54 +0100, David Howells wrote:
>  (2) Superblock notifications.
> 
>      This one is provided to allow systemd or the desktop to more easily
>      detect events such as I/O errors and EDQUOT/ENOSPC.  This would be of
>      interest to Postgres:
> 
> 	https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/
> 
>      But could also be used to indicate to systemd when a superblock has
>      had its configuration changed.

What prevents me from coming up with a prototype is that the error
handling pieces aren't complete, as far as I can tell:

On 2020-03-30 15:36:54 +0100, David Howells wrote:
>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>      errors (not complete yet).

There's afaict no notify_sb_error() callers, making it hard for me to
actually test anything.

The important issue for us is I/O errors, but EDQUOT/ENOSPC could also
be useful (but is not urgent).

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 20:30                     ` J. Bruce Fields
@ 2020-04-06  8:35                       ` Miklos Szeredi
  2020-04-06 16:07                         ` J. Bruce Fields
  2020-04-06  9:17                       ` Karel Zak
  1 sibling, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-06  8:35 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 2010 bytes --]

On Fri, Apr 3, 2020 at 10:30 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point. There's
> > plenty code that implements a test like this all over the place, and
> > it usually isn't very safe. There's one implementation in util-linux
> > for example (in the /usr/bin/mountpoint binary), and another one in
> > systemd. Would be awesome to just have a statx() return flag for that,
> > that would make things *so* much easier and more robust. because in
> > fact most code isn't very good that implements this, as much of it
> > just compares st_dev of the specified file and its parent. Better code
> > compares the mount ID, but as mentioned that's not as pretty as it
> > could be so far...
>
> nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> and ".." and returns true if they have different st_dev or the same
> st_ino.  Comparing mount ids sounds better.
>
> So anyway, yes, everybody reinvents the wheel here, and this would be
> useful.  (And, yes, we want to know for the vfsmount, we don't care
> whether the same inode is used as a mountpoint someplace else.)

Attaching a patch.

There's some ambiguity about what is a "mountpoint" and what these
tools are interested in.  My guess is that they are not interested in
an object being a mount point (something where another object is
mounted) but being a mount root (this is the object mounted at the
mount point).  I.e.:

fd = open("/mnt", O_PATH);
mount("/bin", "/mnt", NULL, MS_BIND, NULL);
statx(AT_FDCWD, "/mnt", 0, 0, &stx1);
statx(fd, "", AT_EMPTY_PATH, 0, &stx2);
printf("mount_root(/mnt) = %c, mount_root(fd) = %c\n",
    stx1.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n',
    stx2.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n');

Would print:
mount_root(/mnt) = y, mount_root(fd) = n

Thanks,
Miklos

[-- Attachment #2: statx-add-mount_root.patch --]
[-- Type: text/x-patch, Size: 1520 bytes --]

From: Miklos Szeredi <mszeredi@redhat.com>
Subject: statx: add mount_root

Determining whether a path or file descriptor refers to a mountpoint (or
more precisely a mount root) is not trivial using current tools.

Add a flag to statx that indicates whether the path or fd refers to the
root of a mount or not.

Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/stat.c                 |    3 +++
 include/uapi/linux/stat.h |    1 +
 2 files changed, 4 insertions(+)

--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -172,6 +172,7 @@ struct statx {
 #define STATX_ATTR_NODUMP		0x00000040 /* [I] File is not to be dumped */
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
+#define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 
 
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -202,6 +202,9 @@ int vfs_statx(int dfd, const char __user
 	error = vfs_getattr(&path, stat, request_mask, flags);
 	stat->mnt_id = real_mount(path.mnt)->mnt_id;
 	stat->result_mask |= STATX_MNT_ID;
+	if (path.mnt->mnt_root == path.dentry)
+		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
+	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 20:30                     ` J. Bruce Fields
  2020-04-06  8:35                       ` Miklos Szeredi
@ 2020-04-06  9:17                       ` Karel Zak
  2020-04-06 16:34                         ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Karel Zak @ 2020-04-06  9:17 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Lennart Poettering, Miklos Szeredi, Ian Kent, David Howells,
	Christian Brauner, Linus Torvalds, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point. There's
> > plenty code that implements a test like this all over the place, and
> > it usually isn't very safe. There's one implementation in util-linux
> > for example (in the /usr/bin/mountpoint binary), and another one in
> > systemd. Would be awesome to just have a statx() return flag for that,
> > that would make things *so* much easier and more robust. because in
> > fact most code isn't very good that implements this, as much of it
> > just compares st_dev of the specified file and its parent. Better code
> > compares the mount ID, but as mentioned that's not as pretty as it
> > could be so far...
> 
> nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> and ".." and returns true if they have different st_dev or the same
> st_ino.  Comparing mount ids sounds better.

BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
For mountpoint(1) we search the directory in /proc/self/mountinfo.

> So anyway, yes, everybody reinvents the wheel here, and this would be
> useful.

 +1

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-03 15:01                             ` Lennart Poettering
@ 2020-04-06  9:22                               ` Miklos Szeredi
  2020-04-06 17:29                                 ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-06  9:22 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Fri, Apr 3, 2020 at 5:01 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Fr, 03.04.20 13:48, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > > > Does that make any sense?
> > >
> > > When all mounts in the init mount namespace are unmounted and all
> > > remaining processes killed we switch root back to the initrd, so that
> > > even the root fs can be unmounted, and then we disassemble any backing
> > > complex storage if there is, i.e. lvm, luks, raid, …
> >
> > I think it could be done the other way round, much simpler:
> >
> >  - switch back to initrd
> >  - umount root, keeping the tree intact (UMOUNT_DETACHED)
> >  - kill all remaining processes, wait for all to exit
>
> Nah. What I wrote above is drastically simplified. It's IRL more
> complex. Specific services need to be killed between certain mounts
> are unmounted, since they are a backend for another mount. NFS, or
> FUSE or stuff like that usually has some processes backing them
> around, and we need to stop the mounts they provide before these
> services, and then the mounts these services reside on after that, and
> so on. It's a complex dependency tree of stuff that needs to be done
> in order, so that we can deal with arbitrarily nested mounts, storage
> subsystems, and backing services.

That still doesn't explain why you need to keep track of all mounts in
the system.

If you are aware of the dependency, then you need to keep track of
that particular mount. If not, then why?

What I'm starting to see is that there's a fundamental conflict
between how systemd people want to deal with new mounts and how some
other people want to use mounts (i.e. tens of thousands of mounts in
an automount map).

I'm really curious how much the mount notification ring + per mount
query (any implementation) can help that use case.

> Anyway, this all works fine in systemd, the dependency logic is
> there. We want a more efficient way to watch mounts, that's
> all. Subscribing and constantly reparsing /proc/self/mountinfo is
> awful, that's all.

I'm not sure that is all.   To handle storms of tens of thousands of
mounts, my guess is that the fundamental way of dealing with these
changes will need to be updated in systemd.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06  8:35                       ` Miklos Szeredi
@ 2020-04-06 16:07                         ` J. Bruce Fields
  0 siblings, 0 replies; 99+ messages in thread
From: J. Bruce Fields @ 2020-04-06 16:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Lennart Poettering, Ian Kent, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

The patch makes sense to me, thanks!

In the NFS case it's implementing the "mountpoint" export option:

       mountpoint=path

       mp     This option makes it possible to only export a directory if it
              has successfully been mounted.  If no path is given (e.g.
              mountpoint or mp) then the export point must also be a mount
              point.  If it isn't then the export point is not exported.  This
              allows you to be sure that the directory underneath a mountpoint
              will never be exported by accident if, for example, the
              filesystem failed to mount due to a disc error.

              If a path is given (e.g. mountpoint=/path or mp=/path) then the
              nominated path must be a mountpoint for the exportpoint to be
              exported.

--b.

On Mon, Apr 06, 2020 at 10:35:55AM +0200, Miklos Szeredi wrote:
> From: Miklos Szeredi <mszeredi@redhat.com>
> Subject: statx: add mount_root
> 
> Determining whether a path or file descriptor refers to a mountpoint (or
> more precisely a mount root) is not trivial using current tools.
> 
> Add a flag to statx that indicates whether the path or fd refers to the
> root of a mount or not.
> 
> Reported-by: Lennart Poettering <mzxreary@0pointer.de>
> Reported-by: J. Bruce Fields <bfields@fieldses.org>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/stat.c                 |    3 +++
>  include/uapi/linux/stat.h |    1 +
>  2 files changed, 4 insertions(+)
> 
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -172,6 +172,7 @@ struct statx {
>  #define STATX_ATTR_NODUMP		0x00000040 /* [I] File is not to be dumped */
>  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
>  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> +#define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
>  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
>  
>  
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -202,6 +202,9 @@ int vfs_statx(int dfd, const char __user
>  	error = vfs_getattr(&path, stat, request_mask, flags);
>  	stat->mnt_id = real_mount(path.mnt)->mnt_id;
>  	stat->result_mask |= STATX_MNT_ID;
> +	if (path.mnt->mnt_root == path.dentry)
> +		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
> +	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
>  	path_put(&path);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06  9:17                       ` Karel Zak
@ 2020-04-06 16:34                         ` Linus Torvalds
  2020-04-06 18:46                           ` J. Bruce Fields
  2020-04-06 18:48                           ` Lennart Poettering
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2020-04-06 16:34 UTC (permalink / raw)
  To: Karel Zak
  Cc: J. Bruce Fields, Lennart Poettering, Miklos Szeredi, Ian Kent,
	David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	Linux Kernel Mailing List, Aleksa Sarai

On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote:
>
> On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> >
> > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > and ".." and returns true if they have different st_dev or the same
> > st_ino.  Comparing mount ids sounds better.
>
> BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> For mountpoint(1) we search the directory in /proc/self/mountinfo.

These days you should probably use openat2() with RESOLVE_NO_XDEV.

No need for any mountinfo or anything like that. Just look up the
pathname and say "don't cross mount-points", and you'll get an error
if it's a mount crossing lookup.

So this kind of thing is _not_ an argument for another kernel querying
interface.  We got a new (and better) model for a lot of this.
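
A minimal sketch of that lookup, assuming kernel headers that ship
<linux/openat2.h> and a kernel that implements openat2(); it goes through
syscall() on the assumption that libc has no wrapper. The helper name
lookup_crosses_mount() is made up for illustration, and using O_PATH here
is a choice of this sketch, not something the lookup requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>           /* O_PATH, O_CLOEXEC, AT_FDCWD */
#include <linux/openat2.h>   /* struct open_how, RESOLVE_NO_XDEV */
#include <string.h>
#include <sys/syscall.h>     /* SYS_openat2 */
#include <unistd.h>

/*
 * Hypothetical helper: resolve path relative to dirfd while refusing
 * to cross any mount point.  Returns 1 if the lookup was rejected with
 * EXDEV (a mount would have been crossed), 0 if it resolved on the
 * same mount, and -1 on any other error (ENOENT, ENOSYS, ...).
 */
int lookup_crosses_mount(int dirfd, const char *path)
{
	struct open_how how;
	long fd;

	memset(&how, 0, sizeof(how));
	how.flags = O_PATH | O_CLOEXEC;   /* only a path handle, no file access */
	how.resolve = RESOLVE_NO_XDEV;    /* "don't cross mount-points" */

	fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
	if (fd >= 0) {
		close((int)fd);
		return 0;
	}
	return errno == EXDEV ? 1 : -1;
}
```

A caller would typically open the parent directory and pass its fd as
dirfd with the final component as path, so that only that one step is
tested for a mount crossing.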

              Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06  9:22                               ` Miklos Szeredi
@ 2020-04-06 17:29                                 ` Lennart Poettering
  2020-04-07  2:21                                   ` Ian Kent
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-06 17:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Mo, 06.04.20 11:22, Miklos Szeredi (miklos@szeredi.hu) wrote:

> > Nah. What I wrote above is drastically simplified. It's IRL more
> > complex. Specific services need to be killed between certain mounts
> > are unmounted, since they are a backend for another mount. NFS, or
> > FUSE or stuff like that usually has some processes backing them
> > around, and we need to stop the mounts they provide before these
> > services, and then the mounts these services reside on after that, and
> > so on. It's a complex dependency tree of stuff that needs to be done
> > in order, so that we can deal with arbitrarily nested mounts, storage
> > subsystems, and backing services.
>
> That still doesn't explain why you need to keep track of all mounts in
> the system.
>
> If you are aware of the dependency, then you need to keep track of
> that particular mount. If not, then why?

it works the other way round in systemd: something happens, i.e. a
device pops up or a mount is established and systemd figures out if
there's something to do. i.e. whether services shall be pulled in or
so.

It's that way for a reason: there are plenty of services that want to
be instantiated once for each object of a certain kind that pops up (this
happens very often for devices, but could also happen for any other
kind of "unit" systemd manages, and one of those kinds are mount
units). For those we don't know the unit to pull in yet (because it's
not going to be a well-named singleton, but an instance incorporating
some identifier from the source unit) when the unit that pops up does
so, thus we can only wait for the latter to determine what to pull
in.

> What I'm starting to see is that there's a fundamental conflict
> between how systemd people want to deal with new mounts and how some
> other people want to use mounts (i.e. tens of thousands of mounts in
> an automount map).

Well, I am not sure what automount has to do with anything. You can
have 10K mounts with or without automount, it's orthogonal to that. In
fact, I assumed the point of automount was to pretend there are 10K
mounts but not actually have them most of the time, no?

I mean, whether there's room to optimize D-Bus IPC or not is entirely
orthogonal to anything discussed here regarding fsinfo(). Don't make
this about systemd sending messages over D-Bus, that's a very
different story, and a non-issue if you ask me:

Right now, when you have n mounts, and any mount changes, or one is
added or removed then we have to parse the whole mount table again,
asynchronously, processing all n entries again, every frickin
time. This means the work to process n mounts popping up at boot is
O(n²). That sucks, it should be obvious to anyone. Now if we get that
fixed, by some mount API that can send us minimal notifications about
what happened and where, then this becomes O(n), which is totally OK.
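
To put numbers on that, here is a toy cost model. It is a sketch with
made-up function names, not a measurement: it assumes one full re-parse
of the mount table after every single mount, and exactly one
notification per mount in the incremental case.

```c
#include <stdint.h>

/*
 * Entries parsed when the whole table is re-read after each of n
 * mounts: the k-th re-read sees k entries, so the total is
 * 1 + 2 + ... + n = n(n+1)/2, i.e. O(n²).
 */
uint64_t full_rescan_cost(uint64_t n)
{
	uint64_t total = 0;

	for (uint64_t k = 1; k <= n; k++)
		total += k;
	return total;
}

/*
 * Entries processed when each mount produces one minimal
 * notification that is handled on its own: O(n).
 */
uint64_t incremental_cost(uint64_t n)
{
	return n;
}
```

For n = 10,000 this model gives 50,005,000 parsed entries for the
re-scan approach against 10,000 handled notifications.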

You keep talking about filtering, which will just lower the "n" a bit
in particular cases to some value "m" maybe (with m < n), it does not
address the fact that O(m²) is still a big problem.

hence, filtering is great, no problem, add it if you want it. I
personally don't care about filtering though, and I doubt we'd use it
in systemd, I just care about the O(n²) issue.

If you ask me if D-Bus can handle 10K messages sent over the bus
during boot, then yes, it totally can handle that. Can systemd nicely
process O(n²) mounts internally though equally well? No, obviously not,
if n grows too large. Any computer scientist should understand that.

Anyway, I have the suspicion this discussion has stopped being
useful. I think you are trying to fix problems that userspace actually
doesn't have. I can just tell you what we understand the problems are,
but if you are out trying to fix other perceived ones, then great, but
I mostly lost interest.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06 16:34                         ` Linus Torvalds
@ 2020-04-06 18:46                           ` J. Bruce Fields
  2020-04-06 18:48                           ` Lennart Poettering
  1 sibling, 0 replies; 99+ messages in thread
From: J. Bruce Fields @ 2020-04-06 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, Lennart Poettering, Miklos Szeredi, Ian Kent,
	David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	Linux Kernel Mailing List, Aleksa Sarai

On Mon, Apr 06, 2020 at 09:34:08AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote:
> >
> > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > >
> > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > and ".." and returns true if they have different st_dev or the same
> > > st_ino.  Comparing mount ids sounds better.
> >
> > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > For mountpoint(1) we search the directory in /proc/self/mountinfo.
> 
> These days you should probably use openat2() with RESOLVE_NO_XDEV.
> 
> No need for any mountinfo or anything like that. Just look up the
> pathname and say "don't cross mount-points", and you'll get an error
> if it's a mount crossing lookup.

OK, I can't see why that wouldn't work, thanks.

--b.

> 
> So this kind of thing is _not_ an argument for another kernel querying
> interface.  We got a new (and better) model for a lot of this.
> 
>               Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06 16:34                         ` Linus Torvalds
  2020-04-06 18:46                           ` J. Bruce Fields
@ 2020-04-06 18:48                           ` Lennart Poettering
  2020-04-08  3:36                             ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-06 18:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Karel Zak, J. Bruce Fields, Miklos Szeredi, Ian Kent,
	David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	Linux Kernel Mailing List, Aleksa Sarai

On Mo, 06.04.20 09:34, Linus Torvalds (torvalds@linux-foundation.org) wrote:

> On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote:
> >
> > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > >
> > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > and ".." and returns true if they have different st_dev or the same
> > > st_ino.  Comparing mount ids sounds better.
> >
> > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > For mountpoint(1) we search the directory in /proc/self/mountinfo.
>
> These days you should probably use openat2() with RESOLVE_NO_XDEV.

Note that opening a file is relatively "heavy" i.e. typically triggers
autofs and stuff, and results in security checks (which can fail and
such, and show up in audit).

statx() doesn't do that, and that's explicitly documented
(i.e. AT_NO_AUTOMOUNT and stuff).

Hence, unless openat2() has some mechanism of doing something like an
"open() but not really" (O_PATH isn't really sufficient for this, no?)
I don't think it could be a good replacement for a statx() type check
if something is a mount point or not.

I mean, think about usecases: a common usecase for "is this a
mountpoint" checks are tools that traverse directory trees and want to
stop at submounts. They generally try to minimize operations and hence
stat stuff but don't open anything unless it's what they look for (or a
subdir they identified as a non-submount). Doing an extra openat2() in
between there doesn't sound so attractive, since you pay heavily...

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06 17:29                                 ` Lennart Poettering
@ 2020-04-07  2:21                                   ` Ian Kent
  2020-04-07 13:59                                     ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Ian Kent @ 2020-04-07  2:21 UTC (permalink / raw)
  To: Lennart Poettering, Miklos Szeredi
  Cc: David Howells, Christian Brauner, Linus Torvalds, Al Viro, dray,
	Karel Zak, Miklos Szeredi, Steven Whitehouse, Jeff Layton,
	andres, keyrings, linux-fsdevel, linux-kernel, Aleksa Sarai

On Mon, 2020-04-06 at 19:29 +0200, Lennart Poettering wrote:
> On Mo, 06.04.20 11:22, Miklos Szeredi (miklos@szeredi.hu) wrote:
> 
> > > Nah. What I wrote above is drastically simplified. It's IRL more
> > > complex. Specific services need to be killed between certain
> > > mounts
> > > are unmounted, since they are a backend for another mount. NFS,
> > > or
> > > FUSE or stuff like that usually has some processes backing them
> > > around, and we need to stop the mounts they provide before these
> > > services, and then the mounts these services reside on after
> > > that, and
> > > so on. It's a complex dependency tree of stuff that needs to be
> > > done
> > > in order, so that we can deal with arbitrarily nested mounts,
> > > storage
> > > subsystems, and backing services.
> > 
> > That still doesn't explain why you need to keep track of all mounts
> > in
> > the system.
> > 
> > If you are aware of the dependency, then you need to keep track of
> > that particular mount. If not, then why?
> 
> it works the other way round in systemd: something happens, i.e. a
> device pops up or a mount is established and systemd figures out if
> there's something to do. i.e. whether services shall be pulled in or
> so.
> 
> It's that way for a reason: there are plenty of services that want to
> be instantiated once for each object of a certain kind that pops up (this
> happens very often for devices, but could also happen for any other
> kind of "unit" systemd manages, and one of those kinds are mount
> units). For those we don't know the unit to pull in yet (because it's
> not going to be a well-named singleton, but an instance incorporating
> some identifier from the source unit) when the unit that pops up does
> so, thus we can only wait for the latter to determine what to
> pull
> in.
> 
> > What I'm starting to see is that there's a fundamental conflict
> > between how systemd people want to deal with new mounts and how
> > some
> > other people want to use mounts (i.e. tens of thousands of mounts
> > in
> > an automount map).
> 
> Well, I am not sure what automount has to do with anything. You can
> have 10K mounts with or without automount, it's orthogonal to that.
> In
> fact, I assumed the point of automount was to pretend there are 10K
> mounts but not actually have them most of the time, no?

Yes, but automount, when using a large direct mount map, will be the
source of lots of mounts, each of which is an autofs file system.

> 
> I mean, whether there's room to optimize D-Bus IPC or not is entirely
> orthogonal to anything discussed here regarding fsinfo(). Don't make
> this about systemd sending messages over D-Bus, that's a very
> different story, and a non-issue if you ask me:

Quite probably, yes, that's something you can care about if it really
is an issue but isn't something I care about myself either.

> 
> Right now, when you have n mounts, and any mount changes, or one is
> added or removed then we have to parse the whole mount table again,
> asynchronously, processing all n entries again, every frickin
> time. This means the work to process n mounts popping up at boot is
> O(n²). That sucks, it should be obvious to anyone. Now if we get that
> fixed, by some mount API that can send us minimal notifications about
> what happened and where, then this becomes O(n), which is totally OK.

But this is clearly a problem and is what I do care about and the
infrastructure being proposed here can be used to achieve this.

Unfortunately, and I was mistaken about what systemd does, I don't
see a simple way of improving this. This is because it appears that
systemd having had to scan the entire mount table every time has,
necessarily, led to code that can't easily accommodate the ability
to get the info for a single mount directly.

So to improve this I think quite a few changes will be needed in
systemd and libmount. I'm not quite sure how to get that started.
After all it needs to be done how Karel would like to see it done
in libmount and how systemd folks would like to see it done in
systemd which is very probably not how I would approach it myself.

> 
> You keep talking about filtering, which will just lower the "n" a bit
> in particular cases to some value "m" maybe (with m < n), it does not
> address the fact that O(m²) is still a big problem.
> 
> hence, filtering is great, no problem, add it if you want it. I
> personally don't care about filtering though, and I doubt we'd use it
> in systemd, I just care about the O(n²) issue.
> 
> If you ask me if D-Bus can handle 10K messages sent over the bus
> during boot, then yes, it totally can handle that. Can systemd nicely
> process O(n²) mounts internally though equally well? No, obviously
> not,
> if n grows too large. Any computer scientist should understand
> that.
> 
> Anyway, I have the suspicion this discussion has stopped being
> useful. I think you are trying to fix problems that userspace actually
> doesn't have. I can just tell you what we understand the problems
> are,
> but if you are out trying to fix other perceived ones, then great,
> but
> I mostly lost interest.

Yes, filtering sounds like we've wandered off topic, ;)

Ian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-07  2:21                                   ` Ian Kent
@ 2020-04-07 13:59                                     ` Miklos Szeredi
  2020-04-07 15:53                                       ` Lennart Poettering
  0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-07 13:59 UTC (permalink / raw)
  To: Ian Kent
  Cc: Lennart Poettering, David Howells, Christian Brauner,
	Linus Torvalds, Al Viro, dray, Karel Zak, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	linux-kernel, Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 2797 bytes --]

On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote:
> > Right now, when you have n mounts, and any mount changes, or one is
> > added or removed then we have to parse the whole mount table again,
> > asynchronously, processing all n entries again, every frickin
> > time. This means the work to process n mounts popping up at boot is
> > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > fixed, by some mount API that can send us minimal notifications about
> > what happened and where, then this becomes O(n), which is totally OK.

Something's not right with the above statement.  Hint: if there are
lots of events in quick succession, you can batch them quite easily to
prevent overloading the system.

Wrote a pair of utilities to check out the capabilities of the current
API.   The first one just creates N mounts, optionally sleeping
between each.  The second one watches /proc/self/mountinfo and
generates individual (add/del/change) events based on POLLPRI and
comparing contents with previous instance.

First use case: create 10,000 mounts, then start the watcher and
create 1000 mounts with a 50ms sleep between them.  Total time (user +
system) consumed by the watcher: 25s.  This is indeed pretty dismal,
and a per-mount query will help tremendously.  But it's still "just"
25ms per mount, so if the mounts are far apart (which is what this
test is about), this won't thrash the system.  Note how this is
self-regulating: if the load is high, it will automatically batch more
requests, preventing overload.  It is also prone to lose pairs of add
+ remove in this case (and so is the ring buffer based one from
David).

Second use case: start the watcher and create 50,000 mounts with no
sleep between them.   Total time consumed by the watcher: 0.154s or
3.08us/event.    Note, the same test case adds about 5ms for the
50,000 umount events, which is 0.1us/event.

Real life will probably be between these extremes, but it's clear that
there's room for improvement in userspace as well as kernel
interfaces.  The current kernel interface is very efficient in
retrieving a lot of state in one go.  It is not efficient in handling
small differences.

> > Anyway, I have the suspicion this discussion has stopped being
> > useful. I think you are trying to fix problems that userspace actually
> > doesn't have. I can just tell you what we understand the problems
> > are,
> > but if you are out trying to fix other perceived ones, then great,
> > but
> > I mostly lost interest.

I was, and still am, trying to see the big picture.

Whatever.   I think it's your turn to show some numbers about how the
new API improves performance of systemd with a large number of mounts.

Thanks,
Miklos

[-- Attachment #2: many-mounts.c --]
[-- Type: text/x-csrc, Size: 1155 bytes --]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <err.h>
#include <sys/stat.h>
#include <sys/mount.h>

int main(int argc, char *argv[])
{
	char *base_path = argv[1];
	char name[4096];
	int nr_mounts, i, sleep_ms = 0;

	if (argc < 3 || argc > 4)
		errx(1, "usage: %s base_path nr_mounts [sleep_ms]", argv[0]);

	nr_mounts = atoi(argv[2]);
	if (argc > 3)
		sleep_ms = atoi(argv[3]);

	fprintf(stderr, "Mounting...\n");
	if (mount("none", base_path, "tmpfs", 0, NULL) == -1)
		err(1, "mount/tmpfs");
	if (mount("none", base_path, NULL, MS_PRIVATE, NULL) == -1)
		err(1, "mount/MS_PRIVATE");
	for (i = 0; i < nr_mounts; i++) {
		sprintf(name, "%s/%d", base_path, i);
		if (mkdir(name, 0755) == -1)
			err(1, "mkdir");
		if (mount("none", name, "tmpfs", 0, NULL) == -1)
			err(1, "mount/tmpfs");
		if (mount("none", name, NULL, MS_PRIVATE, NULL) == -1)
			err(1, "mount/MS_PRIVATE");
		if (sleep_ms)
			usleep(sleep_ms * 1000);
	}
	fprintf(stderr, "Press ENTER\n");
	getchar();

	fprintf(stderr, "Unmounting...\n");
	if (umount2(base_path, MNT_DETACH) == -1)
		err(1, "umount");

	fprintf(stderr, "Done\n");

	return 0;
}

[-- Attachment #3: watch_mounts.c --]
[-- Type: text/x-csrc, Size: 3380 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <err.h>

struct index {
	struct index *next;
	struct index *prev;
	const char *line;
};

struct state {
	size_t bufsize;
	char *buf;
	size_t index_size;
	struct index *index;
	struct index head;
};

static void read_mountinfo(struct pollfd *pfd, char *buf, size_t bufsize)
{
	int readcnt, backoff = 0, retry = 0;
	size_t len;
	ssize_t res;

retry:
	if (lseek(pfd->fd, 0, SEEK_SET) == (off_t) -1)
		err(1, "lseek");
	len = 0;
	readcnt = 0;
	do {
		if (len >= bufsize - 4096)
			errx(1, "buffer overrun");
		res = read(pfd->fd, buf + len, bufsize - len);
		if (res == -1)
			err(1, "read");
		len += res;

		if (!res || !(++readcnt % 16)) {
			if (poll(pfd, 1, 0) == -1)
				err(1, "poll/0");
			if (pfd->revents & POLLPRI) {
				if (!backoff) {
					backoff++;
					goto retry;
				}
				if (!retry) {
					fprintf(stderr, "retry.");
					retry = 1;
				}
				do {
					usleep(backoff * 1000);
					if (backoff < 128)
						backoff *= 2;
					if (poll(pfd, 1, 0) == -1)
						err(1, "poll/0");
				} while (pfd->revents & POLLPRI);
				goto retry;
			}
		}
	} while (res);
	buf[len] = '\0';

	if (retry) {
		fprintf(stderr, "..\n");
		retry = 0;
	}
}

static void add_index(struct state *s, struct index *this, const char *line)
{
	struct index *prev = s->head.prev, *next = &s->head;

	if (this->line)
		errx(1, "index corruption");

	this->line = line;
	this->next = next;
	this->prev = prev;
	prev->next = next->prev = this;
}

static void del_index(struct index *this)
{
	struct index *prev = this->prev, *next = this->next;

	this->line = NULL;
	prev->next = next;
	next->prev = prev;
}

static void diff_mountinfo(struct state *old, struct state *cur)
{
	char *line, *end;
	struct index *this;
	int mntid;

	cur->head.next = cur->head.prev = &cur->head;
	for (line = cur->buf; line[0]; line = end + 1) {
		end = strchr(line, '\n');
		if (!end)
			errx(1, "parsing (1)");
		*end = '\0';
		if (sscanf(line, "%i", &mntid) != 1)
			errx(1, "parsing (2)");
		if (mntid < 0 || (size_t) mntid >= cur->index_size)
			errx(1, "index overflow");
		add_index(cur, &cur->index[mntid], line);

		this = &old->index[mntid];
		if (this->line) {
			if (strcmp(this->line, line))
				printf("* %s\n", line);
			del_index(this);
		} else {
			printf("+ %s\n", line);
		}
	}
	while (old->head.next != &old->head) {
		this = old->head.next;
		printf("- %s\n", this->line);
		del_index(this);
	}
	fflush(stdout);
}

int main(void)
{
	struct state state[2], *old = &state[0], *cur = &state[1], *tmp;
	struct pollfd pfd = { .events = POLLPRI };

	old->index_size = cur->index_size = 131072;
	old->bufsize = cur->bufsize = cur->index_size * 128;
	old->index = calloc(old->index_size, sizeof(struct index));
	cur->index = calloc(cur->index_size, sizeof(struct index));
	old->buf = malloc(old->bufsize);
	cur->buf = malloc(cur->bufsize);
	if (!old->index || !cur->index || !old->buf || !cur->buf)
		err(1, "allocating buffers");

	old->buf[0] = '\0';
	old->head.prev = old->head.next = &old->head;

	pfd.fd = open("/proc/self/mountinfo", O_RDONLY);
	if (pfd.fd == -1)
		err(1, "open");

	while (1) {
		read_mountinfo(&pfd, cur->buf, cur->bufsize);
		diff_mountinfo(old, cur);

		tmp = old;
		old = cur;
		cur = tmp;

		if (poll(&pfd, 1, -1) == -1)
			err(1, "poll/inf");
	}
}

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-07 13:59                                     ` Miklos Szeredi
@ 2020-04-07 15:53                                       ` Lennart Poettering
  2020-04-07 16:06                                         ` Miklos Szeredi
  0 siblings, 1 reply; 99+ messages in thread
From: Lennart Poettering @ 2020-04-07 15:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Di, 07.04.20 15:59, Miklos Szeredi (miklos@szeredi.hu) wrote:

> On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote:
> > > Right now, when you have n mounts, and any mount changes, or one is
> > > added or removed then we have to parse the whole mount table again,
> > > asynchronously, processing all n entries again, every frickin
> > > time. This means the work to process n mounts popping up at boot is
> > > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > > fixed, by some mount API that can send us minimal notifications about
> > > what happened and where, then this becomes O(n), which is totally OK.
>
> Something's not right with the above statement.  Hint: if there are
> lots of events in quick succession, you can batch them quite easily to
> prevent overloading the system.
>
> Wrote a pair of utilities to check out the capabilities of the current
> API.   The first one just creates N mounts, optionally sleeping
> between each.  The second one watches /proc/self/mountinfo and
> generates individual (add/del/change) events based on POLLPRI and
> comparing contents with previous instance.
>
> First use case: create 10,000 mounts, then start the watcher and
> create 1000 mounts with a 50ms sleep between them.  Total time (user +
> system) consumed by the watcher: 25s.  This is indeed pretty dismal,
> and a per-mount query will help tremendously.  But it's still "just"
> 25ms per mount, so if the mounts are far apart (which is what this
> test is about), this won't thrash the system.  Note how this is
> self-regulating: if the load is high, it will automatically batch more
> requests, preventing overload.  It is also prone to lose pairs of add
> + remove in this case (and so is the ring buffer based one from
> David).

We will batch requests too in systemd, of course, necessarily, given
that the /p/s/mi inotify stuff is async. Thing though is that this
means we buy lower CPU usage — working around the O(n²) issue — by
introducing artificially higher latencies. We usually want to boot
quickly, and not artificially slow.

Sure one can come up with some super smart scheme how to tweak the
artificial latencies, how to grow them, how to shrink them, depending
on a perceived flood of events, some backing off scheme. But that's
just polishing a turd, if all we want is proper queued change
notification without the O(n²) behaviour.

I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By
coalescing wake-up events you just lower the n again, probably
linearly, but that still means we pay O(n²), which sucks.
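
A rough way to see this, as a toy model only: assume the watcher
re-parses the whole table once per batch of b coalesced events while n
mounts appear one after another. The function name and the model itself
are assumptions made for illustration, not anything measured in systemd.

```c
#include <stdint.h>

/*
 * Entries parsed when n mounts appear and the full table is re-read
 * once per batch of `batch` events.  After the j-th batch the table
 * holds roughly j * batch entries, so the total is about
 * n² / (2 * batch): smaller by a constant factor, still quadratic.
 */
uint64_t batched_rescan_cost(uint64_t n, uint64_t batch)
{
	uint64_t total = 0, mounted = 0;

	if (batch == 0)
		return 0;   /* no wake-ups at all in this model */

	while (mounted < n) {
		uint64_t step = (n - mounted < batch) ? n - mounted : batch;

		mounted += step;    /* one coalesced wake-up */
		total += mounted;   /* re-parse everything mounted so far */
	}
	return total;
}
```

With n = 1000, a batch size of 1 gives 500,500 parsed entries and a
batch size of 10 gives 50,500. Doubling n to 2000 at batch size 10
gives 201,000, roughly four times as much, which is the quadratic
growth in question.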

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-07 15:53                                       ` Lennart Poettering
@ 2020-04-07 16:06                                         ` Miklos Szeredi
  0 siblings, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-04-07 16:06 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Ian Kent, David Howells, Christian Brauner, Linus Torvalds,
	Al Viro, dray, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, andres, keyrings, linux-fsdevel, linux-kernel,
	Aleksa Sarai

On Tue, Apr 7, 2020 at 5:53 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Di, 07.04.20 15:59, Miklos Szeredi (miklos@szeredi.hu) wrote:
>
> > On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote:
> > > > Right now, when you have n mounts, and any mount changes, or one is
> > > > added or removed then we have to parse the whole mount table again,
> > > > asynchronously, processing all n entries again, every frickin
> > > > time. This means the work to process n mounts popping up at boot is
> > > > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > > > fixed, by some mount API that can send us minimal notifications about
> > > > what happened and where, then this becomes O(n), which is totally OK.
> >
> > Something's not right with the above statement.  Hint: if there are
> > lots of events in quick succession, you can batch them quite easily to
> > prevent overloading the system.
> >
> > Wrote a pair of utilities to check out the capabilities of the current
> > API.   The first one just creates N mounts, optionally sleeping
> > between each.  The second one watches /proc/self/mountinfo and
> > generates individual (add/del/change) events based on POLLPRI and
> > comparing contents with previous instance.
> >
> > First use case: create 10,000 mounts, then start the watcher and
> > create 1000 mounts with a 50ms sleep between them.  Total time (user +
> > system) consumed by the watcher: 25s.  This is indeed pretty dismal,
> > and a per-mount query will help tremendously.  But it's still "just"
> > 25ms per mount, so if the mounts are far apart (which is what this
> > test is about), this won't thrash the system.  Note how this is
> > self-regulating: if the load is high, it will automatically batch more
> > requests, preventing overload.  It is also prone to lose pairs of add
> > + remove in this case (and so is the ring buffer based one from
> > David).
>
> We will batch requests too in systemd, of course, necessarily, given
> that the /p/s/mi inotify stuff is async. Thing though is that this
> means we buy lower CPU usage — working around the O(n²) issue — by
> introducing artificially higher latencies. We usually want to boot
> quickly, and not artificially slow.
>
> Sure one can come up with some super smart scheme how to tweak the
> artificial latencies, how to grow them, how to shrink them, depending
> on a perceived flood of events, some backing off scheme. But that's
> just polishing a turd, if all we want is proper queued change
> notification without the O(n²) behaviour.
>
> I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By
> coalescing wake-up events you just lower the n again, probably
> linearly, but that still means we pay O(n²), which sucks.

Obviously.  But you keep ignoring event queue overflows; it's
basically guaranteed to happen with a sizable mount storm and then you
are back to O(n^2).

Give it some testing please, as Linus is not going to take any
solution without an actual use case and testing.  When you come back
and say that fsinfo(2) works fine with systemd and a 100k mount/umount
storm, then we can have a look at the performance budget and revisit
the fine points of API design.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Upcoming: Notifications, FS notifications and fsinfo()
  2020-04-06 18:48                           ` Lennart Poettering
@ 2020-04-08  3:36                             ` Linus Torvalds
  0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2020-04-08  3:36 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Karel Zak, J. Bruce Fields, Miklos Szeredi, Ian Kent,
	David Howells, Christian Brauner, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, andres, keyrings, linux-fsdevel,
	Linux Kernel Mailing List, Aleksa Sarai

On Mon, Apr 6, 2020 at 11:48 AM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Mo, 06.04.20 09:34, Linus Torvalds (torvalds@linux-foundation.org) wrote:
>
> > On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <kzak@redhat.com> wrote:
> > >
> > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > > >
> > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > > and ".." and returns true if they have different st_dev or the same
> > > > st_ino.  Comparing mount ids sounds better.
> > >
> > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > > For mountpoint(1) we search the directory in /proc/self/mountinfo.
> >
> > These days you should probably use openat2() with RESOLVE_NO_XDEV.
>
> Note that opening a file is relatively "heavy" i.e. typically triggers
> autofs and stuff, and results in security checks (which can fail and
> such, and show up in audit).

For the use that Bruce outlined, openat2() with RESOLVE_NO_XDEV is
absolutely the right thing.

He already did the stat() of the file (and ".."), RESOLVE_NO_XDEV is
only an improvement. It's also a lot better than trying to parse
mountinfo.

Now, I don't disagree that a statx() flag to also indicate "that's a
top-level mount" might be a good idea, and may be the right answer for
other cases.

I'm just saying that considering what Bruce does now, RESOLVE_NO_XDEV
sounds like the no-brainer approach, and needs no new support outside
of what we already had for other reasons.

(And O_PATH _may_ or may not be part of what you want to do, it's an
independent separate issue, but automount behavior wrt an O_PATH lookup
is somewhat unclear - see Al's other emails on that subject)
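
As a rough illustration of the RESOLVE_NO_XDEV approach, the sketch
below asks the kernel whether resolving a name relative to a directory
fd crosses a mount boundary.  The helper name crosses_mount() is made
up for this example, the use of O_PATH is an assumption (see the caveat
above about automounts), and it needs headers and a kernel new enough
to provide openat2() (Linux 5.6 or later).  It is a sketch, not what
nfs-utils or mountpoint(1) actually do.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_NO_XDEV */

/*
 * Hypothetical helper: returns 1 if looking up `name` relative to `dirfd`
 * crosses a mount boundary, 0 if it does not, and -1 (errno set) on any
 * other failure, including kernels without openat2().
 */
static int crosses_mount(int dirfd, const char *name)
{
	struct open_how how;
	long fd;

	memset(&how, 0, sizeof(how));
	how.flags = O_PATH | O_CLOEXEC;	/* assumption: no need to open the file itself */
	how.resolve = RESOLVE_NO_XDEV;	/* refuse to traverse mount points */

	/* glibc may not provide a wrapper, so go through syscall(2). */
	fd = syscall(SYS_openat2, dirfd, name, &how, sizeof(how));
	if (fd >= 0) {
		close((int)fd);
		return 0;
	}
	return errno == EXDEV ? 1 : -1;
}
```

A caller would open the parent directory first and pass the last path
component as `name`; an EXDEV result then means something is mounted on
that component.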

             Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [GIT PULL] General notification queue and key notifications
  2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
  2020-03-31  6:51   ` Stephen Rothwell
@ 2020-06-02 15:55   ` David Howells
  2020-06-03  2:15     ` Ian Kent
                       ` (4 more replies)
  1 sibling, 5 replies; 99+ messages in thread
From: David Howells @ 2020-06-02 15:55 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Date: Tue, 02 Jun 2020 16:51:44 +0100

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.

Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

Without this, g-o-a has to constantly poll a keyring-based kerberos cache
to find out if kinit has changed anything.

[[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
   Ian Kent have been working on making libmount use them, preparatory to
   working on systemd:

	https://github.com/karelzak/util-linux/commits/topic/fsinfo
	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public

   Development has stalled briefly due to other commitments, so I'm not
   sure I can ask you to pull those parts of the series for now.  Christian
   Brauner would like to use them in lxc, but hasn't started.
   ]]


LSM hooks are included:

 (1) A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop.  This facility will allow the desktop to avoid the need to
poll.


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
     	 O_* flag if I can avoid it - should I add a pipe3() system call
     	 instead?

     The pipe is then configured::

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in each case the fd argument indicates the queue and the number
     after it is a tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.
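
To make points (5), (6) and (9) a little more concrete, here is a
minimal userspace sketch of walking a buffer of records as it might
come back from read().  Everything named rec_* is hypothetical.  The
24-bit type, 8-bit subtype and 0-255 tag follow the description above;
the exact bit positions and the 7-bit length field are assumptions made
for this example, so check include/uapi/linux/watch_queue.h for the
real layout before relying on any of it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed positions inside the info word (illustrative only). */
#define REC_INFO_LENGTH		0x0000007fu	/* record length in bytes */
#define REC_INFO_ID		0x0000ff00u	/* tag given when the watch was set */
#define REC_INFO_ID_SHIFT	8

/* Assumed on-the-wire header: two 32-bit words. */
struct rec_header {
	uint32_t type_subtype;	/* low 24 bits: type, high 8 bits: subtype */
	uint32_t info;
};

struct rec_view {
	uint32_t type;		/* one of ~16 million types */
	uint8_t  subtype;	/* one of 256 filterable subtypes */
	uint8_t  tag;		/* distinguishes the watch that fired */
	size_t   len;		/* total record length, header included */
};

/* Decode the record at buf[off]; -1 if truncated or the length is implausible. */
static int rec_decode(const unsigned char *buf, size_t buf_len, size_t off,
		      struct rec_view *out)
{
	struct rec_header h;

	if (off > buf_len || buf_len - off < sizeof(h))
		return -1;
	memcpy(&h, buf + off, sizeof(h));	/* avoid unaligned access */

	out->type    = h.type_subtype & 0x00ffffffu;
	out->subtype = (uint8_t)(h.type_subtype >> 24);
	out->tag     = (uint8_t)((h.info & REC_INFO_ID) >> REC_INFO_ID_SHIFT);
	out->len     = h.info & REC_INFO_LENGTH;

	if (out->len < sizeof(h) || out->len > buf_len - off)
		return -1;
	return 0;
}

/* Count the records carrying `tag`; -1 if the buffer is malformed. */
static int rec_count_tag(const unsigned char *buf, size_t buf_len, uint8_t tag)
{
	struct rec_view v;
	size_t off = 0;
	int n = 0;

	while (off < buf_len) {
		if (rec_decode(buf, buf_len, off, &v) < 0)
			return -1;
		if (v.tag == tag)
			n++;
		off += v.len;	/* records are variable-sized */
	}
	return n;
}
```

Stepping by the per-record length rather than a fixed stride is what
lets heterogeneous sources share one buffer; the tag is how a reader
tells which watch a given record came from.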


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David
---
The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce:

  Linux 5.7-rc6 (2020-05-17 16:48:37 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601

for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd:

  smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100)

----------------------------------------------------------------
Notifications over pipes + Keyring notifications

----------------------------------------------------------------
David Howells (12):
      uapi: General notification queue definitions
      security: Add a hook for the point of notification insertion
      pipe: Add O_NOTIFICATION_PIPE
      pipe: Add general notification queue support
      security: Add hooks to rule on setting a watch
      watch_queue: Add a key/keyring notification facility
      Add sample notification program
      pipe: Allow buffers to be marked read-whole-or-error for notifications
      pipe: Add notification lossage handling
      keys: Make the KEY_NEED_* perms an enum rather than a mask
      selinux: Implement the watch_key security hook
      smack: Implement the watch_key and post_notification hooks

 Documentation/security/keys/core.rst               |  57 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/watch_queue.rst                      | 339 +++++++++++
 fs/pipe.c                                          | 242 +++++---
 fs/splice.c                                        |  12 +-
 include/linux/key.h                                |  33 +-
 include/linux/lsm_audit.h                          |   1 +
 include/linux/lsm_hook_defs.h                      |   9 +
 include/linux/lsm_hooks.h                          |  14 +
 include/linux/pipe_fs_i.h                          |  27 +-
 include/linux/security.h                           |  30 +-
 include/linux/watch_queue.h                        | 127 ++++
 include/uapi/linux/keyctl.h                        |   2 +
 include/uapi/linux/watch_queue.h                   | 104 ++++
 init/Kconfig                                       |  12 +
 kernel/Makefile                                    |   1 +
 kernel/watch_queue.c                               | 659 +++++++++++++++++++++
 samples/Kconfig                                    |   6 +
 samples/Makefile                                   |   1 +
 samples/watch_queue/Makefile                       |   7 +
 samples/watch_queue/watch_test.c                   | 186 ++++++
 security/keys/Kconfig                              |   9 +
 security/keys/compat.c                             |   3 +
 security/keys/gc.c                                 |   5 +
 security/keys/internal.h                           |  38 +-
 security/keys/key.c                                |  38 +-
 security/keys/keyctl.c                             | 115 +++-
 security/keys/keyring.c                            |  20 +-
 security/keys/permission.c                         |  31 +-
 security/keys/process_keys.c                       |  46 +-
 security/keys/request_key.c                        |   4 +-
 security/security.c                                |  22 +-
 security/selinux/hooks.c                           |  51 +-
 security/smack/smack_lsm.c                         | 112 +++-
 34 files changed, 2185 insertions(+), 179 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 kernel/watch_queue.c
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
@ 2020-06-03  2:15     ` Ian Kent
  2020-06-08  0:49       ` Ian Kent
  2020-06-10  9:56     ` Christian Brauner
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 99+ messages in thread
From: Ian Kent @ 2020-06-03  2:15 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: viro, dray, kzak, mszeredi, swhiteho, jlayton, andres,
	christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel,
	linux-kernel

On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> 
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak
> and
>    Ian Kent have been working on making libmount use them,
> preparatory to
>    working on systemd:
> 
> 	https://github.com/karelzak/util-linux/commits/topic/fsinfo
> 	
> https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> 
>    Development has stalled briefly due to other commitments, so I'm
> not
>    sure I can ask you to pull those parts of the series for
> now.  Christian
>    Brauner would like to use them in lxc, but hasn't started.
>    ]]

Linus,

Just so you're aware of what has been done and where we are at, here's
a summary.

Karel has done quite a bit of work on libmount (at this stage it's
getting hold of the mount information, aka. fsinfo()) and most of
what I have done is included in that too (which you can see in Karel's
repo above). You can see a couple of bug fixes and a little bit of
new code present in my repo which hasn't been sent over to Karel
yet.

This infrastructure is essential before notifications work is started
which is where we will see the most improvement.

It turns out that while systemd uses libmount it has its own
notifications handling sub-system as it deals with several event
types, not just mount information, in the same area. So, unfortunately,
changes will need to be made there as well as in libmount, more so
than the trivial changes to use fsinfo() via libmount.

That's where we are at the moment and I will get back to it once
I've dealt with a few things I postponed to work on libmount.

If you would like a more detailed account of what we have found I
can provide that too.

Is there anything else you would like from me or Karel?

Ian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-03  2:15     ` Ian Kent
@ 2020-06-08  0:49       ` Ian Kent
  0 siblings, 0 replies; 99+ messages in thread
From: Ian Kent @ 2020-06-08  0:49 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: viro, dray, kzak, mszeredi, swhiteho, jlayton, andres,
	christian.brauner, jarkko.sakkinen, keyrings, linux-fsdevel,
	linux-kernel

On Wed, 2020-06-03 at 10:15 +0800, Ian Kent wrote:
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > [[ With regard to the mount/sb notifications and fsinfo(), Karel
> > Zak
> > and
> >    Ian Kent have been working on making libmount use them,
> > preparatory to
> >    working on systemd:
> > 
> > 	https://github.com/karelzak/util-linux/commits/topic/fsinfo
> > 	
> > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> > 
> >    Development has stalled briefly due to other commitments, so I'm
> > not
> >    sure I can ask you to pull those parts of the series for
> > now.  Christian
> >    Brauner would like to use them in lxc, but hasn't started.
> >    ]]
> 
> Linus,
> 
> Just so you're aware of what has been done and where we are at, here's
> a summary.
> 
> Karel has done quite a bit of work on libmount (at this stage it's
> getting hold of the mount information, aka. fsinfo()) and most of
> what I have done is included in that too (which you can see in Karel's
> repo above). You can see a couple of bug fixes and a little bit of
> new code present in my repo which hasn't been sent over to Karel
> yet.
> 
> This infrastructure is essential before notifications work is started
> which is where we will see the most improvement.
> 
> It turns out that while systemd uses libmount it has its own
> notifications handling sub-system as it deals with several event
> types, not just mount information, in the same area. So,
> unfortunately,
> changes will need to be made there as well as in libmount, more so
> than the trivial changes to use fsinfo() via libmount.
> 
> That's where we are at the moment and I will get back to it once
> I've dealt with a few things I postponed to work on libmount.
> 
> If you would like a more detailed account of what we have found I
> can provide that too.
> 
> Is there anything else you would like from me or Karel?

I think there's a bit more I should say about this.

One reason work hasn't progressed further on this is I spent
quite a bit of time looking at the effects of using fsinfo().

My testing was done by using a large autofs direct mount map of
20000 entries which means that at autofs startup 20000 autofs
mounts must be done and at autofs shutdown those 20000 mounts
must be umounted. Not very scientific but something to use to
get a feel for the effect of our changes.

Initially just using fsinfo() to load all the mount entries was
done to see how that would perform. This was done in a way that
required no modifications to library user code but didn't get
much improvement.

Next loading all the mount ids (alone) for mount entry traversal
was done and the various fields retrieved on-demand (implemented
by Karel).

Loading the entire mount table and then traversing the entries
means the mount table is always possibly out of date. And loading
the ids and getting the fields on-demand might have made that
problem worse. But loading only the mount ids and using an
on-demand method to get needed fields worked surprisingly well.

The main issue is a mount going away while getting the fields.
Testing showed that simply checking the field is valid and
ignoring the entry if it isn't is enough to handle that case.
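
To illustrate the idea (this is not the libmount code), here is a small
sketch of the on-demand pattern: only the mount ids are held up front,
a getter fetches the field on first use, and an entry whose mount has
gone away is simply skipped.  All names here are hypothetical, and
demo_fetch() is a stand-in for whatever per-id query is really used.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-id query: fills buf and returns true if the mount still exists. */
typedef bool (*fetch_fn)(unsigned int mnt_id, char *buf, size_t len);

struct lazy_mount {
	unsigned int id;	/* loaded up front */
	bool loaded;		/* has the field been requested yet? */
	bool valid;		/* false if the mount vanished before we asked */
	char target[256];	/* mount point, filled on demand */
};

/* Getter: fetch on first use and cache; NULL means the mount is gone. */
static const char *lazy_target(struct lazy_mount *m, fetch_fn fetch)
{
	if (!m->loaded) {
		m->valid = fetch(m->id, m->target, sizeof(m->target));
		m->loaded = true;
	}
	return m->valid ? m->target : NULL;
}

/* Traverse the id list, ignoring entries whose field is no longer valid. */
static size_t count_live(struct lazy_mount *tab, size_t n, fetch_fn fetch)
{
	size_t i, live = 0;

	for (i = 0; i < n; i++) {
		if (!lazy_target(&tab[i], fetch))
			continue;	/* mount went away: skip it */
		live++;
	}
	return live;
}

/* Stand-in backend for demonstration: ids 1 and 3 exist, everything else is gone. */
static bool demo_fetch(unsigned int mnt_id, char *buf, size_t len)
{
	const char *path;

	switch (mnt_id) {
	case 1: path = "/mnt/a"; break;
	case 3: path = "/home"; break;
	default: return false;
	}
	if (strlen(path) >= len)
		return false;
	strcpy(buf, path);
	return true;
}
```

The point of routing every access through the getter is that callers
which already use accessor functions need little or no change, which
matches what was seen with the library users mentioned below.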

Also the mount going away after the needed fields have been
retrieved must be handled by callers of libmount as mounts
can just as easily go away after reading the proc based tables.

The case of the underlying mount information changing needs to
be considered too. We will need to do better on that in the
future, but it too is a problem with the proc table handling, and
AFAIK no problems have been logged against libmount for it.

So, all in all, this approach worked pretty well as libmount
users do use the getter access methods to retrieve the mount
entry fields (which is required for the on-demand method to
work). Certainly systemd always uses them (and it looks like
udisks2 does too).

Unfortunately using the libmount on-demand implementation
requires library user code be modified (only a little in
the systemd case) to use the implementation.

Testing showed that we get a 10-15% reduction in overhead, but
CPU usage remained high.

I think processing large numbers of mounts is simply a lot
of work and there are particular cases that will remain that
require the use of the load and traverse method. For example
matching all mounts with a given prefix string (one of the
systemd use cases).

It's hard to get information about this but I can say that
running perf during the autofs start and stop shows the bulk
of the counter hits on the fsinfo() table construction code
so that has to be where the overhead is.

The unavoidable conclusion is that the load and traverse method
that's been imposed on us for so long (even before libmount)
for mount handling is what we need to get away from. After all,
this is essentially where the problem comes from in the first
place. And fsinfo() is designed to not need to use this method
for getting mount information for that reason.

There's also the notifications side of things which is the next
area to work on. Looking at systemd I see that monitoring the
proc mount table leads to a load, traverse, and process of the
entire table for every single notification. It's clear that's
because of the (what I'll call) anonymous notifications that we
have now.

The notifications in David's series carry event specific
information, for example the mount id for mount notifications
and the libmount fsinfo() implementation is written to use the
mount id (lowest overhead lookup option), so there has to be
significant improvement for this case.

But systemd has its own notifications handling code so there
will need to be non-trivial changes there as well as changes
in libmount.

Bottom line is we have a bit of a challenge with this because we
are trying to change coding practices developed over many years
that, necessarily, use a load/traverse method and it's going to
take quite a while to change these coding practices.

My question is, is there something specific, besides what we are
doing, that you'd like to see done now in order to get the series
merged?

Ian

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
  2020-06-03  2:15     ` Ian Kent
@ 2020-06-10  9:56     ` Christian Brauner
  2020-06-10 11:12     ` Karel Zak
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 99+ messages in thread
From: Christian Brauner @ 2020-06-10  9:56 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, jarkko.sakkinen, keyrings, linux-fsdevel, linux-kernel

On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote:
> Date: Tue, 02 Jun 2020 16:51:44 +0100
> 
> Hi Linus,
> 
> Can you pull this, please?  It adds a general notification queue concept
> and adds an event source for keys/keyrings, such as linking and unlinking
> keys and changing their attributes.
> 
> Thanks to Debarshi Ray, we do have a pull request to use this to fix a
> problem with gnome-online-accounts - as mentioned last time:
> 
>     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
> 
> Without this, g-o-a has to constantly poll a keyring-based kerberos cache
> to find out if kinit has changed anything.
> 
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and

The mount/sb notification and fsinfo() stuff is something we'd like to
use. (And then later extend to allow for supervised mounts where a
container manager can supervise the mounts of an unprivileged
container.)
I'm not sure if the mount notifications are already part of this pr.

Christian

>    Ian Kent have been working on making libmount use them, preparatory to
>    working on systemd:
> 
> 	https://github.com/karelzak/util-linux/commits/topic/fsinfo
> 	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> 
>    Development has stalled briefly due to other commitments, so I'm not
>    sure I can ask you to pull those parts of the series for now.  Christian
>    Brauner would like to use them in lxc, but hasn't started.
>    ]]
> 
> 
> LSM hooks are included:
> 
>  (1) A set of hooks are provided that allow an LSM to rule on whether or
>      not a watch may be set.  Each of these hooks takes a different
>      "watched object" parameter, so they're not really shareable.  The LSM
>      should use current's credentials.  [Wanted by SELinux & Smack]
> 
>  (2) A hook is provided to allow an LSM to rule on whether or not a
>      particular message may be posted to a particular queue.  This is given
>      the credentials from the event generator (which may be the system) and
>      the watch setter.  [Wanted by Smack]
> 
> I've provided SELinux and Smack with implementations of some of these hooks.
> 
> 
> WHY
> ===
> 
> Key/keyring notifications are desirable because if you have your kerberos
> tickets in a file/directory, your Gnome desktop will monitor that using
> something like fanotify and tell you if your credentials cache changes.
> 
> However, we also have the ability to cache your kerberos tickets in the
> session, user or persistent keyring so that it isn't left around on disk
> across a reboot or logout.  Keyrings, however, cannot currently be
> monitored asynchronously, so the desktop has to poll for it - not so good
> on a laptop.  This facility will allow the desktop to avoid the need to
> poll.
> 
> 
> DESIGN DECISIONS
> ================
> 
>  (1) The notification queue is built on top of a standard pipe.  Messages
>      are effectively spliced in.  The pipe is opened with a special flag:
> 
> 	pipe2(fds, O_NOTIFICATION_PIPE);
> 
>      The special flag has the same value as O_EXCL (which doesn't seem like
>      it will ever be applicable in this context)[?].  It is given up front
>      to make it a lot easier to prohibit splice and co. from accessing the
>      pipe.
> 
>      [?] Should this be done some other way?  I'd rather not use up a new
>      	 O_* flag if I can avoid it - should I add a pipe3() system call
>      	 instead?
> 
>      The pipe is then configured::
> 
> 	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
> 	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 
>      Messages are then read out of the pipe using read().
> 
>  (2) It should be possible to allow write() to insert data into the
>      notification pipes too, but this is currently disabled as the kernel
>      has to be able to insert messages into the pipe *without* holding
>      pipe->mutex and the code to make this work needs careful auditing.
> 
>  (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
>      because of the pipe->mutex issue and also because they sometimes want
>      to revert what they just did - but one or more notification messages
>      might've been interleaved in the ring.
> 
>  (4) The kernel inserts messages with the wait queue spinlock held.  This
>      means that pipe_read() and pipe_write() have to take the spinlock to
>      update the queue pointers.
> 
>  (5) Records in the buffer are binary, typed and have a length so that they
>      can be of varying size.
> 
>      This allows multiple heterogeneous sources to share a common buffer;
>      there are 16 million types available, of which I've used just a few,
>      so there is scope for others to be used.  Tags may be specified when a
>      watchpoint is created to help distinguish the sources.
> 
>  (6) Records are filterable as types have up to 256 subtypes that can be
>      individually filtered.  Other filtration is also available.
> 
>  (7) Notification pipes don't interfere with each other; each may be bound
>      to a different set of watches.  Any particular notification will be
>      copied to all the queues that are currently watching for it - and only
>      those that are watching for it.
> 
>  (8) When recording a notification, the kernel will not sleep, but will
>      rather mark a queue as having lost a message if there's insufficient
>      space.  read() will fabricate a loss notification message at an
>      appropriate point later.
> 
>  (9) The notification pipe is created and then watchpoints are attached to
>      it, using one of:
> 
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
> 	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
> 	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
> 
>      where in both cases, fd indicates the queue and the number after is a
>      tag between 0 and 255.
> 
> (10) Watches are removed if either the notification pipe is destroyed or
>      the watched object is destroyed.  In the latter case, a message will
>      be generated indicating the enforced watch removal.
> 
> 
> Things I want to avoid:
> 
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
> 
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
> 
>  (3) Letting users see events they shouldn't be able to see.
> 
> 
> TESTING AND MANPAGES
> ====================
> 
>  (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
>      making use of notifications.  Proposed manual pages can also be found
>      on this branch, though a couple of them really need to go to the main
>      manpages repository instead.
> 
>      If the kernel supports the watching of keys, then running "make test"
>      on that branch will cause the testing infrastructure to spawn a
>      monitoring process on the side that monitors a notifications pipe for
>      all the key/keyring changes induced by the tests and they'll all be
>      checked off to make sure they happened.
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
> 
>  (*) A test program is provided (samples/watch_queue/watch_test) that can
>      be used to monitor for keyrings, mount and superblock events.
>      Information on the notifications is simply logged to stdout.
> 
> Thanks,
> David
> ---
> The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce:
> 
>   Linux 5.7-rc6 (2020-05-17 16:48:37 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601
> 
> for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd:
> 
>   smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100)
> 
> ----------------------------------------------------------------
> Notifications over pipes + Keyring notifications
> 
> ----------------------------------------------------------------
> David Howells (12):
>       uapi: General notification queue definitions
>       security: Add a hook for the point of notification insertion
>       pipe: Add O_NOTIFICATION_PIPE
>       pipe: Add general notification queue support
>       security: Add hooks to rule on setting a watch
>       watch_queue: Add a key/keyring notification facility
>       Add sample notification program
>       pipe: Allow buffers to be marked read-whole-or-error for notifications
>       pipe: Add notification lossage handling
>       keys: Make the KEY_NEED_* perms an enum rather than a mask
>       selinux: Implement the watch_key security hook
>       smack: Implement the watch_key and post_notification hooks
> 
>  Documentation/security/keys/core.rst               |  57 ++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  Documentation/watch_queue.rst                      | 339 +++++++++++
>  fs/pipe.c                                          | 242 +++++---
>  fs/splice.c                                        |  12 +-
>  include/linux/key.h                                |  33 +-
>  include/linux/lsm_audit.h                          |   1 +
>  include/linux/lsm_hook_defs.h                      |   9 +
>  include/linux/lsm_hooks.h                          |  14 +
>  include/linux/pipe_fs_i.h                          |  27 +-
>  include/linux/security.h                           |  30 +-
>  include/linux/watch_queue.h                        | 127 ++++
>  include/uapi/linux/keyctl.h                        |   2 +
>  include/uapi/linux/watch_queue.h                   | 104 ++++
>  init/Kconfig                                       |  12 +
>  kernel/Makefile                                    |   1 +
>  kernel/watch_queue.c                               | 659 +++++++++++++++++++++
>  samples/Kconfig                                    |   6 +
>  samples/Makefile                                   |   1 +
>  samples/watch_queue/Makefile                       |   7 +
>  samples/watch_queue/watch_test.c                   | 186 ++++++
>  security/keys/Kconfig                              |   9 +
>  security/keys/compat.c                             |   3 +
>  security/keys/gc.c                                 |   5 +
>  security/keys/internal.h                           |  38 +-
>  security/keys/key.c                                |  38 +-
>  security/keys/keyctl.c                             | 115 +++-
>  security/keys/keyring.c                            |  20 +-
>  security/keys/permission.c                         |  31 +-
>  security/keys/process_keys.c                       |  46 +-
>  security/keys/request_key.c                        |   4 +-
>  security/security.c                                |  22 +-
>  security/selinux/hooks.c                           |  51 +-
>  security/smack/smack_lsm.c                         | 112 +++-
>  34 files changed, 2185 insertions(+), 179 deletions(-)
>  create mode 100644 Documentation/watch_queue.rst
>  create mode 100644 include/linux/watch_queue.h
>  create mode 100644 include/uapi/linux/watch_queue.h
>  create mode 100644 kernel/watch_queue.c
>  create mode 100644 samples/watch_queue/Makefile
>  create mode 100644 samples/watch_queue/watch_test.c
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
  2020-06-03  2:15     ` Ian Kent
  2020-06-10  9:56     ` Christian Brauner
@ 2020-06-10 11:12     ` Karel Zak
  2020-06-12 21:32       ` Linus Torvalds
                         ` (3 more replies)
  2020-06-13 18:00     ` pr-tracker-bot
  2020-06-17  1:15     ` Williams, Dan J
  4 siblings, 4 replies; 99+ messages in thread
From: Karel Zak @ 2020-06-10 11:12 UTC (permalink / raw)
  To: torvalds
  Cc: David Howells, viro, dray, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel


 Hi Linus,

On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote:
> Can you pull this, please?  It adds a general notification queue concept

I'm trying to use David's notification stuff in userspace, and I guess
feedback is welcome :-)

The notification stuff looks pretty promising, but I do not understand
why we need to use a pipe for this purpose. See the typical userspace use case:

        int pipefd[2], fd;

        if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
                err(EXIT_FAILURE, "pipe2 failed");

        fd = pipefd[0];

All the next operations are done with "fd". It's nowhere used as a
pipe, and nothing uses pipefd[1]. The first impression from this code
is "oh, this is strange; why?".

Is it because we need to create a new file descriptor from nothing?
Why is O_NOTIFICATION_PIPE better than introducing a new syscall,
notifyfd()?

(We have signalfd(), no O_SIGNAL_PIPE, etc.) 

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-10 11:12     ` Karel Zak
@ 2020-06-12 21:32       ` Linus Torvalds
  2020-06-12 22:01       ` Linus Torvalds
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2020-06-12 21:32 UTC (permalink / raw)
  To: Karel Zak
  Cc: David Howells, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, Christian Brauner,
	Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

[ Finally getting around to this since my normal pull queue is now empty ]

On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote:
>
> The notification stuff looks pretty promising, but I do not understand
> why we need to use a pipe for this purpose

The original intent was never to use the "pipe()" system call itself,
only use pipes as the actual transport mechanism (because I do not for
a second believe in the crazy "use sockets" model that a lot of other
people seem to blindly believe in).

But using "pipe()" also allows for non-kernel notification queues (ie
where the events come from a user space process). Then you'd not use
O_NOTIFICATION_PIPE, but O_DIRECT (for a packetized pipe).
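
A minimal sketch of that user-space model, assuming a Linux pipe opened
with O_DIRECT (packet mode, see pipe(2)): each write() of at most
PIPE_BUF bytes becomes one packet, and each read() returns at most one
packet.  The helper names open_packet_pipe(), post_event() and
next_event() are made up for illustration and are not part of any
kernel or libc API.

```c
#define _GNU_SOURCE             /* for pipe2() and O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create a packet-mode pipe; 0 on success, -1 on error. */
int open_packet_pipe(int pipefd[2])
{
        return pipe2(pipefd, O_DIRECT);
}

/* Hypothetical helper: post one event; each write() becomes one packet. */
ssize_t post_event(int wfd, const char *msg)
{
        return write(wfd, msg, strlen(msg));
}

/*
 * Hypothetical helper: fetch the next event.  In packet mode a read()
 * never merges two packets, so one call yields one event.  The buffer
 * is NUL-terminated on success.
 */
ssize_t next_event(int rfd, char *buf, size_t size)
{
        ssize_t n;

        assert(size > 0);
        n = read(rfd, buf, size - 1);
        if (n >= 0)
                buf[n] = '\0';
        return n;
}
```

A producer would call post_event() on pipefd[1] and a consumer
next_event() on pipefd[0].  If the consumer's buffer is smaller than a
packet, the rest of that packet is discarded, so the buffer should be
sized for the largest event.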

> Is it because we need to create a new file descriptor from nothing?
> Why is O_NOTIFICATION_PIPE better than introducing a new syscall,
> notifyfd()?

We could eventually introduce a new system call.

But I most definitely did *NOT* want to see anything like that for any
first gen stuff.  Especially since it wasn't clear who was going to
use it, and whether early trials would literally be done with that
user-space emulation model of using a perfectly regular pipe (just
with packetization).

I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst
it will be a useful marker. I think the only real reason for it was to
avoid any clashes with splice(), which has more complex use of the
pipe buffers.

I'm so far just reading this thread and the arguments for users, and I
haven't yet looked at all the actual details in the pull request - but
last time I had objections to things it wasn't the code, it was the
lack of any use.

             Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-10 11:12     ` Karel Zak
  2020-06-12 21:32       ` Linus Torvalds
@ 2020-06-12 22:01       ` Linus Torvalds
  2020-06-13 13:04       ` David Howells
  2020-06-13 13:24       ` David Howells
  3 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2020-06-12 22:01 UTC (permalink / raw)
  To: Karel Zak
  Cc: David Howells, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, Christian Brauner,
	Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

[ Actually going through the code now ]

On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <kzak@redhat.com> wrote:
>
> All the next operations are done with "fd". It's nowhere used as a
> pipe, and nothing uses pipefd[1].

As an aside, that isn't necessarily true.

In some of the examples, pipefd[1] is used for configuration (sizing
and adding filters), although I think right now that's not really
enforced, and other examples seem to have pipefd[0] do that too.

DavidH: should that perhaps be a hard rule, so that you can pass a
pipefd[0] to readers, while knowing that they can't then change the
kinds of notifications they see.

In the "pipe: Add general notification queue support" commit message,
the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then
in the commit message for "watch_queue: Add a key/keyring notification
facility" it uses pipefd[1].

And that latter example does make sense: using the write-side
pipefd[1] for configuration, while the read-side pipefd[0] is the side
that sees the results. That is also how it would work if you have a
user-mode pipe with the notification source controlling the writing
side - the reading side can obviously not add filters or change the
semantics of the watches.

So that allows a trusted side to add and create filters, while some
untrusted entity can then see the results.

This isn't going to hold up me merging the code, but it would be good
to clarify and make that something that gets enforced if we decide
it's worth it.

It does seem conceptually like a good idea, and potentially actually
useful to clearly separate the domain of "you can add watches and
filters" from "you can see the notifications".
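
A rough sketch of that split, assuming the uapi values from
<linux/watch_queue.h> (O_NOTIFICATION_PIPE as an alias of O_EXCL, and
IOC_WATCH_QUEUE_SET_SIZE as _IO('W', 0x60)): the queue is sized through
the write end, pipefd[1], and only pipefd[0] would be handed to the
reader.  open_watch_pipe() is a made-up helper name, and nothing here
implies the kernel currently refuses the same ioctl on pipefd[0].

```c
#define _GNU_SOURCE             /* for pipe2() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback definitions so the sketch builds against older headers. */
#ifndef O_NOTIFICATION_PIPE
#define O_NOTIFICATION_PIPE O_EXCL
#endif
#ifndef IOC_WATCH_QUEUE_SET_SIZE
#define IOC_WATCH_QUEUE_SET_SIZE _IO('W', 0x60)
#endif

/*
 * Hypothetical helper: create a notification pipe and size its queue
 * through the write end.  Returns 0 on success or a negative errno,
 * e.g. when the running kernel lacks notification-pipe support.
 */
int open_watch_pipe(int pipefd[2], unsigned int nr_notes)
{
        assert(pipefd != NULL);

        if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
                return -errno;

        /* Configuration goes through pipefd[1], the "trusted" side. */
        if (ioctl(pipefd[1], IOC_WATCH_QUEUE_SET_SIZE,
                  (unsigned long)nr_notes) == -1) {
                int err = errno;

                close(pipefd[0]);
                close(pipefd[1]);
                return -err;
        }
        return 0;
}
```

After open_watch_pipe(pipefd, 256) succeeds, the caller would attach
watches and filters itself and pass only pipefd[0] to whatever consumes
the notifications.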

               Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-10 11:12     ` Karel Zak
  2020-06-12 21:32       ` Linus Torvalds
  2020-06-12 22:01       ` Linus Torvalds
@ 2020-06-13 13:04       ` David Howells
  2020-06-13 16:47         ` Linus Torvalds
  2020-06-13 19:22         ` Miklos Szeredi
  2020-06-13 13:24       ` David Howells
  3 siblings, 2 replies; 99+ messages in thread
From: David Howells @ 2020-06-13 13:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Karel Zak, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst
> it will be a useful marker. I think the only real reason for it was to
> avoid any clashes with splice(), which has more complex use of the
> pipe buffers.

The main reason is to prevent splice because the iov_iter rewind for splice
gets quite tricky if the kernel can randomly insert packets into the pipe
buffer in between what splice is inserting.

> I'm so far just reading this thread and the arguments for users, and I
> haven't yet looked at all the actual details in the pull request - but
> last time I had objections to things it wasn't the code, it was the
> lack of any use.

Would you be willing at this point to consider pulling the mount notifications
and fsinfo() which helps support that?  I could whip up pull reqs for those
two pieces - or do you want to see more concrete patches that use it?

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-10 11:12     ` Karel Zak
                         ` (2 preceding siblings ...)
  2020-06-13 13:04       ` David Howells
@ 2020-06-13 13:24       ` David Howells
  3 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-06-13 13:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Karel Zak, Al Viro, dray, Miklos Szeredi,
	Steven Whitehouse, Jeff Layton, Ian Kent, andres,
	Christian Brauner, Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > All the next operations are done with "fd". It's nowhere used as a
> > pipe, and nothing uses pipefd[1].
> 
> As an aside, that isn't necessarily true.
> 
> In some of the examples, pipefd[1] is used for configuration (sizing
> and adding filters), although I think right now that's not really
> enforced, and other examples seem to have pipefd[0] do that too.

The configuration can happen on either end of the pipe.  I just need to be
able to find the pipe object.

> DavidH: should that perhaps be a hard rule, so that you can pass a
> pipefd[0] to readers, while knowing that they can't then change the
> kinds of notifications they see.

You can argue that the other way: that it should be a hard rule that you can
pass pipefd[1] to writers, whilst knowing that they can't then change the kind
of notifications that the kernel can insert into the pipe.  My feeling is that
it's more likely that you would keep the read end yourself and give the write
end away - if at all.  Most likely, IMO, would be that you attach notification
sources and never use the write end directly.

There is some argument for making it so that the notification sources belong
to the read end only and that they keep the write side alive internally -
meaning that you can just close the write end.  All the notification sources
just then disappear when the read end is closed - but dup() might make this
kind of tricky as there is only one pipe object and it's shared between both
ends.  The existence of O_RDWR FIFOs might also make this tricky.

> In the "pipe: Add general notification queue support" commit message,
> the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then
> in the commit message for "watch_queue: Add a key/keyring notification
> facility" it uses pipefd[1].
>
> And that latter example does make sense: using the write-side
> pipefd[1] for configuration, while the read-side pipefd[0] is the side
> that sees the results. That is also how it would work if you have a
> user-mode pipe with the notification source controlling the writing
> side - the reading side can obviously not add filters or change the
> semantics of the watches.
> 
> So that allows a trusted side to add and create filters, while some
> untrusted entity can then see the results.

As stated above, I think you should be looking at this the other way round -
you're more likely to keep the read end for yourself.  If you attach multiple
sources to a pipe, everything they produce comes out mixed together from the
read end of the pipe.

You might even pass the write end to multiple userspace-side event generators,
but I'm not sure it would make sense to pass the read end around unless you
have sufficient flow that you need multiple consumers to keep up with it.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-13 13:04       ` David Howells
@ 2020-06-13 16:47         ` Linus Torvalds
  2020-06-13 17:03           ` Linus Torvalds
  2020-06-13 19:22         ` Miklos Szeredi
  1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2020-06-13 16:47 UTC (permalink / raw)
  To: David Howells
  Cc: Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, Christian Brauner,
	Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

On Sat, Jun 13, 2020 at 6:05 AM David Howells <dhowells@redhat.com> wrote:
>
> Would you be willing at this point to consider pulling the mount notifications
> and fsinfo() which helps support that?  I could whip up pull reqs for those
> two pieces - or do you want to see more concrete patches that use it?

I'd want to see more concrete use cases, but I'd also like to see that
this keyring thing gets used and doesn't find any show-stoppers when
it does.

If we have multiple uses, and one of them notices some problem that
requires any ABI changes, but the other one has already started using
it, we'll have more problems.

          Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-13 16:47         ` Linus Torvalds
@ 2020-06-13 17:03           ` Linus Torvalds
  0 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2020-06-13 17:03 UTC (permalink / raw)
  To: David Howells
  Cc: Karel Zak, Al Viro, dray, Miklos Szeredi, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, Christian Brauner,
	Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

On Sat, Jun 13, 2020 at 9:47 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If we have multiple uses, and one of them notices some problem that
> requires any ABI changes, but the other one has already started using
> it, we'll have more problems.

Ok, it's merged in my tree, although I was somewhat unhappy about the
incomprehensible calling conventions of "get_pipe_info()". The random
second argument just makes no sense when you read the code, it would
have probably been better as a helper function or #define to clarify
the whole "for_splice" thing.

But let's see how it works and what actually happens.

               Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
                       ` (2 preceding siblings ...)
  2020-06-10 11:12     ` Karel Zak
@ 2020-06-13 18:00     ` pr-tracker-bot
  2020-06-17  1:15     ` Williams, Dan J
  4 siblings, 0 replies; 99+ messages in thread
From: pr-tracker-bot @ 2020-06-13 18:00 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

The pull request you sent on Tue, 02 Jun 2020 16:55:04 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6c3297841472b4e53e22e53826eea9e483d993e5

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-13 13:04       ` David Howells
  2020-06-13 16:47         ` Linus Torvalds
@ 2020-06-13 19:22         ` Miklos Szeredi
  1 sibling, 0 replies; 99+ messages in thread
From: Miklos Szeredi @ 2020-06-13 19:22 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Karel Zak, Al Viro, dray, Steven Whitehouse,
	Jeff Layton, Ian Kent, andres, Christian Brauner,
	Jarkko Sakkinen, keyrings, linux-fsdevel,
	Linux Kernel Mailing List

On Sat, Jun 13, 2020 at 3:05 PM David Howells <dhowells@redhat.com> wrote:

> > I'm so far just reading this thread and the arguments for users, and I
> > haven't yet looked at all the actual details in the pull request - but
> > last time I had objections to things it wasn't the code, it was the
> > lack of any use.
>
> Would you be willing at this point to consider pulling the mount notifications
> and fsinfo() which helps support that?  I could whip up pull reqs for those
> two pieces - or do you want to see more concrete patches that use it?

Well, I had some questions and comments for the mount notifications
last time around[1] and didn't yet get a reply.

And the fsinfo stuff is simply immature, please let's not merge it just
yet.  When we have some uses (most notably systemd) running on top of
the current fsinfo interface, we can sit down and discuss how the API
can be cleaned up.

BTW I had a similar experience with the fsconfig() merge, which was
pushed with some unpolished bits and where my comments were also
largely ignored.  So, before asking to pull, please at least *answer*
reviews.  You don't have to agree, but at least consider and think
about the comments.

Thanks,
Miklos

[1] https://lore.kernel.org/linux-fsdevel/CAJfpegspWA6oUtdcYvYF=3fij=3dBnq03b8VMbU9RNMKc+zzjbag@mail.gmail.com/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:55   ` David Howells
                       ` (3 preceding siblings ...)
  2020-06-13 18:00     ` pr-tracker-bot
@ 2020-06-17  1:15     ` Williams, Dan J
  2020-06-23 23:38       ` Dan Williams
  2020-06-24  0:55       ` David Howells
  4 siblings, 2 replies; 99+ messages in thread
From: Williams, Dan J @ 2020-06-17  1:15 UTC (permalink / raw)
  To: torvalds, dhowells
  Cc: raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho,
	linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres,
	keyrings, christian.brauner

Hi David,

On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> Date: Tue, 02 Jun 2020 16:51:44 +0100
> 
> Hi Linus,
> 
> Can you pull this, please?  It adds a general notification queue concept
> and adds an event source for keys/keyrings, such as linking and unlinking
> keys and changing their attributes.
[..]

This commit:

>       keys: Make the KEY_NEED_* perms an enum rather than a mask

...upstream as:

    8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask

...triggers a regression in the libnvdimm unit test that exercises the
encrypted keys used to store nvdimm passphrases. It results in the
below warning.

---

WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140
Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p
ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E)
CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G           OE     5.7.0-rc6+ #155
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:key_task_permission+0xd3/0x140
Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06

RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c
RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01
RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0
R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840
R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540
FS:  00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0
Call Trace:
 lookup_user_key+0xeb/0x6b0
 ? vsscanf+0x3df/0x840
 ? key_validate+0x50/0x50
 ? key_default_cmp+0x20/0x20
 nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
 nvdimm_security_store+0x67d/0xb20 [libnvdimm]
 security_store+0x67/0x1a0 [libnvdimm]
 kernfs_fop_write+0xcf/0x1c0
 vfs_write+0xde/0x1d0
 ksys_write+0x68/0xe0
 do_syscall_64+0x5c/0xa0
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7f863c624547
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547
RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005
RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0
R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0
R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0
irq event stamp: 36976
hardirqs last  enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90
hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466
softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-17  1:15     ` Williams, Dan J
@ 2020-06-23 23:38       ` Dan Williams
  2020-06-24  0:55       ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Dan Williams @ 2020-06-23 23:38 UTC (permalink / raw)
  To: torvalds, dhowells
  Cc: raven, kzak, jarkko.sakkinen, linux-nvdimm, dray, swhiteho,
	linux-kernel, linux-fsdevel, mszeredi, jlayton, viro, andres,
	keyrings, christian.brauner

On Tue, Jun 16, 2020 at 6:15 PM Williams, Dan J
<dan.j.williams@intel.com> wrote:
>
> Hi David,
>
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > Date: Tue, 02 Jun 2020 16:51:44 +0100
> >
> > Hi Linus,
> >
> > Can you pull this, please?  It adds a general notification queue concept
> > and adds an event source for keys/keyrings, such as linking and unlinking
> > keys and changing their attributes.
> [..]
>
> This commit:
>
> >       keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...upstream as:
>
>     8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...triggers a regression in the libnvdimm unit test that exercises the
> encrypted keys used to store nvdimm passphrases. It results in the
> below warning.

This regression is still present in tip of tree. David, have you had a
chance to take a look?



>
> ---
>
> WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140
> Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p
> ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E)
> CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G           OE     5.7.0-rc6+ #155
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:key_task_permission+0xd3/0x140
> Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06
>
> RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297
> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c
> RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01
> RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0
> R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840
> R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540
> FS:  00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0
> Call Trace:
>  lookup_user_key+0xeb/0x6b0
>  ? vsscanf+0x3df/0x840
>  ? key_validate+0x50/0x50
>  ? key_default_cmp+0x20/0x20
>  nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
>  nvdimm_security_store+0x67d/0xb20 [libnvdimm]
>  security_store+0x67/0x1a0 [libnvdimm]
>  kernfs_fop_write+0xcf/0x1c0
>  vfs_write+0xde/0x1d0
>  ksys_write+0x68/0xe0
>  do_syscall_64+0x5c/0xa0
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x7f863c624547
> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547
> RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005
> RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0
> R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0
> R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0
> irq event stamp: 36976
> hardirqs last  enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90
> hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466
> softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0
>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-17  1:15     ` Williams, Dan J
  2020-06-23 23:38       ` Dan Williams
@ 2020-06-24  0:55       ` David Howells
  2020-06-24  1:03         ` Dan Williams
  2020-06-24  1:17         ` David Howells
  1 sibling, 2 replies; 99+ messages in thread
From: David Howells @ 2020-06-24  0:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: dhowells, torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm,
	dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton,
	viro, andres, keyrings, christian.brauner

Dan Williams <dan.j.williams@intel.com> wrote:

> > This commit:
> >
> > >       keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...upstream as:
> >
> >     8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...triggers a regression in the libnvdimm unit test that exercises the
> > encrypted keys used to store nvdimm passphrases. It results in the
> > below warning.
> 
> This regression is still present in tip of tree. David, have you had a
> chance to take a look?

nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
the key for so that the appropriate security checks can take place in SELinux
and Smack.  Note that I have a patch in the works that changes this still
further.

Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
for you?

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-24  0:55       ` David Howells
@ 2020-06-24  1:03         ` Dan Williams
  2020-06-24  1:17         ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: Dan Williams @ 2020-06-24  1:03 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm, dray,
	swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton, viro,
	andres, keyrings, christian.brauner

On Tue, Jun 23, 2020 at 5:55 PM David Howells <dhowells@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > > This commit:
> > >
> > > >       keys: Make the KEY_NEED_* perms an enum rather than a mask
> > >
> > > ...upstream as:
> > >
> > >     8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
> > >
> > > ...triggers a regression in the libnvdimm unit test that exercises the
> > > encrypted keys used to store nvdimm passphrases. It results in the
> > > below warning.
> >
> > This regression is still present in tip of tree. David, have you had a
> > chance to take a look?
>
> nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
> the key for so that the appropriate security checks can take place in SELinux
> and Smack.  Note that I have a patch in the works that changes this still
> further.
>
> Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
> for you?

It does, thanks.

Shall I wait for your further reworks to fix this for v5.8, or is that
v5.9 material?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-24  0:55       ` David Howells
  2020-06-24  1:03         ` Dan Williams
@ 2020-06-24  1:17         ` David Howells
  1 sibling, 0 replies; 99+ messages in thread
From: David Howells @ 2020-06-24  1:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: dhowells, torvalds, raven, kzak, jarkko.sakkinen, linux-nvdimm,
	dray, swhiteho, linux-kernel, linux-fsdevel, mszeredi, jlayton,
	viro, andres, keyrings, christian.brauner

Dan Williams <dan.j.williams@intel.com> wrote:

> Shall I wait for your further reworks to fix this for v5.8, or is that
> v5.9 material?

It could do with stewing in linux-next for a while, so 5.9 probably.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [GIT PULL] General notification queue and key notifications
  2020-06-02 15:51 [GIT PULL] General notification queue and key notifications David Howells
@ 2020-06-02 15:54 ` David Howells
  0 siblings, 0 replies; 99+ messages in thread
From: David Howells @ 2020-06-02 15:54 UTC (permalink / raw)
  Cc: dhowells, torvalds, viro, dray, kzak, mszeredi, swhiteho,
	jlayton, raven, andres, christian.brauner, jarkko.sakkinen,
	keyrings, linux-fsdevel, linux-kernel

Oops - I forgot to include the pull request.  Will resend.

David

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [GIT PULL] General notification queue and key notifications
@ 2020-06-02 15:51 David Howells
  2020-06-02 15:54 ` David Howells
  0 siblings, 1 reply; 99+ messages in thread
From: David Howells @ 2020-06-02 15:51 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, viro, dray, kzak, mszeredi, swhiteho, jlayton, raven,
	andres, christian.brauner, jarkko.sakkinen, keyrings,
	linux-fsdevel, linux-kernel

Hi Linus,

Can you pull this, please?  It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.

Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

Without this, g-o-a has to constantly poll a keyring-based kerberos cache
to find out if kinit has changed anything.

[[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
   Ian Kent have been working on making libmount use them, preparatory to
   working on systemd:

	https://github.com/karelzak/util-linux/commits/topic/fsinfo
	https://github.com/raven-au/util-linux/commits/topic/fsinfo.public

   Development has stalled briefly due to other commitments, so I'm not
   sure I can ask you to pull those parts of the series for now.  Christian
   Brauner would like to use them in lxc, but hasn't started.
   ]]


LSM hooks are included:

 (1) A set of hooks is provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


WHY
===

Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.

However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that they aren't left around on disk
across a reboot or logout.  Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for changes - not so
good on a laptop.  This facility will allow the desktop to avoid the need
to poll.


DESIGN DECISIONS
================

 (1) The notification queue is built on top of a standard pipe.  Messages
     are effectively spliced in.  The pipe is opened with a special flag:

	pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem like
     it will ever be applicable in this context)[?].  It is given up front
     to make it a lot easier to prohibit splice and co. from accessing the
     pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
     	 O_* flag if I can avoid it - should I add a pipe3() system call
     	 instead?

     The pipe is then configured:

	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

 (2) It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the kernel
     has to be able to insert messages into the pipe *without* holding
     pipe->mutex and the code to make this work needs careful auditing.

 (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
     because of the pipe->mutex issue and also because they sometimes want
     to revert what they just did - but one or more notification messages
     might've been interleaved in the ring.

 (4) The kernel inserts messages with the wait queue spinlock held.  This
     means that pipe_read() and pipe_write() have to take the spinlock to
     update the queue pointers.

 (5) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This allows multiple heterogeneous sources to share a common buffer;
     there are 16 million types available, of which I've used just a few,
     so there is scope for others to be used.  Tags may be specified when a
     watchpoint is created to help distinguish the sources.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Notification pipes don't interfere with each other; each may be bound
     to a different set of watches.  Any particular notification will be
     copied to all the queues that are currently watching for it - and only
     those that are watching for it.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's insufficient
     space.  read() will fabricate a loss notification message at an
     appropriate point later.

 (9) The notification pipe is created and then watchpoints are attached to
     it, using one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
	watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in each case the file descriptor (fds[1] or fd) indicates the
     queue and the number after it is a tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed.  In the latter case, a message will
     be generated indicating the enforced watch removal.
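
As an illustration of points (5), (6) and (9), here is a minimal userspace
sketch of walking the records returned by one read() on the pipe.  The
message above only states that records are typed (roughly 16 million
types, so 24 bits), carry one of up to 256 subtypes, have a length, and are
labelled with a tag between 0 and 255.  The exact bit positions used below
(type in the low 24 bits of the first word, subtype in the top 8, length in
the low 7 bits of the second word, tag in bits 8-15) are an assumption made
for this sketch, as are the names note_hdr, note_decode and note_count_tag;
check the uapi header in the series for the real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded view of one record header (illustration only). */
struct note_hdr {
	uint32_t type;    /* record type, 24 bits ("16 million types") */
	uint8_t  subtype; /* subtype, 8 bits (up to 256 per type) */
	uint8_t  tag;     /* tag given when the watch was set, 0-255 */
	uint8_t  len;     /* whole record length in bytes, header included */
};

#define NOTE_HDR_SIZE 8 /* assumed: two 32-bit words */

/*
 * Decode the header at the start of buf.  Returns 0 on success or -1 if
 * the buffer is too short or the length field is implausible.
 */
int note_decode(const unsigned char *buf, size_t avail, struct note_hdr *h)
{
	uint32_t w0, w1;

	assert(buf && h);
	if (avail < NOTE_HDR_SIZE)
		return -1;
	/* memcpy avoids alignment assumptions; words are in host byte order. */
	memcpy(&w0, buf, sizeof(w0));
	memcpy(&w1, buf + 4, sizeof(w1));

	h->type    = w0 & 0x00ffffffu;
	h->subtype = (uint8_t)(w0 >> 24);
	h->len     = (uint8_t)(w1 & 0x7fu);
	h->tag     = (uint8_t)((w1 >> 8) & 0xffu);

	if (h->len < NOTE_HDR_SIZE || h->len > avail)
		return -1;
	return 0;
}

/*
 * Walk all records in one read() worth of data and count those carrying
 * the given tag.  Returns -1 if the buffer is malformed.
 */
int note_count_tag(const unsigned char *buf, size_t n, uint8_t tag)
{
	size_t off = 0;
	int count = 0;

	while (off < n) {
		struct note_hdr h;

		if (note_decode(buf + off, n - off, &h) < 0)
			return -1;
		if (h.tag == tag)
			count++;
		off += h.len; /* records vary in size: step by the length */
	}
	return count;
}
```

A caller would fill a buffer with read() on the read end of the pipe and
pass it, with the byte count returned, to note_count_tag(); the tag is what
lets one queue tell its key, mount and superblock watches apart.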


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it, as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


TESTING AND MANPAGES
====================

 (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
     making use of notifications.  Proposed manual pages can also be found
     on this branch, though a couple of them really need to go to the main
     manpages repository instead.

     If the kernel supports the watching of keys, then running "make test"
     on that branch will cause the testing infrastructure to spawn a
     monitoring process on the side that monitors a notifications pipe for
     all the key/keyring changes induced by the tests and they'll all be
     checked off to make sure they happened.

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

 (*) A test program is provided (samples/watch_queue/watch_test) that can
     be used to monitor for keyring, mount and superblock events.
     Information on the notifications is simply logged to stdout.

Thanks,
David


end of thread, other threads:[~2020-06-24  1:17 UTC | newest]

Thread overview: 99+ messages
2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
2020-03-31  6:51   ` Stephen Rothwell
2020-06-02 15:55   ` David Howells
2020-06-03  2:15     ` Ian Kent
2020-06-08  0:49       ` Ian Kent
2020-06-10  9:56     ` Christian Brauner
2020-06-10 11:12     ` Karel Zak
2020-06-12 21:32       ` Linus Torvalds
2020-06-12 22:01       ` Linus Torvalds
2020-06-13 13:04       ` David Howells
2020-06-13 16:47         ` Linus Torvalds
2020-06-13 17:03           ` Linus Torvalds
2020-06-13 19:22         ` Miklos Szeredi
2020-06-13 13:24       ` David Howells
2020-06-13 18:00     ` pr-tracker-bot
2020-06-17  1:15     ` Williams, Dan J
2020-06-23 23:38       ` Dan Williams
2020-06-24  0:55       ` David Howells
2020-06-24  1:03         ` Dan Williams
2020-06-24  1:17         ` David Howells
2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
2020-04-04 21:13   ` Linus Torvalds
2020-04-05 22:52     ` Andres Freund
2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells
2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi
2020-03-31  9:21   ` Karel Zak
2020-03-30 21:17 ` Christian Brauner
2020-03-31  5:11   ` Miklos Szeredi
2020-03-31  8:15     ` Christian Brauner
2020-03-31  8:34       ` Miklos Szeredi
2020-03-31  8:34     ` Karel Zak
2020-03-31  8:56       ` Miklos Szeredi
2020-03-31  9:49         ` Karel Zak
2020-03-31 12:25         ` Lennart Poettering
2020-03-31 15:10           ` Miklos Szeredi
2020-03-31 15:24             ` Lennart Poettering
2020-03-31 21:56         ` David Howells
2020-03-31 21:54     ` David Howells
2020-04-01  8:43       ` Karel Zak
2020-03-31  7:22   ` Lennart Poettering
2020-03-31 17:31 ` David Howells
2020-03-31 19:42   ` Miklos Szeredi
2020-03-31 19:47   ` David Howells
2020-03-31 21:14   ` David Howells
2020-03-31 21:23   ` David Howells
2020-03-31 21:52 ` David Howells
2020-04-01  9:04   ` Karel Zak
2020-04-01 13:34     ` Miklos Szeredi
2020-04-01 13:55     ` David Howells
2020-04-01 13:58     ` David Howells
2020-04-01 15:25       ` Miklos Szeredi
2020-04-03  9:11         ` Karel Zak
2020-04-01 16:01       ` David Howells
2020-04-01 16:30         ` Miklos Szeredi
2020-04-02 15:22         ` David Howells
2020-04-02 15:24           ` Miklos Szeredi
2020-04-02 15:42           ` David Howells
2020-04-02 15:24         ` David Howells
2020-04-01 14:41   ` Lennart Poettering
2020-04-01 15:33     ` Miklos Szeredi
2020-04-01 16:06     ` David Howells
2020-04-01 16:40       ` Miklos Szeredi
2020-04-02  2:52         ` Ian Kent
2020-04-02 13:52           ` Miklos Szeredi
2020-04-02 14:36             ` Lennart Poettering
2020-04-02 15:22               ` Miklos Szeredi
2020-04-02 15:28                 ` Lennart Poettering
2020-04-02 15:35                   ` Miklos Szeredi
2020-04-02 15:50                     ` Lennart Poettering
2020-04-02 17:20                       ` Miklos Szeredi
2020-04-03 11:08                         ` Lennart Poettering
2020-04-03 11:48                           ` Miklos Szeredi
2020-04-03 15:01                             ` Lennart Poettering
2020-04-06  9:22                               ` Miklos Szeredi
2020-04-06 17:29                                 ` Lennart Poettering
2020-04-07  2:21                                   ` Ian Kent
2020-04-07 13:59                                     ` Miklos Szeredi
2020-04-07 15:53                                       ` Lennart Poettering
2020-04-07 16:06                                         ` Miklos Szeredi
2020-04-02 15:51                 ` David Howells
2020-04-02 15:56                 ` David Howells
2020-04-03  1:44             ` Ian Kent
2020-04-03 11:11               ` Lennart Poettering
2020-04-03 11:38                 ` Miklos Szeredi
2020-04-03 12:05                   ` Richard Weinberger
2020-04-03 15:12                   ` Lennart Poettering
2020-04-03 20:30                     ` J. Bruce Fields
2020-04-06  8:35                       ` Miklos Szeredi
2020-04-06 16:07                         ` J. Bruce Fields
2020-04-06  9:17                       ` Karel Zak
2020-04-06 16:34                         ` Linus Torvalds
2020-04-06 18:46                           ` J. Bruce Fields
2020-04-06 18:48                           ` Lennart Poettering
2020-04-08  3:36                             ` Linus Torvalds
2020-04-03 15:36                   ` David Howells
2020-04-03 15:41                     ` Lennart Poettering
2020-06-02 15:51 [GIT PULL] General notification queue and key notifications David Howells
2020-06-02 15:54 ` David Howells
