linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Howells <dhowells@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: dhowells@redhat.com, viro@zeniv.linux.org.uk, raven@themaw.net,
	kzak@redhat.com, jlayton@redhat.com, mszeredi@redhat.com,
	nicolas.dichtel@6wind.com, christian@brauner.io,
	lennart@poettering.net, linux-api@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [GIT PULL] Filesystem Information
Date: Mon, 03 Aug 2020 16:49:55 +0100	[thread overview]
Message-ID: <1845353.1596469795@warthog.procyon.org.uk> (raw)
In-Reply-To: <1842689.1596468469@warthog.procyon.org.uk>


Hi Linus,

Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.

The patchset is based on top of the mount notifications patchset so that
the mount notification mechanism can be hooked to provide event counters
that can be retrieved with fsinfo(), thereby making it a lot faster to work
out which mounts have changed.

Note that there was a last minute change requested by Miklós: the event
counter bits got moved from the mount notification patchset to this one.
The counters got made atomic_long_t inside the kernel and __u64 in the
UAPI.  The aggregate changes can be assessed by comparing pre-change tag,
fsinfo-core-20200724 to the requested pull tag.

Karel Zak has created preliminary patches that add support to libmount[*]
and Ian Kent has started working on making systemd use these and mount
notifications[**].

[*] https://github.com/karelzak/util-linux/commits/topic/fsinfo

[**] https://lists.freedesktop.org/archives/systemd-devel/2020-July/044914.html


=======
THE WHY
=======

Why do we want this?

Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, then it may return to userspace
     briefly and then continue reading some way into the file.  But changes
     can occur in the interval that may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.

 (4) Querying a specific mount or superblock means searching through
     /proc/mounts and searching by path or mount ID - but we might have an
     fd we want to query.

 (5) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what or whether you can even see the
     change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     may be server-dependent.  Capabilities might even vary at file
     granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.

 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query state of a particular file.
     This could also have application for other filesystems, such as NFS,
     CIFS and ext4.

 [*] In a lot of cases these are probably invariant and can be memcpy'd
     from static data.

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.


================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  They type is fixed for a particular attribute.

 (4) For structure types, the length is also a version.  New fields can be
     tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     This is transparent to the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.


========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.

Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all the same length).

 (4) Opaque blob.

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible, so integer attribute
IDs rather than string and the core does all the buffer allocation and
expansion and all the extensibility support work rather than leaving that
to the filesystems.  This also means that userspace pointers are not
exposed to the filesystem.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  A helper function is provided
     to scan a list of attributes and a filesystem can have multiple such
     lists.

 (2) Version length-dependent structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc",
		     &params, sizeof(params),
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME;
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/",
		     &params, sizeof(params),
		     &server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION;
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, sizeof(params), &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.


David
---
The following changes since commit 841a0dfa511364fa9a8d67512e0643669f1f03e3:

  watch_queue: sample: Display mount tree change notifications (2020-08-03 12:15:38 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/fsinfo-core-20200803

for you to fetch changes up to 19d687f5902b65a158c5bec904d65d0525ea4efe:

  samples: add error state information to test-fsinfo.c (2020-08-03 13:43:46 +0100)

----------------------------------------------------------------
Filesystem information

----------------------------------------------------------------
David Howells (15):
      fsinfo: Introduce a non-repeating system-unique superblock ID
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of the features a filesystem supports
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow mount topology and propagation info to be retrieved
      watch_queue: Mount event counters
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Add support to ext4
      fsinfo: Add an attribute that lists all the visible mounts in a namespace

Jeff Layton (3):
      errseq: add a new errseq_scrape function
      vfs: allow fsinfo to fetch the current state of s_wb_err
      samples: add error state information to test-fsinfo.c

 Documentation/filesystems/fsinfo.rst        | 574 ++++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/Kconfig                                  |   7 +
 fs/Makefile                                 |   1 +
 fs/afs/internal.h                           |   1 +
 fs/afs/super.c                              | 216 ++++++-
 fs/d_path.c                                 |   2 +-
 fs/ext4/Makefile                            |   1 +
 fs/ext4/ext4.h                              |   6 +
 fs/ext4/fsinfo.c                            |  97 +++
 fs/ext4/super.c                             |   3 +
 fs/fsinfo.c                                 | 748 +++++++++++++++++++++++
 fs/internal.h                               |  16 +
 fs/mount.h                                  |   6 +
 fs/mount_notify.c                           |  10 +-
 fs/namespace.c                              | 427 +++++++++++++-
 fs/super.c                                  |   2 +
 include/linux/errseq.h                      |   1 +
 include/linux/fs.h                          |   7 +
 include/linux/fsinfo.h                      | 112 ++++
 include/linux/syscalls.h                    |   4 +
 include/uapi/asm-generic/unistd.h           |   4 +-
 include/uapi/linux/fsinfo.h                 | 344 +++++++++++
 include/uapi/linux/mount.h                  |  13 +-
 kernel/sys_ni.c                             |   1 +
 lib/errseq.c                                |  33 +-
 samples/vfs/Makefile                        |   6 +-
 samples/vfs/test-fsinfo.c                   | 883 ++++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  | 277 +++++++++
 46 files changed, 3808 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c


  reply	other threads:[~2020-08-03 15:50 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-03 15:27 [GIT PULL] Mount notifications David Howells
2020-08-03 15:49 ` David Howells [this message]
2020-08-03 16:42   ` [GIT PULL] Filesystem Information Miklos Szeredi
2020-08-04  2:15     ` Ian Kent
2020-08-04 14:36       ` Miklos Szeredi
2020-08-05  1:33         ` Ian Kent
2020-08-05  8:00           ` Miklos Szeredi
2020-08-05 11:13             ` Ian Kent
2020-08-05  8:24         ` file metadata via fs API (was: [GIT PULL] Filesystem Information) Miklos Szeredi
2020-08-11 13:54           ` Miklos Szeredi
2020-08-11 14:08             ` Al Viro
2020-08-11 14:22               ` Miklos Szeredi
2020-08-11 14:31                 ` Al Viro
     [not found]                   ` <CAAgocE07=vVKpQhG+rjEGO=NEBKZ02gjg4TRPxECAc+RKrzn=Q@mail.gmail.com>
2020-08-11 14:36                     ` Al Viro
2020-08-11 14:36                   ` Miklos Szeredi
2020-08-11 14:42                     ` Al Viro
2020-08-11 14:47                       ` Miklos Szeredi
2020-08-11 15:20             ` Linus Torvalds
2020-08-11 15:30               ` Miklos Szeredi
2020-08-11 16:05                 ` Linus Torvalds
2020-08-11 18:49                   ` Miklos Szeredi
2020-08-11 19:31                     ` Lennart Poettering
2020-08-11 19:50                       ` Christian Brauner
2020-08-11 19:39                   ` Christian Brauner
2020-08-12  0:53                     ` Ian Kent
2020-08-11 15:39               ` Andy Lutomirski
2020-08-11 16:17                 ` Casey Schaufler
2020-08-11 16:30                   ` Linus Torvalds
2020-08-11 20:28                   ` Miklos Szeredi
2020-08-11 20:36                     ` Jann Horn
2020-08-11 20:56                       ` Miklos Szeredi
2020-08-11 21:17                         ` Andy Lutomirski
2020-08-11 21:18                         ` Linus Torvalds
2020-08-12  7:23                           ` Miklos Szeredi
2020-08-12 14:39                             ` Al Viro
2020-08-12 14:46                               ` Miklos Szeredi
2020-08-12 15:08                                 ` Al Viro
2020-08-12 15:13                                   ` Miklos Szeredi
2020-08-12 16:33                                     ` Al Viro
2020-08-12 17:16                                       ` Miklos Szeredi
2020-08-12 17:39                                         ` Al Viro
2020-08-12 18:33                                           ` Al Viro
2020-08-12 21:30                                             ` Al Viro
2020-08-18  9:41                                               ` Miklos Szeredi
2020-08-18  9:30                                             ` Miklos Szeredi
2020-08-12 15:22                                   ` David Howells
2020-08-11 21:20                     ` Al Viro
2020-08-11 21:35                     ` Casey Schaufler
2020-08-11 16:05               ` Al Viro
2020-08-11 16:09                 ` Linus Torvalds
2020-08-11 16:39                   ` Al Viro
2020-08-12 10:14               ` Karel Zak
2020-08-12 13:09                 ` Miklos Szeredi
2020-08-12 13:33                 ` David Howells
2020-08-12 13:54                   ` Miklos Szeredi
2020-08-12  0:05             ` David Howells
2020-08-12  7:55               ` Miklos Szeredi
2020-08-12  8:29               ` David Howells
2020-08-12  8:37                 ` Miklos Szeredi
2020-08-12  9:43                   ` file metadata via fs API Steven Whitehouse
2020-08-12 10:04                     ` Miklos Szeredi
2020-08-12 11:28                       ` Karel Zak
2020-08-12 12:43                         ` Miklos Szeredi
2020-08-13  8:52                           ` Karel Zak
2020-08-12 13:06                         ` David Howells
2020-08-13  1:01                           ` Ian Kent
2020-08-12 18:18               ` file metadata via fs API (was: [GIT PULL] Filesystem Information) Linus Torvalds
2020-08-12 19:34                 ` file metadata via fs API Steven Whitehouse
2020-08-12 19:50                   ` Linus Torvalds
2020-08-13  3:44                     ` Ian Kent
2020-08-13 10:36                     ` Karel Zak
2020-08-14  7:58                     ` Lennart Poettering
2020-08-17 11:32                     ` Steven Whitehouse
2020-08-17 17:15                       ` Linus Torvalds
2020-08-17 22:44                         ` Linus Torvalds
2020-08-18 12:50                           ` Miklos Szeredi
2020-08-18 18:51                             ` Linus Torvalds
2020-08-18 20:18                               ` Miklos Szeredi
2020-08-18 20:53                                 ` Linus Torvalds
2020-08-21 13:17                                   ` Miklos Szeredi
2020-08-19  2:29                               ` Al Viro
2020-08-13  3:53                 ` file metadata via fs API (was: [GIT PULL] Filesystem Information) Jeffrey E Altman
2020-08-14 17:05                   ` Linus Torvalds
2020-08-18 15:01                     ` Jeffrey E Altman
2020-08-14  8:06                 ` Lennart Poettering
2020-08-12 13:54             ` David Howells
2020-08-12 14:10               ` Miklos Szeredi
2020-08-12 14:23               ` David Howells
2020-08-03 22:48 ` [GIT PULL] Mount notifications Ian Kent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1845353.1596469795@warthog.procyon.org.uk \
    --to=dhowells@redhat.com \
    --cc=christian@brauner.io \
    --cc=jlayton@redhat.com \
    --cc=kzak@redhat.com \
    --cc=lennart@poettering.net \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=nicolas.dichtel@6wind.com \
    --cc=raven@themaw.net \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --subject='Re: [GIT PULL] Filesystem Information' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).