Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
@ 2020-02-21 18:01 David Howells
  2020-02-21 18:01 ` [PATCH 01/17] watch_queue: Add security hooks to rule on setting mount and sb watches " David Howells
                   ` (17 more replies)
  0 siblings, 18 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:01 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel


Here is a set of patches that adds system calls that (a) allow
information about the VFS, mount topology, superblock and files to be
retrieved and (b) allow for notifications of mount topology rearrangement
events, mount and superblock attribute changes and other superblock events,
such as errors.


========================
FILESYSTEM NOTIFICATIONS
========================

The watch_mount() system call places a watch on a point in the mount
topology specified by the dirfd, path and at_flags parameters.  All mount
topology change and mount attribute change notifications in the subtree
rooted at that point can be intercepted by the watch.  Watches are ducted
through pipes:

	int fd[2];
	pipe2(fd, O_NOTIFICATION_PIPE);
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	watch_mount(AT_FDCWD, "/", 0, fd[0], 0x02);

Events include:

 - New mount made
 - Mount unmounted
 - Mount expired
 - R/O state changed
 - Other attribute changed
 - Mount moved from
 - Mount moved to

Using filtering, this may be limited in various ways (single mount watch vs
subtree watch, recursive vs non-recursive changes, to-R/O vs to-R/W, mount
vs submount).

Each mount now has a change counter.  Whenever a mount is changed, this
gets incremented.  It can be queried by fsinfo() using either
FSINFO_ATTR_MOUNT_INFO or FSINFO_ATTR_MOUNT_CHILDREN.  The ID of the mount
on which the notification is generated is placed into the notification
message (triggered_on).  If the event involves a second mount as well, such
as creation of a new mount, that gets returned too (changed_mount).


The watch_sb() system call places a watch on the superblock specified by
the dirfd, path and at_flags parameters.  This allows various superblock
events to be monitored for, such as:

 - Transition between R/W and R/O
 - Filesystem errors
 - Quota overrun
 - Network status changes

Each superblock now gets a 64-bit unique superblock identifier and a
notification counter.  The counter is incremented each time one of these
notifications would be generated.  These attributes can be queried using
fsinfo() with FSINFO_ATTR_SB_NOTIFICATIONS.  The identifier is placed into
notification messages.


============================
FILESYSTEM INFORMATION QUERY
============================

The fsinfo() system call allows information about the filesystem at a
particular path point to be queried as a set of attributes, some of which
may have more than one value.

Attribute values are of four basic types:

 (1) Version dependent-length structure (size defined by type).

 (2) Variable-length string (up to 4096, including NUL).

 (3) List of structures (up to INT_MAX size).

 (4) Opaque blob (up to INT_MAX size).

Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but all the values of an attribute have to be of the same type.

I've tried to make the interface as light as possible, so it uses an
integer/enum attribute selector rather than a string, and the core does all
the allocation and extensibility support work rather than leaving that to
the filesystems.
That means that for the first two attribute types, the filesystem will
always see a sufficiently-sized buffer allocated.  Further, this removes
the possibility of the filesystem gaining access to the userspace buffer.


fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name.
     - Server names and addresses.
     - Filesystem-specific information.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct/element size of attributes.

The system is extensible:

 (1) New attributes can be added.  There is no requirement that a
     filesystem implement every attribute.  Note that the core VFS keeps a
     table of types and sizes so it can handle future extensibility rather
     than delegating this to the filesystems.

 (2) Version dependent-length structure attributes can be made larger and
     have additional information tacked on the end, provided it keeps the
     layout of the existing fields.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a structure-typed
     attribute and, as such, (2) applies.  Any bits you wanted but the kernel
     doesn't support are automatically set to 0.
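
The size-as-version rule in (2) can be illustrated with a small userspace
sketch.  This is not code from the patches: copy_versioned_struct() is a
hypothetical helper that only models the behaviour described above (truncate
for a shorter request, zero-fill for a longer one, always report the size of
the data actually available).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: copy a kernel-side structure
 * of kern_size bytes into a caller buffer of user_size bytes.
 */
static size_t copy_versioned_struct(void *user_buf, size_t user_size,
				    const void *kern_buf, size_t kern_size)
{
	size_t n = user_size < kern_size ? user_size : kern_size;

	assert(user_buf && kern_buf);

	/* A caller asking for a shorter struct gets only that much. */
	memcpy(user_buf, kern_buf, n);

	/* A caller asking for a longer struct sees the extra space as 0. */
	if (user_size > n)
		memset((char *)user_buf + n, 0, user_size - n);

	/* In all cases, report the size of the data actually available. */
	return kern_size;
}
```

A caller can compare the returned size with the size it passed in to tell
whether the trailing fields were filled in or merely zeroed.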

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_CELL_NAME,
	};
	char cell_name[256];
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &cell_name, sizeof(cell_name));

would retrieve the name of an AFS cell as a string.

In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_PARAMETERS,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.  I would
prefer this to contain length-prefixed strings so that there's no need to
insert escaping, especially as any character, including '\', can be used as
the separator in cifs and so that binary parameters can be returned (though
that is a lesser issue).
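
To show why length-prefixing removes the need for escaping, here is a
minimal sketch of one possible encoding.  The format (a 2-byte little-endian
length followed by the raw bytes) and the ex_put_param()/ex_get_param()
helpers are assumptions made purely for illustration.  Nothing in this
series defines the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one record to buf: a 2-byte little-endian length, then the raw
 * bytes.  Returns the new offset, or 0 if the record does not fit.
 */
static size_t ex_put_param(uint8_t *buf, size_t cap, size_t off,
			   const void *data, uint16_t len)
{
	assert(buf && data);
	if (off > cap || cap - off < (size_t)len + 2)
		return 0;
	buf[off] = len & 0xff;
	buf[off + 1] = len >> 8;
	memcpy(buf + off + 2, data, len);
	return off + 2 + len;
}

/*
 * Read the record at off.  Stores a pointer to the payload and its length
 * and returns the offset of the next record, or 0 at the end of the data
 * or if the record is truncated.
 */
static size_t ex_get_param(const uint8_t *buf, size_t used, size_t off,
			   const uint8_t **data, uint16_t *len)
{
	assert(buf && data && len);
	if (off > used || used - off < 2)
		return 0;
	*len = (uint16_t)(buf[off] | (buf[off + 1] << 8));
	if (used - off - 2 < *len)
		return 0;
	*data = buf + off + 2;
	return off + 2 + *len;
}
```

Because the reader never scans for a delimiter, a payload may contain '\',
',' or NUL bytes unchanged.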


Two sample programs are provided, one to query filesystem attributes and
the other to display a mount subtree.  Both of them can be given a path or
a mount ID to start at.  Further, the watch_test sample program now watches
for mount events under "/" and for superblock events on whatever superblock
is backing "/mnt" when the program is started.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core


===================
SIGNIFICANT CHANGES
===================

 ver #17:

 (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.

 (*) Rearranged the order in which fsinfo() does things so that the
     superblock operations table can have a function pointer rather than a
     table pointer.  The ->fsinfo() op is now called at least twice, once
     to determine the size of buffer needed and then to retrieve the data.
     If the retrieval step indicates yet more space is needed, the buffer
     will be expanded and that step repeated.

 (*) Merge the element size into the size in the fsinfo_attribute def and
     don't set size for strings or opaques.  Let a helper work that out.
     This means that strings can actually get larger than 4K.

 (*) A helper is provided to scan a list of attributes and call the
     appropriate get function.  This can be called from a filesystem's
     ->fsinfo() method multiple times.  It also handles attribute
     enumeration and info querying.

 (*) Rearranged the patches to put all the notification patches first.
     This allowed some of the bits to be squashed together.  At some point,
     I'll move the notification patches into a different branch.

 ver #16:

 (*) Split the features bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.

 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Add sample code for Ext4 and NFS.

David
---
David Howells (17):
      watch_queue: Add security hooks to rule on setting mount and sb watches
      watch_queue: Implement mount topology and attribute change notifications
      watch_queue: sample: Display mount tree change notifications
      watch_queue: Introduce a non-repeating system-unique superblock ID
      watch_queue: Add superblock notifications
      watch_queue: sample: Display superblock notifications
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Allow mount information to be queried
      fsinfo: sample: Mount listing program
      fsinfo: Allow the mount topology propogation flags to be retrieved
      fsinfo: Query superblock unique ID and notification counter
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Add example support for Ext4
      fsinfo: Add example support for NFS


 Documentation/filesystems/fsinfo.rst        |  491 ++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    3 
 arch/arm/tools/syscall.tbl                  |    3 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    3 
 arch/m68k/kernel/syscalls/syscall.tbl       |    3 
 arch/microblaze/kernel/syscalls/syscall.tbl |    3 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    3 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    3 
 arch/parisc/kernel/syscalls/syscall.tbl     |    3 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    3 
 arch/s390/kernel/syscalls/syscall.tbl       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    3 
 arch/sparc/kernel/syscalls/syscall.tbl      |    3 
 arch/x86/entry/syscalls/syscall_32.tbl      |    3 
 arch/x86/entry/syscalls/syscall_64.tbl      |    3 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    3 
 fs/Kconfig                                  |   28 +
 fs/Makefile                                 |    2 
 fs/afs/internal.h                           |    1 
 fs/afs/super.c                              |  218 +++++++
 fs/d_path.c                                 |    2 
 fs/ext4/Makefile                            |    1 
 fs/ext4/ext4.h                              |    6 
 fs/ext4/fsinfo.c                            |   45 +
 fs/ext4/super.c                             |    3 
 fs/fsinfo.c                                 |  665 +++++++++++++++++++++
 fs/internal.h                               |   12 
 fs/mount.h                                  |   30 +
 fs/mount_notify.c                           |  185 ++++++
 fs/namespace.c                              |  323 ++++++++++
 fs/nfs/Makefile                             |    1 
 fs/nfs/fsinfo.c                             |  230 +++++++
 fs/nfs/internal.h                           |    6 
 fs/nfs/nfs4super.c                          |    3 
 fs/nfs/super.c                              |    3 
 fs/super.c                                  |  156 +++++
 include/linux/dcache.h                      |    1 
 include/linux/fs.h                          |   87 +++
 include/linux/fsinfo.h                      |  110 ++++
 include/linux/lsm_hooks.h                   |   24 +
 include/linux/security.h                    |   16 +
 include/linux/syscalls.h                    |    8 
 include/uapi/asm-generic/unistd.h           |    8 
 include/uapi/linux/fsinfo.h                 |  361 ++++++++++++
 include/uapi/linux/mount.h                  |   10 
 include/uapi/linux/watch_queue.h            |   61 ++
 include/uapi/linux/windows.h                |   35 +
 kernel/sys_ni.c                             |    7 
 samples/vfs/Makefile                        |    7 
 samples/vfs/test-fsinfo.c                   |  847 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  |  243 ++++++++
 samples/watch_queue/watch_test.c            |   76 ++
 security/security.c                         |   14 
 55 files changed, 4365 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/windows.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c




* [PATCH 01/17] watch_queue: Add security hooks to rule on setting mount and sb watches [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
@ 2020-02-21 18:01 ` " David Howells
  2020-02-21 18:02 ` [PATCH 02/17] watch_queue: Implement mount topology and attribute change notifications " David Howells
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:01 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add security hooks that will allow an LSM to rule on whether or not a watch
may be set on a mount or on a superblock.  More than one hook is required
as the watches watch different types of object.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: linux-security-module@vger.kernel.org
---

 include/linux/lsm_hooks.h |   24 ++++++++++++++++++++++++
 include/linux/security.h  |   16 ++++++++++++++++
 security/security.c       |   14 ++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 16530255dc11..c4451ac197ae 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1427,6 +1427,18 @@
  *	Check to see if a process is allowed to watch for event notifications
  *	from devices (as a global set).
  *
+ * @watch_mount:
+ *	Check to see if a process is allowed to watch for mount topology change
+ *	notifications on a mount subtree.
+ *	@watch: The watch object
+ *	@path: The root of the subtree to watch.
+ *
+ * @watch_sb:
+ *	Check to see if a process is allowed to watch for event notifications
+ *	from a superblock.
+ *	@watch: The watch object
+ *	@sb: The superblock to watch.
+ *
  * @post_notification:
  *	Check to see if a watch notification can be posted to a particular
  *	queue.
@@ -1722,6 +1734,12 @@ union security_list_options {
 #ifdef CONFIG_DEVICE_NOTIFICATIONS
 	int (*watch_devices)(void);
 #endif
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	int (*watch_mount)(struct watch *watch, struct path *path);
+#endif
+#ifdef CONFIG_SB_NOTIFICATIONS
+	int (*watch_sb)(struct watch *watch, struct super_block *sb);
+#endif
 #ifdef CONFIG_WATCH_QUEUE
 	int (*post_notification)(const struct cred *w_cred,
 				 const struct cred *cred,
@@ -2020,6 +2038,12 @@ struct security_hook_heads {
 #ifdef CONFIG_DEVICE_NOTIFICATIONS
 	struct hlist_head watch_devices;
 #endif
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	struct hlist_head watch_mount;
+#endif
+#ifdef CONFIG_SB_NOTIFICATIONS
+	struct hlist_head watch_sb;
+#endif
 #ifdef CONFIG_WATCH_QUEUE
 	struct hlist_head post_notification;
 #endif /* CONFIG_WATCH_QUEUE */
diff --git a/include/linux/security.h b/include/linux/security.h
index 910a1efa9a79..2ca2569bc12c 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1306,6 +1306,22 @@ static inline int security_post_notification(const struct cred *w_cred,
 	return 0;
 }
 #endif
+#if defined(CONFIG_SECURITY) && defined(CONFIG_MOUNT_NOTIFICATIONS)
+int security_watch_mount(struct watch *watch, struct path *path);
+#else
+static inline int security_watch_mount(struct watch *watch, struct path *path)
+{
+	return 0;
+}
+#endif
+#if defined(CONFIG_SECURITY) && defined(CONFIG_SB_NOTIFICATIONS)
+int security_watch_sb(struct watch *watch, struct super_block *sb);
+#else
+static inline int security_watch_sb(struct watch *watch, struct super_block *sb)
+{
+	return 0;
+}
+#endif
 
 #ifdef CONFIG_SECURITY_NETWORK
 
diff --git a/security/security.c b/security/security.c
index db7b574c9c70..5c0463444a90 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2004,6 +2004,20 @@ int security_watch_key(struct key *key)
 }
 #endif
 
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+int security_watch_mount(struct watch *watch, struct path *path)
+{
+	return call_int_hook(watch_mount, 0, watch, path);
+}
+#endif
+
+#ifdef CONFIG_SB_NOTIFICATIONS
+int security_watch_sb(struct watch *watch, struct super_block *sb)
+{
+	return call_int_hook(watch_sb, 0, watch, sb);
+}
+#endif
+
 #ifdef CONFIG_DEVICE_NOTIFICATIONS
 int security_watch_devices(void)
 {




* [PATCH 02/17] watch_queue: Implement mount topology and attribute change notifications [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
  2020-02-21 18:01 ` [PATCH 01/17] watch_queue: Add security hooks to rule on setting mount and sb watches " David Howells
@ 2020-02-21 18:02 ` " David Howells
  2020-02-21 18:02 ` [PATCH 03/17] watch_queue: sample: Display mount tree " David Howells
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add a mount notification facility whereby notifications about changes in
mount topology and configuration can be received.  Note that this only
covers vfsmount topology changes and not superblock events.  A separate
facility will be added for that.

Every mount is given a change counter that counts the number of topological
rearrangements in which it is involved and the number of attribute changes
it undergoes.  This allows notification loss to be dealt with.  Later
patches will provide a way to quickly retrieve this value, along with
information about topology and parameters for the superblock.
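
One way a userspace watcher might use the counter to notice lost
notifications is sketched below.  It assumes an unfiltered watch that
receives exactly one record per counter increment on a given mount.  The
ex_mount_state structure and ex_needs_resync() are hypothetical names, and
how the current counter value is obtained is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-mount bookkeeping kept by a hypothetical userspace watcher. */
struct ex_mount_state {
	uint32_t counter_at_sync;	/* change counter at the last full read */
	uint32_t events_seen;		/* records consumed since then */
};

/*
 * Returns true if the counter has moved by a different amount than the
 * number of records seen, i.e. some notifications were probably lost.
 */
static bool ex_needs_resync(const struct ex_mount_state *st,
			    uint32_t counter_now)
{
	assert(st);
	/* Unsigned subtraction stays correct if the counter wraps. */
	return (uint32_t)(counter_now - st->counter_at_sync) != st->events_seen;
}
```

When this returns true, the watcher would re-read the mount's state in full
and reset both fields.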

Firstly, an event queue needs to be created:

	fd = open("/dev/event_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);

then a notification can be set up to report notifications via that queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_MOUNT_NOTIFY,
				.subtype_filter[0] = UINT_MAX,
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	watch_mount(AT_FDCWD, "/", 0, fd, 0x02);

In this case, it would let me monitor the mount topology subtree rooted at
"/" for events.  Mount notifications propagate up the tree towards the
root, so a watch will catch all of the events happening in the subtree
rooted at the watch.
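
The upward propagation can be pictured with a toy model.  The struct and
function below are illustrative stand-ins, not the kernel's data structures:
starting at the mount that changed, each ancestor is visited in turn and any
watch found on the way would receive the event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount; the field names are illustrative only. */
struct toy_mount {
	struct toy_mount *parent;	/* NULL at the root of the tree */
	int mnt_id;
	bool watched;			/* true if a watch is set here */
};

/*
 * Walk from the changed mount towards the root and count the watches that
 * would see the event.  *in_subtree_hits is set to the number of those
 * watches that sit on an ancestor rather than on the changed mount itself.
 */
static int count_watch_hits(const struct toy_mount *changed,
			    int *in_subtree_hits)
{
	int hits = 0;

	assert(changed && in_subtree_hits);
	*in_subtree_hits = 0;
	for (const struct toy_mount *m = changed; m; m = m->parent) {
		if (!m->watched)
			continue;
		hits++;
		if (m != changed)
			(*in_subtree_hits)++;
	}
	return hits;
}
```

In this model, a hit on an ancestor corresponds to a record that would carry
the in-subtree indication described below.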

After setting the watch, records will be placed into the queue when, for
example, a superblock switches between read-write and read-only.  Records
are of the following format:

	struct mount_notification {
		struct watch_notification watch;
		__u32	triggered_on;
		__u32	changed_mount;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_MOUNT_NOTIFY.

	n->watch.subtype will indicate the type of event, such as
	NOTIFY_MOUNT_NEW_MOUNT.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the fifth argument to
	watch_mount(), shifted.

	n->watch.info & NOTIFY_MOUNT_IN_SUBTREE if true indicates that the
	notification was generated in the mount subtree rooted at the watch,
	and not actually in the watch itself.

	n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true indicates that
	the notification was generated by an event (e.g. SETATTR) that was
	applied recursively.  The notification is only generated for the
	object that initially triggered it.

	n->watch.info & NOTIFY_MOUNT_IS_NOW_RO will be used for
	NOTIFY_MOUNT_READONLY, being set if the superblock becomes R/O, and
	being cleared otherwise, and for NOTIFY_MOUNT_NEW_MOUNT, being set
	if the new mount is a submount (e.g. an automount).

	n->watch.info & NOTIFY_MOUNT_IS_SUBMOUNT if true indicates that the
	NOTIFY_MOUNT_NEW_MOUNT notification is in response to a mount
	performed by the kernel (e.g. an automount).

	n->triggered_on indicates the ID of the mount on which the
	notification was generated.

	n->changed_mount indicates the ID of the second mount involved in
	the event, if any (e.g. a newly created mount), or 0 otherwise.

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.  Note also that
the queue can be shared between multiple notifications of various types.
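
Splitting the info word into its length and watch ID fields might look like
the sketch below.  The EX_* masks are assumptions that are believed to match
the WATCH_INFO_LENGTH and WATCH_INFO_ID layout in the watch_queue UAPI
header.  Real code should take the definitions from that header rather than
from here, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the info word; check the UAPI header for real values. */
#define EX_INFO_LENGTH		0x0000007fu	/* record length */
#define EX_INFO_ID		0x0000ff00u	/* watch ID, shifted */
#define EX_INFO_ID_SHIFT	8

struct ex_decoded {
	unsigned int length;	/* length of the whole record */
	unsigned int watch_id;	/* ID given when the watch was set */
};

/* Build an info word from a record length and a watch ID. */
static uint32_t ex_make_info(unsigned int length, unsigned int watch_id)
{
	assert(length <= EX_INFO_LENGTH);
	assert(watch_id <= (EX_INFO_ID >> EX_INFO_ID_SHIFT));
	return length | (watch_id << EX_INFO_ID_SHIFT);
}

/* Extract the length and watch ID, ignoring the type-specific bits. */
static struct ex_decoded ex_decode_info(uint32_t info)
{
	struct ex_decoded d = {
		.length   = info & EX_INFO_LENGTH,
		.watch_id = (info & EX_INFO_ID) >> EX_INFO_ID_SHIFT,
	};

	return d;
}
```

A reader sharing one queue between several watches would typically use the
decoded length to step to the next record and the watch ID to work out which
watch produced it.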

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |    9 +
 fs/Makefile                                 |    1 
 fs/mount.h                                  |   30 ++++
 fs/mount_notify.c                           |  185 +++++++++++++++++++++++++++
 fs/namespace.c                              |   22 +++
 include/linux/dcache.h                      |    1 
 include/linux/syscalls.h                    |    2 
 include/uapi/asm-generic/unistd.h           |    4 -
 include/uapi/linux/watch_queue.h            |   32 +++++
 kernel/sys_ni.c                             |    3 
 27 files changed, 304 insertions(+), 3 deletions(-)
 create mode 100644 fs/mount_notify.c

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 36d42da7466a..b869428033ef 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@
 # 545 reserved for clone3
 547	common	openat2				sys_openat2
 548	common	pidfd_getfd			sys_pidfd_getfd
+549	common	watch_mount			sys_watch_mount
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 4d1cf74a2caa..9c389da9efcc 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -451,3 +451,4 @@
 435	common	clone3				sys_clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 1dd22da1c3a9..75f04a1023be 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		439
+#define __NR_compat_syscalls		440
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 042911e670b8..6817f865cc71 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -358,3 +358,4 @@
 # 435 reserved for clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f4f49fcb76d0..fbf85da75ecb 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 435	common	clone3				__sys_clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4c67b11f9c9e..b05b192da1e2 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
 435	common	clone3				sys_clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 1f9e8ad636cc..0f85d2a033f9 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -376,3 +376,4 @@
 435	n32	clone3				__sys_clone3
 437	n32	openat2				sys_openat2
 438	n32	pidfd_getfd			sys_pidfd_getfd
+439	n32	watch_mount			sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c0b9d802dbf6..905cf9ac0792 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -352,3 +352,4 @@
 435	n64	clone3				__sys_clone3
 437	n64	openat2				sys_openat2
 438	n64	pidfd_getfd			sys_pidfd_getfd
+439	n64	watch_mount			sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ac586774c980..834b26b08d74 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -425,3 +425,4 @@
 435	o32	clone3				__sys_clone3
 437	o32	openat2				sys_openat2
 438	o32	pidfd_getfd			sys_pidfd_getfd
+439	o32	watch_mount			sys_watch_mount
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 52a15f5cd130..badd3449db43 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 435	common	clone3				sys_clone3_wrapper
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 35b61bfc1b1a..b404361bc929 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -519,3 +519,4 @@
 435	nospu	clone3				ppc_clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index bd7bd3581a0f..33071de24511 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 435  common	clone3			sys_clone3			sys_clone3
 437  common	openat2			sys_openat2			sys_openat2
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount		sys_watch_mount			sys_watch_mount
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index c7a30fcd135f..682c125122f4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 # 435 reserved for clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index f13615ecdecc..febf3cd675e3 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -483,3 +483,4 @@
 # 435 reserved for clone3
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c17cb77eb150..085bcc5afdf1 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
+439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..9cfb6b2eb319 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+439	common	watch_mount		__x64_sys_watch_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 85a9ab1bc04d..1a066a43a58b 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -408,3 +408,4 @@
 435	common	clone3				sys_clone3
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
+439	common	watch_mount			sys_watch_mount
diff --git a/fs/Kconfig b/fs/Kconfig
index 708ba336e689..d7039137d538 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -117,6 +117,15 @@ source "fs/verity/Kconfig"
 
 source "fs/notify/Kconfig"
 
+config MOUNT_NOTIFICATIONS
+	bool "Mount topology change notifications"
+	select WATCH_QUEUE
+	help
+	  This option provides support for getting change notifications on the
+	  mount tree topology.  This makes use of the /dev/watch_queue misc
+	  device to handle the notification buffer and provides the
+	  mount_notify() system call to enable/disable watchpoints.
+
 source "fs/quota/Kconfig"
 
 source "fs/autofs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 505e51166973..4477757780d0 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -22,6 +22,7 @@ obj-y +=	no-block.o
 endif
 
 obj-$(CONFIG_PROC_FS) += proc_namespace.o
+obj-$(CONFIG_MOUNT_NOTIFICATIONS) += mount_notify.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
diff --git a/fs/mount.h b/fs/mount.h
index 711a4093e475..3abc5fb49e3c 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@
 #include <linux/poll.h>
 #include <linux/ns_common.h>
 #include <linux/fs_pin.h>
+#include <linux/watch_queue.h>
 
 struct mnt_namespace {
 	atomic_t		count;
@@ -72,6 +73,10 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	atomic_t mnt_change_counter;	/* Number of changes applied */
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
+#endif
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +158,28 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
 {
 	return ns->seq == 0;
 }
+
+extern void post_mount_notification(struct mount *changed,
+				    struct mount_notification *notify);
+
+static inline void notify_mount(struct mount *changed,
+				struct mount *aux,
+				enum mount_notification_subtype subtype,
+				u32 info_flags)
+{
+	atomic_inc(&changed->mnt_change_counter);
+
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	{
+		struct mount_notification n = {
+			.watch.type	= WATCH_TYPE_MOUNT_NOTIFY,
+			.watch.subtype	= subtype,
+			.watch.info	= info_flags | watch_sizeof(n),
+			.triggered_on	= changed->mnt_id,
+			.changed_mount	= aux ? aux->mnt_id : 0,
+		};
+
+		post_mount_notification(changed, &n);
+	}
+#endif
+}
diff --git a/fs/mount_notify.c b/fs/mount_notify.c
new file mode 100644
index 000000000000..2e8ca75d3389
--- /dev/null
+++ b/fs/mount_notify.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Provide mount topology/attribute change notifications.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include "mount.h"
+
+/*
+ * Post mount notifications to all watches going rootwards along the tree.
+ *
+ * Must be called with the mount_lock held.
+ */
+void post_mount_notification(struct mount *changed,
+			     struct mount_notification *notify)
+{
+	const struct cred *cred = current_cred();
+	struct path cursor;
+	struct mount *mnt;
+	unsigned seq;
+
+	seq = 0;
+	rcu_read_lock();
+restart:
+	cursor.mnt = &changed->mnt;
+	cursor.dentry = changed->mnt.mnt_root;
+	mnt = real_mount(cursor.mnt);
+	notify->watch.info &= ~NOTIFY_MOUNT_IN_SUBTREE;
+
+	read_seqbegin_or_lock(&rename_lock, &seq);
+	for (;;) {
+		if (mnt->mnt_watchers &&
+		    !hlist_empty(&mnt->mnt_watchers->watchers)) {
+			if (cursor.dentry->d_flags & DCACHE_MOUNT_WATCH)
+				post_watch_notification(mnt->mnt_watchers,
+							&notify->watch, cred,
+							(unsigned long)cursor.dentry);
+		} else {
+			cursor.dentry = mnt->mnt.mnt_root;
+		}
+		notify->watch.info |= NOTIFY_MOUNT_IN_SUBTREE;
+
+		if (cursor.dentry == cursor.mnt->mnt_root ||
+		    IS_ROOT(cursor.dentry)) {
+			struct mount *parent = READ_ONCE(mnt->mnt_parent);
+
+			/* Escaped? */
+			if (cursor.dentry != cursor.mnt->mnt_root)
+				break;
+
+			/* Global root? */
+			if (mnt == parent)
+				break;
+
+			cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
+			mnt = parent;
+			cursor.mnt = &mnt->mnt;
+		} else {
+			cursor.dentry = cursor.dentry->d_parent;
+		}
+	}
+
+	if (need_seqretry(&rename_lock, seq)) {
+		seq = 1;
+		goto restart;
+	}
+
+	done_seqretry(&rename_lock, seq);
+	rcu_read_unlock();
+}
+
+static void release_mount_watch(struct watch *watch)
+{
+	struct dentry *dentry = (struct dentry *)(unsigned long)watch->id;
+
+	dput(dentry);
+}
+
+/**
+ * sys_watch_mount - Watch for mount topology/attribute changes
+ * @dfd: Base directory to pathwalk from or fd referring to mount.
+ * @filename: Path to mount to place the watch upon
+ * @at_flags: Pathwalk control flags
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE5(watch_mount,
+		int, dfd,
+		const char __user *, filename,
+		unsigned int, at_flags,
+		int, watch_fd,
+		int, watch_id)
+{
+	struct watch_queue *wqueue;
+	struct watch_list *wlist = NULL;
+	struct watch *watch = NULL;
+	struct mount *m;
+	struct path path;
+	unsigned int lookup_flags =
+		LOOKUP_DIRECTORY | LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret;
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+	if ((at_flags & ~(AT_NO_AUTOMOUNT | AT_EMPTY_PATH)) != 0)
+		return -EINVAL;
+	if (at_flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (at_flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(dfd, filename, lookup_flags, &path);
+	if (ret)
+		return ret;
+
+	ret = inode_permission(path.dentry->d_inode, MAY_EXEC);
+	if (ret)
+		goto err_path;
+
+	wqueue = get_watch_queue(watch_fd);
+	ret = PTR_ERR_OR_ZERO(wqueue);
+	if (ret)
+		goto err_path;
+	m = real_mount(path.mnt);
+
+	if (watch_id >= 0) {
+		ret = -ENOMEM;
+		if (!READ_ONCE(m->mnt_watchers)) {
+			wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+			if (!wlist)
+				goto err_wqueue;
+			init_watch_list(wlist, release_mount_watch);
+		}
+
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wlist;
+
+		init_watch(watch, wqueue);
+		watch->id		= (unsigned long)path.dentry;
+		watch->info_id		= (u32)watch_id << 24;
+
+		ret = security_watch_mount(watch, &path);
+		if (ret < 0)
+			goto err_watch;
+
+		down_write(&m->mnt.mnt_sb->s_umount);
+		if (!m->mnt_watchers) {
+			m->mnt_watchers = wlist;
+			wlist = NULL;
+		}
+
+		ret = add_watch_to_object(watch, m->mnt_watchers);
+		if (ret == 0) {
+			spin_lock(&path.dentry->d_lock);
+			path.dentry->d_flags |= DCACHE_MOUNT_WATCH;
+			spin_unlock(&path.dentry->d_lock);
+			dget(path.dentry);
+			watch = NULL;
+		}
+		up_write(&m->mnt.mnt_sb->s_umount);
+	} else {
+		down_write(&m->mnt.mnt_sb->s_umount);
+		ret = remove_watch_from_object(m->mnt_watchers, wqueue,
+					       (unsigned long)path.dentry,
+					       false);
+		up_write(&m->mnt.mnt_sb->s_umount);
+	}
+
+err_watch:
+	kfree(watch);
+err_wlist:
+	kfree(wlist);
+err_wqueue:
+	put_watch_queue(wqueue);
+err_path:
+	path_put(&path);
+	return ret;
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..668f797ae3bd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -498,6 +498,9 @@ static int mnt_make_readonly(struct mount *mnt)
 	smp_wmb();
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
 	unlock_mount_hash();
+	if (ret == 0)
+		notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY,
+			     NOTIFY_MOUNT_IS_NOW_RO);
 	return ret;
 }
 
@@ -506,6 +509,7 @@ static int __mnt_unmake_readonly(struct mount *mnt)
 	lock_mount_hash();
 	mnt->mnt.mnt_flags &= ~MNT_READONLY;
 	unlock_mount_hash();
+	notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY, 0);
 	return 0;
 }
 
@@ -819,6 +823,7 @@ static struct mountpoint *unhash_mnt(struct mount *mnt)
  */
 static void umount_mnt(struct mount *mnt)
 {
+	notify_mount(mnt->mnt_parent, mnt, NOTIFY_MOUNT_UNMOUNT, 0);
 	put_mountpoint(unhash_mnt(mnt));
 }
 
@@ -1159,6 +1164,11 @@ static void mntput_no_expire(struct mount *mnt)
 	mnt->mnt.mnt_flags |= MNT_DOOMED;
 	rcu_read_unlock();
 
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	if (mnt->mnt_watchers)
+		remove_watch_list(mnt->mnt_watchers, mnt->mnt_id);
+#endif
+
 	list_del(&mnt->mnt_instance);
 
 	if (unlikely(!list_empty(&mnt->mnt_mounts))) {
@@ -1453,6 +1463,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		p = list_first_entry(&tmp_list, struct mount, mnt_list);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
+
 		ns = p->mnt_ns;
 		if (ns) {
 			ns->mounts--;
@@ -2078,7 +2089,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		lock_mount_hash();
 	}
 	if (moving) {
+		notify_mount(source_mnt->mnt_parent, source_mnt,
+			     NOTIFY_MOUNT_MOVE_FROM, 0);
 		unhash_mnt(source_mnt);
+		notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_MOVE_TO, 0);
 		attach_mnt(source_mnt, dest_mnt, dest_mp);
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
@@ -2087,6 +2101,11 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 			list_del_init(&source_mnt->mnt_ns->list);
 		}
 		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
+		notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_NEW_MOUNT,
+			     (source_mnt->mnt.mnt_sb->s_flags & SB_RDONLY ?
+			      NOTIFY_MOUNT_IS_NOW_RO : 0) |
+			     (source_mnt->mnt.mnt_sb->s_flags & SB_SUBMOUNT ?
+			      NOTIFY_MOUNT_IS_SUBMOUNT : 0));
 		commit_tree(source_mnt);
 	}
 
@@ -2464,6 +2483,8 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 	mnt->mnt.mnt_flags = mnt_flags;
 	touch_mnt_namespace(mnt->mnt_ns);
 	unlock_mount_hash();
+	notify_mount(mnt, NULL, NOTIFY_MOUNT_SETATTR,
+		     (mnt_flags & MNT_READONLY ? NOTIFY_MOUNT_IS_NOW_RO : 0));
 }
 
 static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt)
@@ -2898,6 +2919,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
 		if (!xchg(&mnt->mnt_expiry_mark, 1) ||
 			propagate_mount_busy(mnt, 1))
 			continue;
+		notify_mount(mnt, NULL, NOTIFY_MOUNT_EXPIRY, 0);
 		list_move(&mnt->mnt_expire, &graveyard);
 	}
 	while (!list_empty(&graveyard)) {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c1488cc84fd9..7b194d778155 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -217,6 +217,7 @@ struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_MOUNT_WATCH		0x80000000 /* There's a mount watch here */
 
 extern seqlock_t rename_lock;
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..1fd43af3b22d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1003,6 +1003,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_watch_mount(int dfd, const char __user *path,
+				unsigned int at_flags, int watch_fd, int watch_id);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..6b5748287883 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -855,9 +855,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_openat2, sys_openat2)
 #define __NR_pidfd_getfd 438
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+#define __NR_watch_mount 439
+__SYSCALL(__NR_watch_mount, sys_watch_mount)
 
 #undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index c3d8320b5d3a..b0f35cf51394 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -14,7 +14,8 @@
 enum watch_notification_type {
 	WATCH_TYPE_META		= 0,	/* Special record */
 	WATCH_TYPE_KEY_NOTIFY	= 1,	/* Key change event notification */
-	WATCH_TYPE__NR		= 2
+	WATCH_TYPE_MOUNT_NOTIFY	= 2,	/* Mount topology change notification */
+	WATCH_TYPE__NR		= 3
 };
 
 enum watch_meta_notification_subtype {
@@ -101,4 +102,33 @@ struct key_notification {
 	__u32	aux;		/* Per-type auxiliary data */
 };
 
+/*
+ * Type of mount topology change notification.
+ */
+enum mount_notification_subtype {
+	NOTIFY_MOUNT_NEW_MOUNT	= 0, /* New mount added */
+	NOTIFY_MOUNT_UNMOUNT	= 1, /* Mount removed manually */
+	NOTIFY_MOUNT_EXPIRY	= 2, /* Automount expired */
+	NOTIFY_MOUNT_READONLY	= 3, /* Mount R/O state changed */
+	NOTIFY_MOUNT_SETATTR	= 4, /* Mount attributes changed */
+	NOTIFY_MOUNT_MOVE_FROM	= 5, /* Mount moved from here */
+	NOTIFY_MOUNT_MOVE_TO	= 6, /* Mount moved to here (compare op_id) */
+};
+
+#define NOTIFY_MOUNT_IN_SUBTREE		WATCH_INFO_FLAG_0 /* Event not actually at watched dentry */
+#define NOTIFY_MOUNT_IS_RECURSIVE	WATCH_INFO_FLAG_1 /* Change applied recursively */
+#define NOTIFY_MOUNT_IS_NOW_RO		WATCH_INFO_FLAG_2 /* Mount changed to R/O */
+#define NOTIFY_MOUNT_IS_SUBMOUNT	WATCH_INFO_FLAG_3 /* New mount is submount */
+
+/*
+ * Mount topology/configuration change notification record.
+ * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
+ * - watch.subtype = enum mount_notification_subtype
+ */
+struct mount_notification {
+	struct watch_notification watch; /* WATCH_TYPE_MOUNT_NOTIFY */
+	__u32	triggered_on;		/* The mount that the notify was on */
+	__u32	changed_mount;		/* The mount that got changed */
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..3e1c5c9d2efe 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -85,6 +85,9 @@ COND_SYSCALL(ioprio_get);
 /* fs/locks.c */
 COND_SYSCALL(flock);
 
+/* fs/mount_notify.c */
+COND_SYSCALL(watch_mount);
+
 /* fs/namei.c */
 
 /* fs/namespace.c */




* [PATCH 03/17] watch_queue: sample: Display mount tree change notifications [ver #17]
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

This is run like:

	./watch_test

and watches "/" for changes to the mount topology and the attributes of
individual mount objects.  For example, running:

	# mount -t tmpfs none /mnt
	# mount -o remount,ro /mnt
	# mount -o remount,rw /mnt

produces:

	# ./watch_test
	read() = 16
	NOTIFY[000]: ty=000002 sy=00 i=02000010
	MOUNT 00000060 change=0[new_mount] aux=416
	read() = 16
	NOTIFY[000]: ty=000002 sy=04 i=02010010
	MOUNT 000001a0 change=4[setattr] aux=0
	read() = 16
	NOTIFY[000]: ty=000002 sy=04 i=02010010
	MOUNT 000001a0 change=4[setattr] aux=0
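
The "i=" word in the output above packs the record length, the type-specific
flags and the watch ID into one 32-bit value.  A minimal sketch of decoding
it follows.  The masks and the names decode_info() and struct decoded_info
are assumptions made for illustration: the bit positions are inferred from
the sample output and from the watch ID being shifted left by 24 when the
watch is set, not copied from the uapi header.

```c
#include <stdint.h>

/* Assumed layout, inferred from the sample output (not from the header). */
#define EX_INFO_LENGTH		0x0000007fu	/* record length in bytes */
#define EX_INFO_FLAGS		0x00ff0000u	/* type-specific flag bits */
#define EX_INFO_FLAGS_SHIFT	16
#define EX_INFO_ID_SHIFT	24		/* watch ID lives in the top byte */

/* Hypothetical helper type holding the three decoded fields. */
struct decoded_info {
	unsigned int length;	/* size of the whole record */
	unsigned int flags;	/* bit 0 would be "event in subtree" */
	unsigned int watch_id;	/* fifth argument given to watch_mount() */
};

/* Split an info word into its length, flags and watch ID. */
static struct decoded_info decode_info(uint32_t info)
{
	struct decoded_info d = {
		.length   = info & EX_INFO_LENGTH,
		.flags    = (info & EX_INFO_FLAGS) >> EX_INFO_FLAGS_SHIFT,
		.watch_id = info >> EX_INFO_ID_SHIFT,
	};

	return d;
}
```

Under these assumptions, i=02010010 decodes as a 16-byte record for watch ID
2 with the lowest flag bit set, which is consistent with the setattr events
on /mnt having happened below the watched "/".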

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/watch_queue/watch_test.c |   39 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/samples/watch_queue/watch_test.c b/samples/watch_queue/watch_test.c
index 0eaff5dc04c3..49d185150506 100644
--- a/samples/watch_queue/watch_test.c
+++ b/samples/watch_queue/watch_test.c
@@ -26,6 +26,9 @@
 #ifndef __NR_watch_devices
 #define __NR_watch_devices -1
 #endif
+#ifndef __NR_watch_mount
+#define __NR_watch_mount -1
+#endif
 
 #define BUF_SIZE 256
 
@@ -58,6 +61,27 @@ static void saw_key_change(struct watch_notification *n, size_t len)
 	       k->key_id, n->subtype, key_subtypes[n->subtype], k->aux);
 }
 
+static const char *mount_subtypes[256] = {
+	[NOTIFY_MOUNT_NEW_MOUNT]	= "new_mount",
+	[NOTIFY_MOUNT_UNMOUNT]		= "unmount",
+	[NOTIFY_MOUNT_EXPIRY]		= "expiry",
+	[NOTIFY_MOUNT_READONLY]		= "readonly",
+	[NOTIFY_MOUNT_SETATTR]		= "setattr",
+	[NOTIFY_MOUNT_MOVE_FROM]	= "move_from",
+	[NOTIFY_MOUNT_MOVE_TO]		= "move_to",
+};
+
+static void saw_mount_change(struct watch_notification *n, size_t len)
+{
+	struct mount_notification *m = (struct mount_notification *)n;
+
+	if (len != sizeof(struct mount_notification))
+		return;
+
+	printf("MOUNT %08x change=%u[%s] aux=%u\n",
+	       m->triggered_on, n->subtype, mount_subtypes[n->subtype], m->changed_mount);
+}
+
 /*
  * Consume and display events.
  */
@@ -134,6 +158,9 @@ static void consumer(int fd)
 			default:
 				printf("other type\n");
 				break;
+			case WATCH_TYPE_MOUNT_NOTIFY:
+				saw_mount_change(&n.n, len);
+				break;
 			}
 
 			p += len;
@@ -142,12 +169,17 @@ static void consumer(int fd)
 }
 
 static struct watch_notification_filter filter = {
-	.nr_filters	= 1,
+	.nr_filters	= 2,
 	.filters = {
 		[0]	= {
 			.type			= WATCH_TYPE_KEY_NOTIFY,
 			.subtype_filter[0]	= UINT_MAX,
 		},
+		[1] = {
+			.type			= WATCH_TYPE_MOUNT_NOTIFY,
+			// Reject move-from notifications
+			.subtype_filter[0]	= UINT_MAX & ~(1 << NOTIFY_MOUNT_MOVE_FROM),
+		},
 	},
 };
 
@@ -181,6 +213,11 @@ int main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (syscall(__NR_watch_mount, AT_FDCWD, "/", 0, fd, 0x02) == -1) {
+		perror("watch_mount");
+		exit(1);
+	}
+
 	consumer(fd);
 	exit(0);
 }




* [PATCH 04/17] watch_queue: Introduce a non-repeating system-unique superblock ID [ver #17]
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Introduce an (effectively) non-repeating system-unique superblock ID that
can be used to determine that two objects are in the same superblock without
risking reuse of the ID in the meantime (as is possible with device IDs).

The ID is time-based to make it harder to use it as a covert communications
channel.

In future patches, this ID will be used to tag superblock notification
messages.  It will also be made queryable.
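
As an illustration of the idea, here is a userspace model of a time-based
identifier that never repeats even when the clock returns the same value
twice.  It is a sketch only: struct id_gen and next_id() are hypothetical
names, the caller passes in the clock reading rather than calling
ktime_get(), locking is omitted, and the offset bookkeeping is one plausible
reading of the scheme rather than a copy of the kernel code.

```c
#include <stdint.h>

/* Userspace model only; the kernel version would hold a lock around this. */
struct id_gen {
	uint64_t last;		/* last identifier handed out */
	uint64_t offset;	/* correction added to the clock reading */
};

/* Return an identifier strictly greater than any returned before. */
static uint64_t next_id(struct id_gen *g, uint64_t now_ns)
{
	uint64_t id = now_ns + g->offset;

	if (id <= g->last) {
		/*
		 * The clock has not moved past the last ID: step forward by
		 * one and remember how far ahead of the clock we now are.
		 */
		id = g->last + 1;
		g->offset = id - now_ns;
	}

	g->last = id;
	return id;
}
```

Calling next_id() three times with the same clock reading of 1000 yields
1000, 1001 and 1002, and a later reading still produces a larger value
because the accumulated offset is carried forward.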

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/super.c         |   24 ++++++++++++++++++++++++
 include/linux/fs.h |    3 +++
 2 files changed, 27 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index cd352530eca9..a63073e6127e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -44,6 +44,8 @@ static int thaw_super_locked(struct super_block *sb);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
+static u64 sb_last_identifier;
+static u64 sb_identifier_offset;
 
 static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 	"sb_writers",
@@ -188,6 +190,27 @@ static void destroy_unused_super(struct super_block *s)
 	destroy_super_work(&s->destroy_work);
 }
 
+/*
+ * Generate a unique identifier for a superblock.
+ */
+static void generate_super_id(struct super_block *s)
+{
+	u64 now = ktime_to_ns(ktime_get()), id;
+
+	spin_lock(&sb_lock);
+
+	id = now + sb_identifier_offset;
+	if (id <= sb_last_identifier) {
+		id = sb_last_identifier + 1;
+		sb_identifier_offset = id - now;
+	}
+
+	sb_last_identifier = id;
+	spin_unlock(&sb_lock);
+
+	s->s_unique_id = id;
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -273,6 +296,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
 		goto fail;
+	generate_super_id(s);
 	return s;
 
 fail:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..9de6bfe41016 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1548,6 +1548,9 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+
+	/* Superblock event notifications */
+	u64			s_unique_id;
 } __randomize_layout;
 
 /* Helper functions so that in most cases filesystems will




* [PATCH 05/17] watch_queue: Add superblock notifications [ver #17]
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add a superblock event notification facility whereby notifications about
superblock events, such as I/O errors (EIO), quota limits being hit
(EDQUOT) and running out of space (ENOSPC) can be reported to a monitoring
process asynchronously.  Note that this does not cover vfsmount topology
changes.  watch_mount() is used for that.

Firstly, an event queue needs to be created:

	int fd[2];
	pipe2(fd, O_NOTIFICATION_PIPE);
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);

then a watch can be set up to report notifications via that queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_SB_NOTIFY,
				.subtype_filter[0] = UINT_MAX,
			},
		},
	};
	ioctl(fd[0], IOC_WATCH_QUEUE_SET_FILTER, &filter);
	watch_sb(AT_FDCWD, "/home/dhowells", 0, fd[0], 0x03);

In this case, it would let me monitor my own homedir for events.  After
setting the watch, records will be placed into the queue when, for example,
a superblock switches between read-write and read-only.  Records are of
the following format:

	struct superblock_notification {
		struct watch_notification watch;
		__u64	sb_id;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_SB_NOTIFY.

	n->watch.subtype will indicate the type of event, such as
	NOTIFY_SUPERBLOCK_READONLY.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the fifth argument to
	watch_sb(), shifted.

	n->watch.info & NOTIFY_SUPERBLOCK_IS_NOW_RO will be used for
	NOTIFY_SUPERBLOCK_READONLY, being set if the superblock becomes
	R/O, and being cleared otherwise.

	n->sb_id will be the ID of the superblock, as can be retrieved with
	the fsinfo() syscall, as part of the fsinfo_sb_notifications
	attribute in the watch_id field.

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.  Note also that
the queue can be shared between multiple notifications of various types.
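
Because records can differ in length and several watch types may share one
queue, a consumer has to step through the buffer using the length carried in
each record.  The following is a minimal sketch of that loop.  Everything
prefixed ex_ or EX_ is a local stand-in invented for illustration: the
length mask and the numeric value used for the superblock type are
assumptions, and the structures only mirror the layout described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_INFO_LENGTH	0x0000007fu	/* assumed length mask */
#define EX_TYPE_SB	3u		/* assumed value of the superblock type */

struct ex_header {			/* stand-in for the common header */
	uint32_t type:24;
	uint32_t subtype:8;
	uint32_t info;
};

struct ex_sb_record {			/* stand-in for the superblock record */
	struct ex_header watch;
	uint64_t sb_id;
};

/*
 * Count the superblock records in buf[0..size), skipping other types.
 * Returns -1 if a header is truncated or a record length is invalid.
 */
static int count_sb_records(const unsigned char *buf, size_t size)
{
	size_t pos = 0;
	int count = 0;

	while (pos < size) {
		struct ex_header h;
		size_t len;

		if (size - pos < sizeof(h))
			return -1;
		memcpy(&h, buf + pos, sizeof(h));	/* avoid unaligned reads */

		len = h.info & EX_INFO_LENGTH;
		if (len < sizeof(h) || len > size - pos)
			return -1;

		if (h.type == EX_TYPE_SB && len == sizeof(struct ex_sb_record))
			count++;

		pos += len;	/* advance by this record's own length */
	}

	return count;
}
```

A real consumer would dispatch on the type and subtype instead of counting,
but the bounds checks and the per-record advance would look much the same.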

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |   12 ++
 fs/super.c                                  |  132 +++++++++++++++++++++++++++
 include/linux/fs.h                          |   80 ++++++++++++++++
 include/linux/syscalls.h                    |    2 
 include/uapi/asm-generic/unistd.h           |    4 +
 include/uapi/linux/watch_queue.h            |   31 ++++++
 kernel/sys_ni.c                             |    3 +
 24 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index b869428033ef..7c0115af9010 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -478,3 +478,4 @@
 547	common	openat2				sys_openat2
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	watch_mount			sys_watch_mount
+550	common	watch_sb			sys_watch_sb
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 9c389da9efcc..f256f009a89f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 75f04a1023be..bc0f923e0e04 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		440
+#define __NR_compat_syscalls		441
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 6817f865cc71..a4dafc659647 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -359,3 +359,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index fbf85da75ecb..893fb4151547 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b05b192da1e2..54aaf0d40c64 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0f85d2a033f9..fd34dd0efed0 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -377,3 +377,4 @@
 437	n32	openat2				sys_openat2
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	watch_mount			sys_watch_mount
+440	n32	watch_sb			sys_watch_sb
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 905cf9ac0792..db0f4c0a0a0b 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -353,3 +353,4 @@
 437	n64	openat2				sys_openat2
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	watch_mount			sys_watch_mount
+440	n64	watch_sb			sys_watch_sb
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 834b26b08d74..ce2e1326de8f 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -426,3 +426,4 @@
 437	o32	openat2				sys_openat2
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	watch_mount			sys_watch_mount
+440	o32	watch_sb			sys_watch_sb
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index badd3449db43..6e4a7c08b64b 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index b404361bc929..08943f3b8206 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -520,3 +520,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 33071de24511..b3b8529d2b74 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 437  common	openat2			sys_openat2			sys_openat2
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount		sys_watch_mount			sys_watch_mount
+440	common	watch_sb		sys_watch_sb			sys_watch_sb
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 682c125122f4..89307a20657c 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index febf3cd675e3..4ff841a00450 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -484,3 +484,4 @@
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 085bcc5afdf1..e2731d295f88 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
 439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
+440	i386	watch_sb		sys_watch_sb			__ia32_sys_watch_sb
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9cfb6b2eb319..f4391176102c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
 439	common	watch_mount		__x64_sys_watch_mount
+440	common	watch_sb		__x64_sys_watch_sb
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 1a066a43a58b..8e7d731ed6cf 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -409,3 +409,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
+440	common	watch_sb			sys_watch_sb
diff --git a/fs/Kconfig b/fs/Kconfig
index d7039137d538..fef1365c23a5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -126,6 +126,18 @@ config MOUNT_NOTIFICATIONS
 	  device to handle the notification buffer and provides the
 	  mount_notify() system call to enable/disable watchpoints.
 
+config SB_NOTIFICATIONS
+	bool "Superblock event notifications"
+	select WATCH_QUEUE
+	help
+	  This option provides support for receiving superblock event
+	  notifications.  This makes use of a notification pipe to handle
+	  the notification buffer and provides the watch_sb() system call
+	  to enable/disable watches.
+
+	  Events can include things like changing between R/W and R/O, EIO
+	  generation, ENOSPC generation and EDQUOT generation.
+
 source "fs/quota/Kconfig"
 
 source "fs/autofs/Kconfig"
diff --git a/fs/super.c b/fs/super.c
index a63073e6127e..0d84cbbf3662 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -37,6 +37,8 @@
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
+#include <linux/syscalls.h>
+#include <linux/namei.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
@@ -354,6 +356,10 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
+#ifdef CONFIG_SB_NOTIFICATIONS
+		if (s->s_watchers)
+			remove_watch_list(s->s_watchers, s->s_unique_id);
+#endif
 		cleancache_invalidate_fs(s);
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
@@ -993,6 +999,8 @@ int reconfigure_super(struct fs_context *fc)
 	/* Needs to be ordered wrt mnt_is_readonly() */
 	smp_wmb();
 	sb->s_readonly_remount = 0;
+	notify_sb(sb, NOTIFY_SUPERBLOCK_READONLY,
+		  remount_ro ? NOTIFY_SUPERBLOCK_IS_NOW_RO : 0);
 
 	/*
 	 * Some filesystems modify their metadata via some other path than the
@@ -1891,3 +1899,127 @@ int thaw_super(struct super_block *sb)
 	return thaw_super_locked(sb);
 }
 EXPORT_SYMBOL(thaw_super);
+
+#ifdef CONFIG_SB_NOTIFICATIONS
+/*
+ * Post superblock notifications.
+ */
+void post_sb_notification(struct super_block *s, struct superblock_notification *n)
+{
+	post_watch_notification(s->s_watchers, &n->watch, current_cred(),
+				s->s_unique_id);
+}
+
+static void sb_release_watch(struct watch *watch)
+{
+	put_super(watch->private);
+}
+
+/**
+ * sys_watch_sb - Watch for superblock events.
+ * @dfd: Base directory to pathwalk from or fd referring to superblock.
+ * @filename: Path to superblock to place the watch upon
+ * @at_flags: Pathwalk control flags
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE5(watch_sb,
+		int, dfd,
+		const char __user *, filename,
+		unsigned int, at_flags,
+		int, watch_fd,
+		int, watch_id)
+{
+	struct watch_queue *wqueue;
+	struct super_block *s;
+	struct watch_list *wlist = NULL;
+	struct watch *watch = NULL;
+	struct path path;
+	unsigned int lookup_flags =
+		LOOKUP_DIRECTORY | LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	bool drop_s_count = false;
+	int ret;
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+	if ((at_flags & ~(AT_NO_AUTOMOUNT | AT_EMPTY_PATH)) != 0)
+		return -EINVAL;
+	if (at_flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (at_flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(dfd, filename, lookup_flags, &path);
+	if (ret)
+		return ret;
+
+	ret = inode_permission(path.dentry->d_inode, MAY_EXEC);
+	if (ret)
+		goto err_path;
+
+	wqueue = get_watch_queue(watch_fd);
+	ret = PTR_ERR_OR_ZERO(wqueue);
+	if (ret)
+		goto err_path;
+	s = path.dentry->d_sb;
+	if (watch_id >= 0) {
+		ret = -ENOMEM;
+		if (!READ_ONCE(s->s_watchers)) {
+			wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+			if (!wlist)
+				goto err_wqueue;
+			init_watch_list(wlist, sb_release_watch);
+		}
+
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wlist;
+
+		init_watch(watch, wqueue);
+		watch->id		= s->s_unique_id;
+		watch->private		= s;
+		watch->info_id		= (u32)watch_id << 24;
+
+		ret = security_watch_sb(watch, s);
+		if (ret < 0)
+			goto err_watch;
+
+		down_write(&s->s_umount);
+		ret = -EIO;
+		if (atomic_read(&s->s_active)) {
+			if (!s->s_watchers) {
+				s->s_watchers = wlist;
+				wlist = NULL;
+			}
+
+			spin_lock(&sb_lock);
+			s->s_count++;
+			spin_unlock(&sb_lock);
+			ret = add_watch_to_object(watch, s->s_watchers);
+			if (ret == 0)
+				watch = NULL; /* It worked */
+			else
+				drop_s_count = true;
+		}
+		up_write(&s->s_umount);
+		if (drop_s_count)
+			put_super(s);
+	} else {
+		down_write(&s->s_umount);
+		ret = !s->s_watchers ? -EBADSLT :
+			remove_watch_from_object(s->s_watchers, wqueue,
+						 s->s_unique_id, false);
+		up_write(&s->s_umount);
+	}
+
+err_watch:
+	kfree(watch);
+err_wlist:
+	kfree(wlist);
+err_wqueue:
+	put_watch_queue(wqueue);
+err_path:
+	path_put(&path);
+	return ret;
+}
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9de6bfe41016..d5128d112384 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/watch_queue.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1551,6 +1552,11 @@ struct super_block {
 
 	/* Superblock event notifications */
 	u64			s_unique_id;
+
+#ifdef CONFIG_SB_NOTIFICATIONS
+	struct watch_list	*s_watchers;
+#endif
+	atomic_t		s_notify_counter;
 } __randomize_layout;
 
 /* Helper functions so that in most cases filesystems will
@@ -3654,4 +3660,78 @@ static inline int inode_drain_writes(struct inode *inode)
 	return filemap_write_and_wait(inode->i_mapping);
 }
 
+extern void post_sb_notification(struct super_block *, struct superblock_notification *);
+
+/**
+ * notify_sb: Post simple superblock notification.
+ * @s: The superblock the notification is about.
+ * @subtype: The type of notification.
+ * @info: WATCH_INFO_FLAG_* flags to be set in the record.
+ */
+static inline void notify_sb(struct super_block *s,
+			     enum superblock_notification_type subtype,
+			     u32 info)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	atomic_inc(&s->s_notify_counter);
+	if (unlikely(READ_ONCE(s->s_watchers))) {
+		struct superblock_notification n = {
+			.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.watch.subtype	= subtype,
+			.watch.info	= watch_sizeof(n) | info,
+			.sb_id		= s->s_unique_id,
+		};
+
+		post_sb_notification(s, &n);
+	}
+#endif
+}
+
+/**
+ * notify_sb_error: Post superblock error notification.
+ * @s: The superblock the notification is about.
+ * @error: The error number to be recorded.
+ */
+static inline int notify_sb_error(struct super_block *s, int error)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	atomic_inc(&s->s_notify_counter);
+	if (unlikely(READ_ONCE(s->s_watchers))) {
+		struct superblock_error_notification n = {
+			.s.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.s.watch.subtype = NOTIFY_SUPERBLOCK_ERROR,
+			.s.watch.info	= watch_sizeof(n),
+			.s.sb_id	= s->s_unique_id,
+			.error_number	= error,
+			.error_cookie	= 0,
+		};
+
+		post_sb_notification(s, &n.s);
+	}
+#endif
+	return error;
+}
+
+/**
+ * notify_sb_EDQUOT: Post superblock quota overrun notification.
+ * @s: The superblock the notification is about.
+ */
+static inline int notify_sb_EDQUOT(struct super_block *s)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	atomic_inc(&s->s_notify_counter);
+	if (unlikely(READ_ONCE(s->s_watchers))) {
+		struct superblock_notification n = {
+			.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.watch.subtype	= NOTIFY_SUPERBLOCK_EDQUOT,
+			.watch.info	= watch_sizeof(n),
+			.sb_id		= s->s_unique_id,
+		};
+
+		post_sb_notification(s, &n);
+	}
+#endif
+	return -EDQUOT;
+}
+
 #endif /* _LINUX_FS_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1fd43af3b22d..c84440d57f52 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
 asmlinkage long sys_watch_mount(int dfd, const char __user *path,
 				unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_watch_sb(int dfd, const char __user *path,
+			     unsigned int at_flags, int watch_fd, int watch_id);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6b5748287883..5bff318b7ffa 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_watch_mount 439
 __SYSCALL(__NR_watch_mount, sys_watch_mount)
+#define __NR_watch_sb 440
+__SYSCALL(__NR_watch_sb, sys_watch_sb)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index b0f35cf51394..e9c37b1ae68d 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -15,7 +15,8 @@ enum watch_notification_type {
 	WATCH_TYPE_META		= 0,	/* Special record */
 	WATCH_TYPE_KEY_NOTIFY	= 1,	/* Key change event notification */
 	WATCH_TYPE_MOUNT_NOTIFY	= 2,	/* Mount topology change notification */
-	WATCH_TYPE___NR		= 3
+	WATCH_TYPE_SB_NOTIFY	= 3,	/* Superblock event notification */
+	WATCH_TYPE___NR		= 4
 };
 
 enum watch_meta_notification_subtype {
@@ -131,4 +132,32 @@ struct mount_notification {
 	__u32	changed_mount;		/* The mount that got changed */
 };
 
+/*
+ * Type of superblock notification.
+ */
+enum superblock_notification_type {
+	NOTIFY_SUPERBLOCK_READONLY	= 0, /* Filesystem toggled between R/O and R/W */
+	NOTIFY_SUPERBLOCK_ERROR		= 1, /* Error in filesystem or blockdev */
+	NOTIFY_SUPERBLOCK_EDQUOT	= 2, /* EDQUOT notification */
+	NOTIFY_SUPERBLOCK_NETWORK	= 3, /* Network status change */
+};
+
+#define NOTIFY_SUPERBLOCK_IS_NOW_RO	WATCH_INFO_FLAG_0 /* Superblock changed to R/O */
+
+/*
+ * Superblock notification record.
+ * - watch.type = WATCH_TYPE_SB_NOTIFY
+ * - watch.subtype = enum superblock_notification_type
+ */
+struct superblock_notification {
+	struct watch_notification watch; /* WATCH_TYPE_SB_NOTIFY */
+	__u64	sb_id;			/* 64-bit superblock ID */
+};
+
+struct superblock_error_notification {
+	struct superblock_notification s; /* subtype = NOTIFY_SUPERBLOCK_ERROR */
+	__u32	error_number;
+	__u32	error_cookie;
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3e1c5c9d2efe..0ce01f86e5db 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -119,6 +119,9 @@ COND_SYSCALL_COMPAT(signalfd4);
 
 /* fs/sync.c */
 
+/* fs/super.c */
+COND_SYSCALL(watch_sb);
+
 /* fs/timerfd.c */
 COND_SYSCALL(timerfd_create);
 COND_SYSCALL(timerfd_settime);




* [PATCH 06/17] watch_queue: sample: Display superblock notifications [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (4 preceding siblings ...)
  2020-02-21 18:02 ` [PATCH 05/17] watch_queue: Add superblock notifications " David Howells
@ 2020-02-21 18:02 ` " David Howells
  2020-02-21 18:02 ` [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

The sample program is run as:

	./watch_test

and it then watches "/mnt" for superblock notifications:

	# mount -t tmpfs none /mnt
	# ./watch_test &
	# mount -o remount,ro /mnt
	# mount -o remount,rw /mnt

producing:

	# ./watch_test
	NOTIFY[000]: ty=000003 sy=00 i=03010010
	SUPER 157eb57ca7 change=0[readonly]
	read() = 16
	NOTIFY[000]: ty=000002 sy=04 i=02010010
	MOUNT 000001a0 change=4[setattr] aux=0
	read() = 16
	NOTIFY[000]: ty=000002 sy=04 i=02010010
	MOUNT 000001a0 change=4[setattr] aux=0
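
The info word in the records above can be decoded by hand.  What follows is
only a sketch under assumptions: the record length is taken to sit in the
low 7 bits, the first type-specific flag (WATCH_INFO_FLAG_0, used as
NOTIFY_SUPERBLOCK_IS_NOW_RO) at bit 16, and the watch ID in the top byte,
the last of which matches the "watch_id << 24" in the watch_sb() patch.  The
EX_* macros and ex_* helpers are made-up names, not the uapi definitions,
and the accept rule for a filter entry is likewise an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of watch_notification::info in this series (illustrative). */
#define EX_INFO_LENGTH		0x0000007fu	/* record length in bytes */
#define EX_INFO_FLAG_0		0x00010000u	/* first type-specific flag */
#define EX_INFO_ID_SHIFT	24		/* watch ID in the top byte */

/* Length of the whole record, header included. */
static inline unsigned int ex_info_length(uint32_t info)
{
	return info & EX_INFO_LENGTH;
}

/* The watch_id that was passed when the watch was set. */
static inline unsigned int ex_info_watch_id(uint32_t info)
{
	return info >> EX_INFO_ID_SHIFT;
}

/* For a "readonly" superblock record: did the superblock become R/O? */
static inline bool ex_info_now_ro(uint32_t info)
{
	return (info & EX_INFO_FLAG_0) != 0;
}

/*
 * Model of one filter entry: the subtype's bit must be set in the first
 * subtype_filter word and the masked info bits must equal info_filter.
 */
static inline bool ex_filter_accepts(uint32_t subtype_filter0,
				     uint32_t info_mask, uint32_t info_filter,
				     unsigned int subtype, uint32_t info)
{
	if (subtype >= 32)
		return false;	/* only the first filter word is modelled */
	if (!(subtype_filter0 & (UINT32_C(1) << subtype)))
		return false;
	return (info & info_mask) == info_filter;
}
```

Under these assumptions, i=03010010 from the SUPER record reads as watch ID
0x03, the now-R/O flag set and a 16-byte record.  A remount back to R/W
would carry no flag and so would not pass the sample's third filter entry.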

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/watch_queue/watch_test.c |   39 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/samples/watch_queue/watch_test.c b/samples/watch_queue/watch_test.c
index 49d185150506..eea3bd8c6569 100644
--- a/samples/watch_queue/watch_test.c
+++ b/samples/watch_queue/watch_test.c
@@ -29,6 +29,9 @@
 #ifndef __NR_watch_mount
 #define __NR_watch_mount -1
 #endif
+#ifndef __NR_watch_sb
+#define __NR_watch_sb -1
+#endif
 
 #define BUF_SIZE 256
 
@@ -82,6 +85,24 @@ static void saw_mount_change(struct watch_notification *n, size_t len)
 	       m->triggered_on, n->subtype, mount_subtypes[n->subtype], m->changed_mount);
 }
 
+static const char *super_subtypes[256] = {
+	[NOTIFY_SUPERBLOCK_READONLY]	= "readonly",
+	[NOTIFY_SUPERBLOCK_ERROR]	= "error",
+	[NOTIFY_SUPERBLOCK_EDQUOT]	= "edquot",
+	[NOTIFY_SUPERBLOCK_NETWORK]	= "network",
+};
+
+static void saw_super_change(struct watch_notification *n, size_t len)
+{
+	struct superblock_notification *s = (struct superblock_notification *)n;
+
+	if (len < sizeof(struct superblock_notification))
+		return;
+
+	printf("SUPER %08llx change=%u[%s]\n",
+	       s->sb_id, n->subtype, super_subtypes[n->subtype]);
+}
+
 /*
  * Consume and display events.
  */
@@ -161,6 +182,9 @@ static void consumer(int fd)
 			case WATCH_TYPE_MOUNT_NOTIFY:
 				saw_mount_change(&n.n, len);
 				break;
+			case WATCH_TYPE_SB_NOTIFY:
+				saw_super_change(&n.n, len);
+				break;
 			}
 
 			p += len;
@@ -169,7 +193,7 @@ static void consumer(int fd)
 }
 
 static struct watch_notification_filter filter = {
-	.nr_filters	= 2,
+	.nr_filters	= 3,
 	.filters = {
 		[0]	= {
 			.type			= WATCH_TYPE_KEY_NOTIFY,
@@ -180,6 +204,14 @@ static struct watch_notification_filter filter = {
 			// Reject move-from notifications
 			.subtype_filter[0]	= UINT_MAX & ~(1 << NOTIFY_MOUNT_MOVE_FROM),
 		},
+		[2]	= {
+			.type			= WATCH_TYPE_SB_NOTIFY,
+			// Only accept notification of changes to R/O state
+			.subtype_filter[0]	= (1 << NOTIFY_SUPERBLOCK_READONLY),
+			// Only accept notifications of change-to-R/O
+			.info_mask		= WATCH_INFO_FLAG_0,
+			.info_filter		= WATCH_INFO_FLAG_0,
+		},
 	},
 };
 
@@ -218,6 +250,11 @@ int main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (syscall(__NR_watch_sb, AT_FDCWD, "/mnt", 0, fd, 0x03) == -1) {
+		perror("watch_sb");
+		exit(1);
+	}
+
 	consumer(fd);
 	exit(0);
 }




* [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (5 preceding siblings ...)
  2020-02-21 18:02 ` [PATCH 06/17] watch_queue: sample: Display " David Howells
@ 2020-02-21 18:02 ` " David Howells
  2020-02-26  2:29   ` Aleksa Sarai
  2020-02-28 14:44   ` David Howells
  2020-02-21 18:02 ` [PATCH 08/17] fsinfo: Provide a bitmap of supported features " David Howells
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add a system call to allow filesystem information to be queried.  A request
value can be given to indicate the desired attribute.  Support is provided
for enumerating multi-value attributes.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

	int ret = fsinfo(int dfd,
			 const char *filename,
			 const struct fsinfo_params *params,
			 void *buffer,
			 size_t buf_size);

The params parameter optionally points to a block of parameters:

	struct fsinfo_params {
		__u32	at_flags;
		__u32	flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
		__u64	__reserved[3];
	};

If params is NULL, the call behaves as though params->request were
FSINFO_ATTR_STATFS and params->Nth, params->Mth, params->at_flags and
params->flags were all 0.

If params is given, all of params->__reserved[] must be 0.
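
The two rules above can be modelled in userspace.  This is a sketch, not the
kernel's code: struct ex_fsinfo_params mirrors the layout quoted above with
<stdint.h> types, ex_effective_params() is a hypothetical helper, and the
numeric value of FSINFO_ATTR_STATFS is passed in by the caller because it is
not given in this message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the parameter block (illustrative only). */
struct ex_fsinfo_params {
	uint32_t at_flags;
	uint32_t flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
	uint64_t reserved[3];
};

/*
 * Work out the parameters a call would effectively use.
 * Returns false if a caller-supplied block has non-zero reserved words.
 */
static bool ex_effective_params(const struct ex_fsinfo_params *params,
				uint32_t statfs_request,
				struct ex_fsinfo_params *out)
{
	size_t i;

	memset(out, 0, sizeof(*out));
	if (!params) {
		/* NULL means: statfs-style request, everything else 0. */
		out->request = statfs_request;
		return true;
	}

	for (i = 0; i < 3; i++)
		if (params->reserved[i] != 0)
			return false;

	*out = *params;
	return true;
}
```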

dfd, filename and params->at_flags indicate the file to query.  There is no
equivalent of lstat() as that can be emulated with fsinfo() by setting
AT_SYMLINK_NOFOLLOW in params->at_flags.  There is also no equivalent of
fstat() as that can be emulated by passing a NULL filename to fsinfo() with
the fd of interest in dfd.  AT_NO_AUTOMOUNT can also be used to allow an
automount point to be queried without triggering it.

params->request indicates the attribute/attributes to be queried.  This can
be one of:

	FSINFO_ATTR_STATFS		- statfs-style info
	FSINFO_ATTR_IDS			- Filesystem IDs
	FSINFO_ATTR_LIMITS		- Filesystem limits
	FSINFO_ATTR_SUPPORTS		- What's supported in statx(), IOC flags
	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs

Some attributes (such as the servers backing a network filesystem) can have
multiple values.  These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
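
The enumeration loop might look like the sketch below.  It is written
against a hypothetical callback rather than the syscall itself: ex_query_fn
stands in for "issue fsinfo() with params->Nth = nth" and is assumed to
return the reply size or a negative errno.  Only Nth is walked here; an
attribute that also uses Mth would need a nested loop of the same shape.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical query hook: fetch value number nth, return size or -errno. */
typedef long (*ex_query_fn)(unsigned int nth, void *buf, size_t buf_size,
			    void *ctx);

/*
 * Walk Nth = 0, 1, ... until the query reports -ENODATA.
 * Returns the number of values seen, or the first other error.
 */
static long ex_enumerate(ex_query_fn query, void *buf, size_t buf_size,
			 void *ctx)
{
	unsigned int nth;

	for (nth = 0; ; nth++) {
		long ret = query(nth, buf, buf_size, ctx);

		if (ret == -ENODATA)
			return nth;	/* ran off the end of the list */
		if (ret < 0)
			return ret;	/* genuine failure */
	}
}

/* Stand-in query for testing: pretends *(unsigned int *)ctx values exist. */
static long ex_mock_query(unsigned int nth, void *buf, size_t buf_size,
			  void *ctx)
{
	unsigned int count = *(const unsigned int *)ctx;

	(void)buf;
	(void)buf_size;
	return nth < count ? 0 : -ENODATA;
}
```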

buffer and buf_size point to the reply buffer.  The buffer is filled up to
the specified size, even if this means truncating the reply.  The full size
of the reply is returned.  In future versions, this will allow extra fields
to be tacked on to the end of the reply, but anyone not expecting them will
only get the subset they're expecting.  If either buffer or buf_size is 0,
no copy will take place and the data size will be returned.
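
The copy-out rule can be modelled as follows.  This is a userspace sketch
of the behaviour described above, not the code in fs/fsinfo.c, and
ex_copy_reply() is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy as much of the reply as fits and report the full reply size, so a
 * caller can tell that truncation occurred and how big a buffer is needed.
 */
static size_t ex_copy_reply(void *buffer, size_t buf_size,
			    const void *reply, size_t reply_size)
{
	if (buffer && buf_size) {
		size_t n = reply_size < buf_size ? reply_size : buf_size;

		memcpy(buffer, reply, n);
	}
	/* With no buffer, or a zero size, only the size is reported. */
	return reply_size;
}
```

A caller that wants the whole reply can therefore probe with a NULL buffer
first, allocate the returned size and call again.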

At the moment, this will only work on x86_64 and i386 as it requires the
system call to be wired up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/include/asm/unistd.h             |    2 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 fs/Kconfig                                  |    7 
 fs/Makefile                                 |    1 
 fs/fsinfo.c                                 |  566 +++++++++++++++++++++++++
 include/linux/fs.h                          |    4 
 include/linux/fsinfo.h                      |   72 +++
 include/linux/syscalls.h                    |    4 
 include/uapi/asm-generic/unistd.h           |    4 
 include/uapi/linux/fsinfo.h                 |  187 ++++++++
 kernel/sys_ni.c                             |    1 
 samples/vfs/Makefile                        |    5 
 samples/vfs/test-fsinfo.c                   |  607 +++++++++++++++++++++++++++
 28 files changed, 1474 insertions(+), 2 deletions(-)
 create mode 100644 fs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/vfs/test-fsinfo.c

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 7c0115af9010..4d0b07dde12d 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	watch_mount			sys_watch_mount
 550	common	watch_sb			sys_watch_sb
+551	common	fsinfo				sys_fsinfo
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index f256f009a89f..fdda8382b420 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index bc0f923e0e04..388eeb71cff0 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		441
+#define __NR_compat_syscalls		442
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index a4dafc659647..2316e60e031a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 893fb4151547..efc2723ca91f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 54aaf0d40c64..745c0f462fce 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index fd34dd0efed0..499f83562a8c 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	watch_mount			sys_watch_mount
 440	n32	watch_sb			sys_watch_sb
+441	n32	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index db0f4c0a0a0b..b3188bc3ab3c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	watch_mount			sys_watch_mount
 440	n64	watch_sb			sys_watch_sb
+441	n64	fsinfo				sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ce2e1326de8f..1a3e8ed5e538 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	watch_mount			sys_watch_mount
 440	o32	watch_sb			sys_watch_sb
+441	o32	fsinfo				sys_fsinfo
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 6e4a7c08b64b..2572c215d861 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 08943f3b8206..39d7ac7e918c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -521,3 +521,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index b3b8529d2b74..ae4cefd3dd1b 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount		sys_watch_mount			sys_watch_mount
 440	common	watch_sb		sys_watch_sb			sys_watch_sb
+441  common	fsinfo			sys_fsinfo			sys_fsinfo
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 89307a20657c..05945b9aee4b 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4ff841a00450..b71b34d4b45c 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e2731d295f88..e118ba9aca4c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
 439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
 440	i386	watch_sb		sys_watch_sb			__ia32_sys_watch_sb
+441	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f4391176102c..067f247471d0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
 439	common	watch_mount		__x64_sys_watch_mount
 440	common	watch_sb		__x64_sys_watch_sb
+441	common	fsinfo			__x64_sys_fsinfo
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 8e7d731ed6cf..e1ec25099d10 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	watch_mount			sys_watch_mount
 440	common	watch_sb			sys_watch_sb
+441	common	fsinfo				sys_fsinfo
diff --git a/fs/Kconfig b/fs/Kconfig
index fef1365c23a5..01d0d436b3cd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
 	  Enable this to perform validation of the parameter description for a
 	  filesystem when it is registered.
 
+config FSINFO
+	bool "Enable the fsinfo() system call"
+	help
+	  Enable the file system information querying system call to allow
+	  comprehensive information to be retrieved about a filesystem,
+	  superblock or mount object.
+
 if BLOCK
 
 config FS_IOMAP
diff --git a/fs/Makefile b/fs/Makefile
index 4477757780d0..b6bf2424c7f7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
 obj-$(CONFIG_SYSCTL)		+= drop_caches.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_FSINFO)		+= fsinfo.o
 obj-y				+= iomap/
 
 obj-y				+= quota/
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
new file mode 100644
index 000000000000..5d3ba3c3a7ad
--- /dev/null
+++ b/fs/fsinfo.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/statfs.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/fsinfo.h>
+#include <uapi/linux/mount.h>
+#include "internal.h"
+
+/**
+ * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
+ * @s: The string to store (may be NULL)
+ * @ctx: The parameter context
+ */
+int fsinfo_string(const char *s, struct fsinfo_context *ctx)
+{
+	unsigned int len;
+	char *p = ctx->buffer;
+	int ret = 0;
+
+	if (s) {
+		len = min_t(size_t, strlen(s), ctx->buf_size - 1);
+		if (!ctx->want_size_only) {
+			memcpy(p, s, len);
+			p[len] = 0;
+		}
+		ret = len;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(fsinfo_string);
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_statfs *p = ctx->buffer;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0)
+		return ret;
+
+	p->f_blocks.lo	= buf.f_blocks;
+	p->f_bfree.lo	= buf.f_bfree;
+	p->f_bavail.lo	= buf.f_bavail;
+	p->f_files.lo	= buf.f_files;
+	p->f_ffree.lo	= buf.f_ffree;
+	p->f_favail.lo	= buf.f_ffree;
+	p->f_bsize	= buf.f_bsize;
+	p->f_frsize	= buf.f_frsize;
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_ids *p = ctx->buffer;
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = vfs_statfs(path, &buf);
+	if (ret < 0 && ret != -ENOSYS)
+		return ret;
+	if (ret == 0)
+		memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+
+	sb = path->dentry->d_sb;
+	p->f_fstype	= sb->s_magic;
+	p->f_dev_major	= MAJOR(sb->s_dev);
+	p->f_dev_minor	= MINOR(sb->s_dev);
+	p->f_sb_id	= sb->s_unique_id;
+	strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
+	return sizeof(*p);
+}
+
+int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_limits *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->max_file_size.hi	= 0;
+	p->max_file_size.lo	= sb->s_maxbytes;
+	p->max_ino.hi		= 0;
+	p->max_ino.lo		= UINT_MAX;
+	p->max_hard_links	= sb->s_max_links;
+	p->max_uid		= UINT_MAX;
+	p->max_gid		= UINT_MAX;
+	p->max_projid		= UINT_MAX;
+	p->max_filename_len	= NAME_MAX;
+	p->max_symlink_len	= PATH_MAX;
+	p->max_xattr_name_len	= XATTR_NAME_MAX;
+	p->max_xattr_body_len	= XATTR_SIZE_MAX;
+	p->max_dev_major	= 0xffffff;
+	p->max_dev_minor	= 0xff;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_limits);
+
+int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_supports *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->stx_mask = STATX_BASIC_STATS;
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_supports);
+
+static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
+	.atime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= S64_MIN,
+		.maximum	= S64_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_timestamp_info *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+	s8 exponent;
+
+	*p = fsinfo_default_timestamp_info;
+
+	if (sb->s_time_gran < 1000000000) {
+		if (sb->s_time_gran < 1000)
+			exponent = -9;
+		else if (sb->s_time_gran < 1000000)
+			exponent = -6;
+		else
+			exponent = -3;
+
+		p->atime.gran_exponent = exponent;
+		p->mtime.gran_exponent = exponent;
+		p->ctime.gran_exponent = exponent;
+		p->btime.gran_exponent = exponent;
+	}
+
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
+
+static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_volume_uuid *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	memcpy(p, &sb->s_uuid, sizeof(*p));
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
+}
+
+static const struct fsinfo_attribute fsinfo_common_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+
+	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
+	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+	{}
+};
+
+/*
+ * Determine an attribute's minimum buffer size and, if the buffer is large
+ * enough, get the attribute value.
+ */
+static int fsinfo_get_this_attribute(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	int buf_size;
+
+	if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
+		return -ENODATA;
+	if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
+		return -ENODATA;
+
+	switch (attr->type) {
+	case FSINFO_TYPE_VSTRUCT:
+		ctx->clear_tail = true;
+		buf_size = attr->size;
+		break;
+	case FSINFO_TYPE_STRING:
+	case FSINFO_TYPE_OPAQUE:
+	case FSINFO_TYPE_LIST:
+		buf_size = 4096;
+		break;
+	default:
+		return -ENOPKG;
+	}
+
+	if (ctx->buf_size < buf_size)
+		return buf_size;
+
+	return attr->get(path, ctx);
+}
+
+static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attr)
+{
+	__u32 *p = ctx->buffer;
+	unsigned int i;
+
+	if (ctx->usage >= ctx->buf_size ||
+	    ctx->buf_size - ctx->usage < sizeof(__u32)) {
+		ctx->usage += sizeof(__u32);
+		return;
+	}
+
+	for (i = 0; i < ctx->usage / sizeof(__u32); i++)
+		if (p[i] == attr->attr_id)
+			return;
+
+	p[i] = attr->attr_id;
+	ctx->usage += sizeof(__u32);
+}
+
+static int fsinfo_list_attributes(struct path *path,
+				  struct fsinfo_context *ctx,
+				  const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	for (a = attributes; a->get; a++)
+		fsinfo_attributes_insert(ctx, a);
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+static int fsinfo_get_attribute_info(struct path *path,
+				     struct fsinfo_context *ctx,
+				     const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+	struct fsinfo_attribute_info *p = ctx->buffer;
+
+	if (!ctx->buf_size)
+		return sizeof(*p);
+
+	for (a = attributes; a->get; a++) {
+		if (a->attr_id == ctx->Nth) {
+			p->attr_id	= a->attr_id;
+			p->type		= a->type;
+			p->flags	= a->flags;
+			p->size		= a->size;
+			p->size		= a->size;
+			return sizeof(*p);
+		}
+	}
+	return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+/**
+ * fsinfo_get_attribute - Look up and handle an attribute
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ * @attributes: List of attributes to search.
+ *
+ * Look through a list of attributes for one that matches the requested
+ * attribute then call the handler for it.
+ */
+int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
+			 const struct fsinfo_attribute *attributes)
+{
+	const struct fsinfo_attribute *a;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return fsinfo_get_attribute_info(path, ctx, attributes);
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return fsinfo_list_attributes(path, ctx, attributes);
+	default:
+		for (a = attributes; a->get; a++)
+			if (a->attr_id == ctx->requested_attr)
+				return fsinfo_get_this_attribute(path, ctx, a);
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL(fsinfo_get_attribute);
+
+/**
+ * fsinfo_call - Handle an fsinfo attribute generically
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ */
+static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
+{
+	int ret;
+
+	if (path->dentry->d_sb->s_op->fsinfo) {
+		ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
+		if (ret != -EOPNOTSUPP)
+			return ret;
+	}
+	ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
+	switch (ctx->requested_attr) {
+	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+		return -ENODATA;
+	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+		return ctx->usage;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * vfs_fsinfo - Retrieve filesystem information
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ *
+ * Get an attribute on a filesystem or an object within a filesystem.  The
+ * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
+ * if it's a multi-valued attribute, the particular value is selected by
+ * @ctx->Nth and then @ctx->Mth.
+ *
+ * For common attributes, a value may be fabricated if it is not supported by
+ * the filesystem.
+ *
+ * On success, the size of the attribute's value is returned (0 is a valid
+ * size).  A buffer will have been allocated and will be pointed to by
+ * @ctx->buffer.  The caller must free this with kvfree().
+ *
+ * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
+ * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
+ * doesn't exist for the specified object or -ENODATA if the attribute exists,
+ * but the Nth,Mth value does not exist.  -EMSGSIZE indicates that the value is
+ * unmanageable internally and -ENOPKG indicates other internal failure.
+ *
+ * Errors such as -EIO may also come from attempts to access media or servers
+ * to obtain the requested information if it's not immediately to hand.
+ *
+ * [*] Note that the caller may set @ctx->want_size_only if it only wants the
+ *     size of the value and not the data.  If this is set, a buffer may not be
+ *     allocated under some circumstances.  This is intended for size query by
+ *     userspace.
+ *
+ * [*] Note that @ctx->clear_tail will be returned set if the data should be
+ *     padded out with zeros when writing it to userspace.
+ */
+static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	struct dentry *dentry = path->dentry;
+	int ret;
+
+	ret = security_sb_statfs(dentry);
+	if (ret)
+		return ret;
+
+	/* Call the handler to find out the buffer size required. */
+	ctx->buf_size = 0;
+	ret = fsinfo_call(path, ctx);
+	if (ret < 0 || ctx->want_size_only)
+		return ret;
+	ctx->buf_size = ret;
+
+	do {
+		/* Allocate a buffer of the requested size. */
+		if (ctx->buf_size > INT_MAX)
+			return -EMSGSIZE;
+		ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
+		if (!ctx->buffer)
+			return -ENOMEM;
+
+		ctx->usage = 0;
+		ret = fsinfo_call(path, ctx);
+		if (IS_ERR_VALUE((long)ret))
+			return ret;
+		if ((unsigned int)ret <= ctx->buf_size)
+			return ret; /* It fitted */
+
+		/* We need to resize the buffer */
+		ctx->buf_size = roundup(ret, PAGE_SIZE);
+		kvfree(ctx->buffer);
+		ctx->buffer = NULL;
+	} while (!signal_pending(current));
+
+	return -ERESTARTSYS;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *pathname,
+			   unsigned int at_flags, struct fsinfo_context *ctx)
+{
+	struct path path;
+	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret = -EINVAL;
+
+	if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+			  AT_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	if (at_flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (at_flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (at_flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+	ret = user_path_at(dfd, pathname, lookup_flags, &path);
+	if (ret)
+		goto out;
+
+	ret = vfs_fsinfo(&path, ctx);
+	path_put(&path);
+	if (retry_estale(ret, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+out:
+	return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
+{
+	struct fd f = fdget_raw(fd);
+	int ret = -EBADF;
+
+	if (f.file) {
+		ret = vfs_fsinfo(&f.file->f_path, ctx);
+		fdput(f);
+	}
+	return ret;
+}
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @pathname: Filesystem to query or NULL.
+ * @params: Parameters to define request (or NULL for enhanced statfs).
+ * @user_buffer: Result buffer.
+ * @user_buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem.  The filesystem attribute to be queried is
+ * indicated by @params->request, and some of the attributes can have multiple
+ * values, indexed by @params->Nth and @params->Mth.  If @params is NULL,
+ * then the 0th FSINFO_ATTR_STATFS attribute is queried.  If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned.  If
+ * @user_buf_size is 0 and @user_buffer is NULL, only the size is returned.  If
+ * the size of the value is larger than @user_buf_size, it will be truncated by
+ * the copy.  If a struct value is smaller than @user_buf_size then the excess
+ * buffer space will be cleared.  The full size of the value will be
+ * returned, irrespective of how much data is actually placed in the buffer.
+ */
+SYSCALL_DEFINE5(fsinfo,
+		int, dfd, const char __user *, pathname,
+		struct fsinfo_params __user *, params,
+		void __user *, user_buffer, size_t, user_buf_size)
+{
+	struct fsinfo_context ctx;
+	struct fsinfo_params user_params;
+	unsigned int at_flags = 0, result_size;
+	int ret;
+
+	if (!user_buffer && user_buf_size)
+		return -EINVAL;
+	if (user_buffer && !user_buf_size)
+		return -EINVAL;
+	if (user_buf_size > UINT_MAX)
+		return -EOVERFLOW;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.requested_attr = FSINFO_ATTR_STATFS;
+	if (user_buf_size == 0)
+		ctx.want_size_only = true;
+
+	if (params) {
+		if (copy_from_user(&user_params, params, sizeof(user_params)))
+			return -EFAULT;
+		if (user_params.__reserved32[0] ||
+		    user_params.__reserved[0] ||
+		    user_params.__reserved[1] ||
+		    user_params.__reserved[2] ||
+		    user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
+			return -EINVAL;
+		at_flags = user_params.at_flags;
+		ctx.flags = user_params.flags;
+		ctx.requested_attr = user_params.request;
+		ctx.Nth = user_params.Nth;
+		ctx.Mth = user_params.Mth;
+	}
+
+	switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
+	case FSINFO_FLAGS_QUERY_PATH:
+		ret = vfs_fsinfo_path(dfd, pathname, at_flags, &ctx);
+		break;
+	case FSINFO_FLAGS_QUERY_FD:
+		if (pathname)
+			return -EINVAL;
+		ret = vfs_fsinfo_fd(dfd, &ctx);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (ret < 0)
+		goto error;
+
+	result_size = min_t(size_t, ret, user_buf_size);
+	if (result_size > 0 &&
+	    copy_to_user(user_buffer, ctx.buffer, result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+	/* Clear any part of the buffer that we won't fill if we're putting a
+	 * struct in there.  Strings, opaque objects and arrays are expected to
+	 * be variable length.
+	 */
+	if (ctx.clear_tail &&
+	    user_buf_size > result_size &&
+	    clear_user(user_buffer + result_size, user_buf_size - result_size) != 0) {
+		ret = -EFAULT;
+		goto error;
+	}
+
+error:
+	kvfree(ctx.buffer);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5128d112384..d2476c0fc978 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -69,6 +69,7 @@ struct fsverity_info;
 struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
+struct fsinfo_context;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1963,6 +1964,9 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+#ifdef CONFIG_FSINFO
+	int (*fsinfo)(struct path *, struct fsinfo_context *);
+#endif
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
 
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..943fbd6640f9
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Filesystem information query
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#ifdef CONFIG_FSINFO
+
+#include <uapi/linux/fsinfo.h>
+
+struct path;
+
+#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
+
+struct fsinfo_context {
+	__u32		flags;		/* [in] FSINFO_FLAGS_* */
+	__u32		requested_attr;	/* [in] What is being asked for */
+	__u32		Nth;		/* [in] Instance of it (some may have multiple) */
+	__u32		Mth;		/* [in] Subinstance */
+	bool		want_size_only;	/* [in] Just want to know the size, not the data */
+	bool		clear_tail;	/* [out] T if tail of buffer should be cleared */
+	unsigned int	usage;		/* [tmp] Amount of buffer used (if large) */
+	unsigned int	buf_size;	/* [tmp] Size of ->buffer[] */
+	void		*buffer;	/* [out] The reply buffer */
+};
+
+/*
+ * A filesystem information attribute definition.
+ */
+struct fsinfo_attribute {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type:8;		/* The type of the attribute's value(s) */
+	unsigned int		flags:8;
+	unsigned int		size:16;	/* - Value size (FSINFO_STRUCT/LIST) */
+	int (*get)(struct path *path, struct fsinfo_context *params);
+};
+
+#define __FSINFO(A, T, S, G, F) \
+	{ .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
+
+#define _FSINFO(A, T, S, G)	__FSINFO(A, T, S, G, 0)
+#define _FSINFO_N(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N)
+#define _FSINFO_NM(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
+
+#define _FSINFO_VSTRUCT(A,S,G)	  _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_N(A,S,G)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+
+extern int fsinfo_string(const char *, struct fsinfo_context *);
+extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
+extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
+				const struct fsinfo_attribute *);
+
+#endif /* CONFIG_FSINFO */
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c84440d57f52..936e2eb76c8f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -47,6 +47,7 @@ struct stat64;
 struct statfs;
 struct statfs64;
 struct statx;
+struct fsinfo_params;
 struct __sysctl_args;
 struct sysinfo;
 struct timespec;
@@ -1007,6 +1008,9 @@ asmlinkage long sys_watch_mount(int dfd, const char __user *path,
 				unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_watch_sb(int dfd, const char __user *path,
 			     unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
+			   struct fsinfo_params __user *params,
+			   void __user *buffer, size_t buf_size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bff318b7ffa..7d764f86d3f5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_watch_mount, sys_watch_mount)
 #define __NR_watch_sb 440
 __SYSCALL(__NR_watch_sb, sys_watch_sb)
+#define __NR_fsinfo 441
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
 
 #undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..6eb02de8a631
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+
+/*
+ * The filesystem attributes that can be requested.  Note that some attributes
+ * may have multiple instances which can be selected in the parameter block.
+ */
+#define FSINFO_ATTR_STATFS		0x00	/* statfs()-style state */
+#define FSINFO_ATTR_IDS			0x01	/* Filesystem IDs */
+#define FSINFO_ATTR_LIMITS		0x02	/* Filesystem limits */
+#define FSINFO_ATTR_SUPPORTS		0x03	/* What's supported in statx, iocflags, ... */
+#define FSINFO_ATTR_TIMESTAMP_INFO	0x04	/* Inode timestamp info */
+#define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
+#define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
+#define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that FSINFO_ATTR_STATFS instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
+	__u32	flags;		/* Flags controlling fsinfo() specifically */
+#define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
+#define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
+#define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
+	__u32	request;	/* ID of requested attribute */
+	__u32	Nth;		/* Instance of it (some may have multiple) */
+	__u32	Mth;		/* Subinstance of Nth instance */
+	__u32	__reserved32[1]; /* Reserved params; all must be 0 */
+	__u64	__reserved[3];
+};
+
+enum fsinfo_value_type {
+	FSINFO_TYPE_VSTRUCT	= 0,	/* Version-lengthed struct (up to 4096 bytes) */
+	FSINFO_TYPE_STRING	= 1,	/* NUL-term var-length string (up to 4095 chars) */
+	FSINFO_TYPE_OPAQUE	= 2,	/* Opaque blob (unlimited size) */
+	FSINFO_TYPE_LIST	= 3,	/* List of ints/structs (unlimited size) */
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
+ *
+ * This gives information about the attributes supported by fsinfo for the
+ * given path.
+ */
+struct fsinfo_attribute_info {
+	unsigned int		attr_id;	/* The ID of the attribute */
+	enum fsinfo_value_type	type;		/* The type of the attribute's value(s) */
+	unsigned int		flags;
+#define FSINFO_FLAGS_N		0x01		/* - Attr has a set of values */
+#define FSINFO_FLAGS_NM		0x02		/* - Attr has a set of sets of values */
+	unsigned int		size;		/* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
+};
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
+
+struct fsinfo_u128 {
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+	__u64	hi;
+	__u64	lo;
+#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
+	__u64	lo;
+	__u64	hi;
+#endif
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_STATFS).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
+	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
+	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
+	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
+	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
+	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */
+	__u64	f_bsize;		/* Optimal block size */
+	__u64	f_frsize;		/* Fragment size */
+};
+
+#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_IDS).
+ *
+ * List of basic identifiers as is normally found in statfs().
+ */
+struct fsinfo_ids {
+	char	f_fs_name[15 + 1];	/* Filesystem name */
+	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
+	__u64	f_sb_id;		/* Internal superblock ID for watch_sb()/watch_mount() */
+	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
+	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
+	__u32	f_dev_minor;
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+	struct fsinfo_u128 max_file_size;	/* Maximum file size */
+	struct fsinfo_u128 max_ino;		/* Maximum inode number */
+	__u64	max_uid;			/* Maximum UID supported */
+	__u64	max_gid;			/* Maximum GID supported */
+	__u64	max_projid;			/* Maximum project ID supported */
+	__u64	max_hard_links;			/* Maximum number of hard links on a file */
+	__u64	max_xattr_body_len;		/* Maximum xattr content length */
+	__u32	max_xattr_name_len;		/* Maximum xattr name length */
+	__u32	max_filename_len;		/* Maximum filename length */
+	__u32	max_symlink_len;		/* Maximum symlink content length */
+	__u32	max_dev_major;			/* Maximum device major representable */
+	__u32	max_dev_minor;			/* Maximum device minor representable */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
+	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
+	__u32	fs_ioc_getflags;	/* What FS_IOC_GETFLAGS may return */
+	__u32	fs_ioc_setflags_set;	/* What FS_IOC_SETFLAGS may set */
+	__u32	fs_ioc_setflags_clear;	/* What FS_IOC_SETFLAGS may clear */
+	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
+	__u32	__padding[1];
+};
+
+#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
+
+struct fsinfo_timestamp_one {
+	__s64	minimum;	/* Minimum timestamp value in seconds */
+	__s64	maximum;	/* Maximum timestamp value in seconds */
+	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
+	__s8	gran_exponent;
+	__u8	__padding[5];
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
+ */
+struct fsinfo_timestamp_info {
+	struct fsinfo_timestamp_one	atime;	/* Access time */
+	struct fsinfo_timestamp_one	mtime;	/* Modification time */
+	struct fsinfo_timestamp_one	ctime;	/* Change time */
+	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
+};
+
+#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
+ */
+struct fsinfo_volume_uuid {
+	__u8	uuid[16];
+};
+
+#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ce01f86e5db..519317f3904c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
 COND_SYSCALL(io_uring_register);
+COND_SYSCALL(fsinfo);
 
 /* fs/xattr.c */
 
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 65acdde5c117..9159ad1d7fc5 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,10 +1,15 @@
 # SPDX-License-Identifier: GPL-2.0-only
 # List of programs to build
+
 hostprogs := \
+	test-fsinfo \
 	test-fsmount \
 	test-statx
 
 always-y := $(hostprogs)
 
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLDLIBS_test-fsinfo += -static -lm
+
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
new file mode 100644
index 000000000000..22fe3c47ff42
--- /dev/null
+++ b/samples/vfs/test-fsinfo.c
@@ -0,0 +1,607 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static bool debug = false;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
+	       void *buffer, size_t buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
+}
+
+struct fsinfo_attribute {
+	unsigned int		attr_id;
+	enum fsinfo_value_type	type;
+	unsigned int		size;
+	const char		*name;
+	void (*dump)(void *reply, unsigned int size);
+};
+
+static const struct fsinfo_attribute fsinfo_attributes[];
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+	unsigned offset, print_offset = 1, col = 0;
+
+	from /= 4;
+	to = (to + 3) / 4;
+
+	for (offset = from; offset < to; offset++) {
+		if (print_offset) {
+			printf("%04x: ", offset * 4);
+			print_offset = 0;
+		}
+		printf("%08x", data[offset]);
+		col++;
+		if ((col & 3) == 0) {
+			printf("\n");
+			print_offset = 1;
+		} else {
+			printf(" ");
+		}
+	}
+
+	if (!print_offset)
+		printf("\n");
+}
+
+static void dump_attribute_info(void *reply, unsigned int size)
+{
+	struct fsinfo_attribute_info *attr_info = reply;
+	const struct fsinfo_attribute *attr;
+	char type[32], val_size[32];
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:	strcpy(type, "V-STRUCT");	break;
+	case FSINFO_TYPE_STRING:	strcpy(type, "STRING");		break;
+	case FSINFO_TYPE_OPAQUE:	strcpy(type, "OPAQUE");		break;
+	case FSINFO_TYPE_LIST:		strcpy(type, "LIST");		break;
+	default:
+		sprintf(type, "type-%x", attr_info->type);
+		break;
+	}
+
+	if (attr_info->flags & FSINFO_FLAGS_N)
+		strcat(type, " x N");
+	else if (attr_info->flags & FSINFO_FLAGS_NM)
+		strcat(type, " x NM");
+
+	for (attr = fsinfo_attributes; attr->name; attr++)
+		if (attr->attr_id == attr_info->attr_id)
+			break;
+
+	if (attr_info->size)
+		sprintf(val_size, "%u", attr_info->size);
+	else
+		strcpy(val_size, "-");
+
+	printf("%8x %-12s %08x %5s %s\n",
+	       attr_info->attr_id,
+	       type,
+	       attr_info->flags,
+	       val_size,
+	       attr->name ? attr->name : "");
+}
+
+static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
+{
+	struct fsinfo_statfs *f = reply;
+
+	printf("\n");
+	printf("\tblocks       : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_blocks.lo,
+	       (unsigned long long)f->f_bfree.lo,
+	       (unsigned long long)f->f_bavail.lo);
+
+	printf("\tfiles        : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_files.lo,
+	       (unsigned long long)f->f_ffree.lo,
+	       (unsigned long long)f->f_favail.lo);
+	printf("\tbsize        : %llu\n", f->f_bsize);
+	printf("\tfrsize       : %llu\n", f->f_frsize);
+}
+
+static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
+{
+	struct fsinfo_ids *f = reply;
+
+	printf("\n");
+	printf("\tdev          : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+	printf("\tfs           : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+	printf("\tfsid         : %llx\n", (unsigned long long)f->f_fsid);
+	printf("\tsbid         : %llx\n", (unsigned long long)f->f_sb_id);
+}
+
+static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
+{
+	struct fsinfo_limits *f = reply;
+
+	printf("\n");
+	printf("\tmax file size: %llx%016llx\n",
+	       (unsigned long long)f->max_file_size.hi,
+	       (unsigned long long)f->max_file_size.lo);
+	printf("\tmax ino      : %llx%016llx\n",
+	       (unsigned long long)f->max_ino.hi,
+	       (unsigned long long)f->max_ino.lo);
+	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
+	       (unsigned long long)f->max_uid,
+	       (unsigned long long)f->max_gid,
+	       (unsigned long long)f->max_projid);
+	printf("\tmax dev      : maj=%x min=%x\n",
+	       f->max_dev_major, f->max_dev_minor);
+	printf("\tmax links    : %llx\n",
+	       (unsigned long long)f->max_hard_links);
+	printf("\tmax xattr    : n=%x b=%llx\n",
+	       f->max_xattr_name_len,
+	       (unsigned long long)f->max_xattr_body_len);
+	printf("\tmax len      : file=%x sym=%x\n",
+	       f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
+{
+	struct fsinfo_supports *f = reply;
+
+	printf("\n");
+	printf("\tstx_attr     : %llx\n", (unsigned long long)f->stx_attributes);
+	printf("\tstx_mask     : %x\n", f->stx_mask);
+	printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
+	       f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
+	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
+}
+
+static void print_time(struct fsinfo_timestamp_one *t, char stamp)
+{
+	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
+	       stamp,
+	       t->gran_mantissa * pow(10., t->gran_exponent),
+	       (unsigned long long)t->minimum,
+	       (unsigned long long)t->maximum);
+}
+
+static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
+{
+	struct fsinfo_timestamp_info *f = reply;
+
+	printf("\n");
+	print_time(&f->atime, 'a');
+	print_time(&f->mtime, 'm');
+	print_time(&f->ctime, 'c');
+	print_time(&f->btime, 'b');
+}
+
+static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
+{
+	struct fsinfo_volume_uuid *f = reply;
+
+	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+	       "-%02x%02x%02x%02x%02x%02x\n",
+	       f->uuid[ 0], f->uuid[ 1],
+	       f->uuid[ 2], f->uuid[ 3],
+	       f->uuid[ 4], f->uuid[ 5],
+	       f->uuid[ 6], f->uuid[ 7],
+	       f->uuid[ 8], f->uuid[ 9],
+	       f->uuid[10], f->uuid[11],
+	       f->uuid[12], f->uuid[13],
+	       f->uuid[14], f->uuid[15]);
+}
+
+static void dump_string(void *reply, unsigned int size)
+{
+	char *s = reply, *p;
+
+	p = s;
+	if (size >= 4096) {
+		size = 4096;
+		p[4092] = '.';
+		p[4093] = '.';
+		p[4094] = '.';
+		p[4095] = 0;
+	} else {
+		p[size] = 0;
+	}
+
+	for (p = s; *p; p++) {
+		if (!isprint((unsigned char)*p)) {
+			printf("<non-printable>\n");
+			return;
+		}
+	}
+
+	printf("%s\n", s);
+}
+
+#define dump_fsinfo_generic_volume_id		dump_string
+#define dump_fsinfo_generic_volume_name		dump_string
+
+/*
+ *
+ */
+#define __FSINFO(A, T, S, U, G, F) \
+	{ .attr_id = A, .type = T, .size = S, .name = #G, .dump = dump_##G }
+
+#define _FSINFO(A, T, S, U, G)	  __FSINFO(A, T, S, U, G, 0)
+#define _FSINFO_N(A, T, S, U, G)  __FSINFO(A, T, S, U, G, FSINFO_FLAGS_N)
+#define _FSINFO_NM(A, T, S, U, G) __FSINFO(A, T, S, U, G, FSINFO_FLAGS_NM)
+
+#define _FSINFO_VSTRUCT(A,S,G)	 _FSINFO    (A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
+#define _FSINFO_VSTRUCT_N(A,S,G) _FSINFO_N  (A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
+#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
+
+#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
+#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, 0, G)
+#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, 0, G)
+#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, 0, G)
+#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, 0, G)
+#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, 0, sizeof(A##__STRUCT), G)
+#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, 0, sizeof(A##__STRUCT), G)
+
+static const struct fsinfo_attribute fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	fsinfo_generic_volume_name),
+	{}
+};
+
+static void dump_value(unsigned int attr_id,
+		       const struct fsinfo_attribute *attr,
+		       const struct fsinfo_attribute_info *attr_info,
+		       void *reply, unsigned int size)
+{
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	attr->dump(reply, size);
+}
+
+static void dump_list(unsigned int attr_id,
+		      const struct fsinfo_attribute *attr,
+		      const struct fsinfo_attribute_info *attr_info,
+		      void *reply, unsigned int size)
+{
+	size_t elem_size = attr_info->size;
+	unsigned int ix = 0;
+
+	printf("\n");
+	if (!attr || !attr->dump) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+		printf("<short data %u/%u>\n", size, attr->size);
+		return;
+	}
+
+	while (size >= elem_size) {
+		printf("\t[%02x] ", ix);
+		attr->dump(reply, size);
+		reply += elem_size;
+		size -= elem_size;
+		ix++;
+	}
+}
+
+/*
+ * Call fsinfo, expanding the buffer as necessary.
+ */
+static ssize_t get_fsinfo(const char *file, const char *name,
+			  struct fsinfo_params *params, void **_r)
+{
+	ssize_t ret;
+	size_t buf_size = 4096;
+	void *r;
+
+	for (;;) {
+		r = malloc(buf_size);
+		if (!r) {
+			perror("malloc");
+			exit(1);
+		}
+		memset(r, 0xbd, buf_size);
+
+		errno = 0;
+		ret = fsinfo(AT_FDCWD, file, params, r, buf_size);
+		if (ret == -1) {
+			free(r);
+			*_r = NULL;
+			return ret;
+		}
+
+		if (ret <= buf_size)
+			break;
+		free(r);	/* Too small: release it before retrying with more */
+		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
+	}
+
+	if (debug) {
+		if (ret == -1)
+			printf("fsinfo(%s,%s,%u,%u) = %m\n",
+			       file, name, params->Nth, params->Mth);
+		else
+			printf("fsinfo(%s,%s,%u,%u) = %zd\n",
+			       file, name, params->Nth, params->Mth, ret);
+	}
+
+	*_r = r;
+	return ret;
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params,
+		   const struct fsinfo_attribute_info *attr_info, bool raw)
+{
+	const struct fsinfo_attribute *attr;
+	const char *name;
+	size_t size = 4096;
+	char namebuf[32];
+	void *r;
+
+	for (attr = fsinfo_attributes; attr->name; attr++) {
+		if (attr->attr_id == params->request) {
+			name = attr->name;
+			if (strncmp(name, "fsinfo_generic_", 15) == 0)
+				name += 15;
+			goto found;
+		}
+	}
+
+	sprintf(namebuf, "<unknown-%x>", params->request);
+	name = namebuf;
+	attr = NULL;
+
+found:
+	size = get_fsinfo(file, name, params, &r);
+
+	if (size == -1) {
+		if (errno == ENODATA) {
+			if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
+			    params->Nth == 0 && params->Mth == 0) {
+				fprintf(stderr,
+					"Unexpected ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			free(r);
+			return (params->Mth == 0) ? 2 : 1;
+		}
+		if (errno == EOPNOTSUPP) {
+			if (params->Nth > 0 || params->Mth > 0) {
+				fprintf(stderr,
+					"Should return -ENODATA (0x%x{%u}{%u})\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			//printf("\e[33m%s\e[m: <not supported>\n",
+			//       fsinfo_attr_names[attr]);
+			free(r);
+			return 2;
+		}
+		perror(file);
+		exit(1);
+	}
+
+	if (raw) {
+		if (size > 4096)
+			size = 4096;
+		dump_hex(r, 0, size);
+		free(r);
+		return 0;
+	}
+
+	switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
+	case 0:
+		printf("\e[33m%s\e[m: ", name);
+		break;
+	case FSINFO_FLAGS_N:
+		printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
+		break;
+	case FSINFO_FLAGS_NM:
+		printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
+		break;
+	}
+
+	switch (attr_info->type) {
+	case FSINFO_TYPE_VSTRUCT:
+	case FSINFO_TYPE_STRING:
+		dump_value(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_LIST:
+		dump_list(params->request, attr, attr_info, r, size);
+		free(r);
+		return 0;
+
+	case FSINFO_TYPE_OPAQUE:
+		free(r);
+		return 0;
+
+	default:
+		fprintf(stderr, "Fishy about %u 0x%x,%x,%x\n",
+			params->request, attr_info->type, attr_info->flags, attr_info->size);
+		exit(1);
+	}
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
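+	unsigned int x = *(const unsigned int *)a;
+	unsigned int y = *(const unsigned int *)b;
+
+	/* Compare rather than subtract so that large IDs can't overflow */
+	return (x > y) - (x < y);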
+}
+
+/*
+ *
+ */
+int main(int argc, char **argv)
+{
+	struct fsinfo_attribute_info attr_info;
+	struct fsinfo_params params = {
+		.at_flags	= AT_SYMLINK_NOFOLLOW,
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+	};
+	unsigned int *attrs, ret, nr, i;
+	bool meta = false;
+	int raw = 0, opt, Nth, Mth;
+
+	while ((opt = getopt(argc, argv, "adlmr"))) {
+		switch (opt) {
+		case 'a':
+			params.at_flags |= AT_NO_AUTOMOUNT;
+			continue;
+		case 'd':
+			debug = true;
+			continue;
+		case 'l':
+			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+			continue;
+		case 'm':
+			meta = true;
+			continue;
+		case 'r':
+			raw = 1;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1) {
+		printf("Format: test-fsinfo [-alr] <file>\n");
+		exit(2);
+	}
+
+	/* Retrieve a list of supported attribute IDs */
+	params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
+	params.Nth = 0;
+	params.Mth = 0;
+	ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
+	if (ret == -1) {
+		fprintf(stderr, "Unable to get attribute list: %m\n");
+		exit(1);
+	}
+
+	if (ret % sizeof(attrs[0])) {
+		fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
+		exit(2);
+	}
+
+	nr = ret / sizeof(attrs[0]);
+	qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
+
+	if (meta) {
+		printf("ATTR ID  TYPE         FLAGS    SIZE  NAME\n");
+		printf("======== ============ ======== ===== =========\n");
+		for (i = 0; i < nr; i++) {
+			params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+			params.Nth = attrs[i];
+			params.Mth = 0;
+			ret = fsinfo(AT_FDCWD, argv[0], &params, &attr_info, sizeof(attr_info));
+			if (ret == -1) {
+				fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+				exit(1);
+			}
+
+			dump_attribute_info(&attr_info, ret);
+		}
+		exit(0);
+	}
+
+	for (i = 0; i < nr; i++) {
+		params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+		params.Nth = attrs[i];
+		params.Mth = 0;
+		ret = fsinfo(AT_FDCWD, argv[0], &params, &attr_info, sizeof(attr_info));
+		if (ret == -1) {
+			fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+			exit(1);
+		}
+
+		if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
+		    attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
+			continue;
+
+		if (attrs[i] != attr_info.attr_id) {
+			fprintf(stderr, "ID for %03x returned %03x\n",
+				attrs[i], attr_info.attr_id);
+			break;
+		}
+		Nth = 0;
+		do {
+			Mth = 0;
+			do {
+				params.request = attrs[i];
+				params.Nth = Nth;
+				params.Mth = Mth;
+
+				switch (try_one(argv[0], &params, &attr_info, raw)) {
+				case 0:
+					continue;
+				case 1:
+					goto done_M;
+				case 2:
+					goto done_N;
+				}
+			} while (++Mth < 100);
+
+		done_M:
+			if (Mth >= 100) {
+				fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
+				break;
+			}
+
+		} while (++Nth < 100);
+
+	done_N:
+		if (Nth >= 100) {
+			fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
+			break;
+		}
+	}
+
+	return 0;
+}
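
For reference, the buffer-growing retry in get_fsinfo() above can be looked
at in isolation.  The following is a minimal sketch and not part of the
patch: sketch_query() is a hypothetical stand-in for fsinfo() that is
assumed to return the full reply length even when the buffer is too small,
and every sketch_* name and constant is invented for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define SKETCH_CHUNK	4096UL	/* growth granule, as in get_fsinfo() */
#define SKETCH_REPLY	5000UL	/* hypothetical full reply length */

/* Round a length up to the next multiple of SKETCH_CHUNK. */
static size_t sketch_round_up(size_t len)
{
	return (len + SKETCH_CHUNK - 1) & ~(SKETCH_CHUNK - 1);
}

/* Hypothetical stand-in for fsinfo(): fill what fits, return the full length. */
static ssize_t sketch_query(void *buf, size_t buf_size)
{
	size_t n = buf_size < SKETCH_REPLY ? buf_size : SKETCH_REPLY;

	memset(buf, 'x', n);
	return SKETCH_REPLY;
}

/* Retry with a larger buffer until the whole reply fits. */
static ssize_t sketch_fetch(void **out)
{
	size_t buf_size = SKETCH_CHUNK;
	ssize_t ret;
	void *buf;

	for (;;) {
		buf = malloc(buf_size);
		if (!buf)
			return -1;
		ret = sketch_query(buf, buf_size);
		if ((size_t)ret <= buf_size)
			break;
		free(buf);	/* too small: release before growing */
		buf_size = sketch_round_up((size_t)ret);
	}
	*out = buf;
	return ret;
}
```

The caller owns the returned buffer and is expected to free() it.  Rounding
to a whole number of chunks means a reply only slightly larger than the
first buffer needs a single retry.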




* [PATCH 08/17] fsinfo: Provide a bitmap of supported features [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (6 preceding siblings ...)
  2020-02-21 18:02 ` [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
@ 2020-02-21 18:02 ` " David Howells
  2020-02-21 18:03 ` [PATCH 09/17] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Provide a bitmap of features that a filesystem may provide for the path
being queried.  Features include such things as:

 (1) The general class of filesystem, such as kernel-interface,
     block-based, flash-based, network-based.

 (2) Supported inode features, such as which timestamps are supported,
     whether simple numeric user, group or project IDs are supported and
     whether user identification is actually more complex behind the
     scenes.

 (3) Supported volume features, such as whether it has a UUID, a name or a
     filesystem ID.

 (4) Supported filesystem features, such as what types of file are
     supported and whether sparse files, extended attributes and quotas
     are supported.

 (5) Supported interface features, such as whether locking and leases are
     supported, what open flags are honoured and how i_version is managed.

For some filesystems, this may be an immutable set and can just be memcpy'd
into the reply buffer.
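
As an illustration of how userspace might test a bit in such a bitmap,
here is a minimal sketch.  It assumes the layout used by this patch (feature
N lives in byte N / 8 at bit N % 8, preceded by a count of valid bits); the
sketch_* names and the local struct are hypothetical mirrors and are not the
UAPI definitions.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SKETCH_FEAT_NR = 50 };	/* hypothetical number of known features */

/* Hypothetical local mirror of the reply layout. */
struct sketch_features {
	uint32_t nr_features;	/* number of valid bits reported */
	uint8_t  features[(SKETCH_FEAT_NR + 7) / 8];
};

/* Set feature bit N: byte N / 8, bit N % 8. */
static void sketch_set_feature(struct sketch_features *p, unsigned int feature)
{
	p->features[feature / 8] |= 1u << (feature % 8);
}

/* Test feature bit N, treating anything out of range as "not set". */
static bool sketch_test_feature(const struct sketch_features *p,
				unsigned int feature)
{
	if (feature >= p->nr_features || feature >= SKETCH_FEAT_NR)
		return false;
	return p->features[feature / 8] & (1u << (feature % 8));
}
```

Bounding the test by the reported count means a program built against a
longer feature list still gets a sensible answer when the reply carries
fewer bits than it knows about.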

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                 |   30 +++++++++++++++++++
 include/linux/fsinfo.h      |   38 ++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   67 ++++++++++++++++++++++++++++++++++++++++++
 samples/vfs/test-fsinfo.c   |   69 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 204 insertions(+)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 5d3ba3c3a7ad..f423a4c5afd9 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -121,6 +121,35 @@ int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
 }
 EXPORT_SYMBOL(fsinfo_generic_supports);
 
+int fsinfo_generic_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	fsinfo_init_features(p);
+	if (sb->s_mtd)
+		fsinfo_set_feature(p, FSINFO_FEAT_IS_FLASH_FS);
+	else if (sb->s_bdev)
+		fsinfo_set_feature(p, FSINFO_FEAT_IS_BLOCK_FS);
+
+	if (sb->s_quota_types & QTYPE_MASK_USR)
+		fsinfo_set_feature(p, FSINFO_FEAT_USER_QUOTAS);
+	if (sb->s_quota_types & QTYPE_MASK_GRP)
+		fsinfo_set_feature(p, FSINFO_FEAT_GROUP_QUOTAS);
+	if (sb->s_quota_types & QTYPE_MASK_PRJ)
+		fsinfo_set_feature(p, FSINFO_FEAT_PROJECT_QUOTAS);
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	if (sb->s_id[0])
+		fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_ID);
+
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_ATIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_CTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_features);
+
 static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
 	.atime = {
 		.minimum	= S64_MIN,
@@ -196,6 +225,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
index 943fbd6640f9..1b141e8e88e0 100644
--- a/include/linux/fsinfo.h
+++ b/include/linux/fsinfo.h
@@ -66,6 +66,44 @@ extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
 extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
 extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
 				const struct fsinfo_attribute *);
+extern int fsinfo_generic_features(struct path *, struct fsinfo_context *);
+
+static inline void fsinfo_init_features(struct fsinfo_features *p)
+{
+	p->nr_features = FSINFO_FEAT__NR;
+}
+
+static inline void fsinfo_set_feature(struct fsinfo_features *p,
+				      enum fsinfo_feature feature)
+{
+	p->features[feature / 8] |= 1 << (feature % 8);
+}
+
+static inline void fsinfo_clear_feature(struct fsinfo_features *p,
+					enum fsinfo_feature feature)
+{
+	p->features[feature / 8] &= ~(1 << (feature % 8));
+}
+
+/**
+ * fsinfo_set_unix_features - Set standard UNIX features.
+ * @p: The features mask to alter
+ */
+static inline void fsinfo_set_unix_features(struct fsinfo_features *p)
+{
+	fsinfo_set_feature(p, FSINFO_FEAT_UIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_GIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_DIRECTORIES);
+	fsinfo_set_feature(p, FSINFO_FEAT_SYMLINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_HARD_LINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_DEVICE_FILES);
+	fsinfo_set_feature(p, FSINFO_FEAT_UNIX_SPECIALS);
+	fsinfo_set_feature(p, FSINFO_FEAT_SPARSE);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_ATIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_CTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_INODE_NUMBERS);
+}
 
 #endif /* CONFIG_FSINFO */
 
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 6eb02de8a631..d7f24da36f0e 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -22,6 +22,7 @@
 #define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
 #define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
 #define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
+#define FSINFO_ATTR_FEATURES		0x08	/* Filesystem features (bits) */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
@@ -155,6 +156,72 @@ struct fsinfo_supports {
 
 #define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FEATURES).
+ *
+ * Bitmap of those filesystem features that can be rendered as single bits.
+ */
+enum fsinfo_feature {
+	FSINFO_FEAT_IS_KERNEL_FS	= 0,	/* fs is kernel-special filesystem */
+	FSINFO_FEAT_IS_BLOCK_FS		= 1,	/* fs is block-based filesystem */
+	FSINFO_FEAT_IS_FLASH_FS		= 2,	/* fs is flash filesystem */
+	FSINFO_FEAT_IS_NETWORK_FS	= 3,	/* fs is network filesystem */
+	FSINFO_FEAT_IS_AUTOMOUNTER_FS	= 4,	/* fs is automounter special filesystem */
+	FSINFO_FEAT_IS_MEMORY_FS	= 5,	/* fs is memory-based filesystem */
+	FSINFO_FEAT_AUTOMOUNTS		= 6,	/* fs supports automounts */
+	FSINFO_FEAT_ADV_LOCKS		= 7,	/* fs supports advisory file locking */
+	FSINFO_FEAT_MAND_LOCKS		= 8,	/* fs supports mandatory file locking */
+	FSINFO_FEAT_LEASES		= 9,	/* fs supports file leases */
+	FSINFO_FEAT_UIDS		= 10,	/* fs supports numeric uids */
+	FSINFO_FEAT_GIDS		= 11,	/* fs supports numeric gids */
+	FSINFO_FEAT_PROJIDS		= 12,	/* fs supports numeric project ids */
+	FSINFO_FEAT_STRING_USER_IDS	= 13,	/* fs supports string user identifiers */
+	FSINFO_FEAT_GUID_USER_IDS	= 14,	/* fs supports GUID user identifiers */
+	FSINFO_FEAT_WINDOWS_ATTRS	= 15,	/* fs has windows attributes */
+	FSINFO_FEAT_USER_QUOTAS		= 16,	/* fs has per-user quotas */
+	FSINFO_FEAT_GROUP_QUOTAS	= 17,	/* fs has per-group quotas */
+	FSINFO_FEAT_PROJECT_QUOTAS	= 18,	/* fs has per-project quotas */
+	FSINFO_FEAT_XATTRS		= 19,	/* fs has xattrs */
+	FSINFO_FEAT_JOURNAL		= 20,	/* fs has a journal */
+	FSINFO_FEAT_DATA_IS_JOURNALLED	= 21,	/* fs is using data journalling */
+	FSINFO_FEAT_O_SYNC		= 22,	/* fs supports O_SYNC */
+	FSINFO_FEAT_O_DIRECT		= 23,	/* fs supports O_DIRECT */
+	FSINFO_FEAT_VOLUME_ID		= 24,	/* fs has a volume ID */
+	FSINFO_FEAT_VOLUME_UUID		= 25,	/* fs has a volume UUID */
+	FSINFO_FEAT_VOLUME_NAME		= 26,	/* fs has a volume name */
+	FSINFO_FEAT_VOLUME_FSID		= 27,	/* fs has a volume FSID */
+	FSINFO_FEAT_IVER_ALL_CHANGE	= 28,	/* i_version represents data + meta changes */
+	FSINFO_FEAT_IVER_DATA_CHANGE	= 29,	/* i_version represents data changes only */
+	FSINFO_FEAT_IVER_MONO_INCR	= 30,	/* i_version incremented monotonically */
+	FSINFO_FEAT_DIRECTORIES		= 31,	/* fs supports (sub)directories */
+	FSINFO_FEAT_SYMLINKS		= 32,	/* fs supports symlinks */
+	FSINFO_FEAT_HARD_LINKS		= 33,	/* fs supports hard links */
+	FSINFO_FEAT_HARD_LINKS_1DIR	= 34,	/* fs supports hard links in same dir only */
+	FSINFO_FEAT_DEVICE_FILES	= 35,	/* fs supports bdev, cdev */
+	FSINFO_FEAT_UNIX_SPECIALS	= 36,	/* fs supports pipe, fifo, socket */
+	FSINFO_FEAT_RESOURCE_FORKS	= 37,	/* fs supports resource forks/streams */
+	FSINFO_FEAT_NAME_CASE_INDEP	= 38,	/* Filename case independence is mandatory */
+	FSINFO_FEAT_NAME_NON_UTF8	= 39,	/* fs has non-utf8 names */
+	FSINFO_FEAT_NAME_HAS_CODEPAGE	= 40,	/* fs has a filename codepage */
+	FSINFO_FEAT_SPARSE		= 41,	/* fs supports sparse files */
+	FSINFO_FEAT_NOT_PERSISTENT	= 42,	/* fs is not persistent */
+	FSINFO_FEAT_NO_UNIX_MODE	= 43,	/* fs does not support unix mode bits */
+	FSINFO_FEAT_HAS_ATIME		= 44,	/* fs supports access time */
+	FSINFO_FEAT_HAS_BTIME		= 45,	/* fs supports birth/creation time */
+	FSINFO_FEAT_HAS_CTIME		= 46,	/* fs supports change time */
+	FSINFO_FEAT_HAS_MTIME		= 47,	/* fs supports modification time */
+	FSINFO_FEAT_HAS_ACL		= 48,	/* fs supports ACLs of some sort */
+	FSINFO_FEAT_HAS_INODE_NUMBERS	= 49,	/* fs has inode numbers */
+	FSINFO_FEAT__NR
+};
+
+struct fsinfo_features {
+	__u32	nr_features;	/* Number of supported features (FSINFO_FEAT__NR) */
+	__u8	features[(FSINFO_FEAT__NR + 7) / 8];
+};
+
+#define FSINFO_ATTR_FEATURES__STRUCT struct fsinfo_features
+
 struct fsinfo_timestamp_one {
 	__s64	minimum;	/* Minimum timestamp value in seconds */
 	__s64	maximum;	/* Maximum timestamp value in seconds */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 22fe3c47ff42..7f49c2125ed3 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -178,6 +178,74 @@ static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
 	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
 }
 
+#define FSINFO_FEATURE_NAME(C) [FSINFO_FEAT_##C] = #C
+static const char *fsinfo_feature_names[FSINFO_FEAT__NR] = {
+	FSINFO_FEATURE_NAME(IS_KERNEL_FS),
+	FSINFO_FEATURE_NAME(IS_BLOCK_FS),
+	FSINFO_FEATURE_NAME(IS_FLASH_FS),
+	FSINFO_FEATURE_NAME(IS_NETWORK_FS),
+	FSINFO_FEATURE_NAME(IS_AUTOMOUNTER_FS),
+	FSINFO_FEATURE_NAME(IS_MEMORY_FS),
+	FSINFO_FEATURE_NAME(AUTOMOUNTS),
+	FSINFO_FEATURE_NAME(ADV_LOCKS),
+	FSINFO_FEATURE_NAME(MAND_LOCKS),
+	FSINFO_FEATURE_NAME(LEASES),
+	FSINFO_FEATURE_NAME(UIDS),
+	FSINFO_FEATURE_NAME(GIDS),
+	FSINFO_FEATURE_NAME(PROJIDS),
+	FSINFO_FEATURE_NAME(STRING_USER_IDS),
+	FSINFO_FEATURE_NAME(GUID_USER_IDS),
+	FSINFO_FEATURE_NAME(WINDOWS_ATTRS),
+	FSINFO_FEATURE_NAME(USER_QUOTAS),
+	FSINFO_FEATURE_NAME(GROUP_QUOTAS),
+	FSINFO_FEATURE_NAME(PROJECT_QUOTAS),
+	FSINFO_FEATURE_NAME(XATTRS),
+	FSINFO_FEATURE_NAME(JOURNAL),
+	FSINFO_FEATURE_NAME(DATA_IS_JOURNALLED),
+	FSINFO_FEATURE_NAME(O_SYNC),
+	FSINFO_FEATURE_NAME(O_DIRECT),
+	FSINFO_FEATURE_NAME(VOLUME_ID),
+	FSINFO_FEATURE_NAME(VOLUME_UUID),
+	FSINFO_FEATURE_NAME(VOLUME_NAME),
+	FSINFO_FEATURE_NAME(VOLUME_FSID),
+	FSINFO_FEATURE_NAME(IVER_ALL_CHANGE),
+	FSINFO_FEATURE_NAME(IVER_DATA_CHANGE),
+	FSINFO_FEATURE_NAME(IVER_MONO_INCR),
+	FSINFO_FEATURE_NAME(DIRECTORIES),
+	FSINFO_FEATURE_NAME(SYMLINKS),
+	FSINFO_FEATURE_NAME(HARD_LINKS),
+	FSINFO_FEATURE_NAME(HARD_LINKS_1DIR),
+	FSINFO_FEATURE_NAME(DEVICE_FILES),
+	FSINFO_FEATURE_NAME(UNIX_SPECIALS),
+	FSINFO_FEATURE_NAME(RESOURCE_FORKS),
+	FSINFO_FEATURE_NAME(NAME_CASE_INDEP),
+	FSINFO_FEATURE_NAME(NAME_NON_UTF8),
+	FSINFO_FEATURE_NAME(NAME_HAS_CODEPAGE),
+	FSINFO_FEATURE_NAME(SPARSE),
+	FSINFO_FEATURE_NAME(NOT_PERSISTENT),
+	FSINFO_FEATURE_NAME(NO_UNIX_MODE),
+	FSINFO_FEATURE_NAME(HAS_ATIME),
+	FSINFO_FEATURE_NAME(HAS_BTIME),
+	FSINFO_FEATURE_NAME(HAS_CTIME),
+	FSINFO_FEATURE_NAME(HAS_MTIME),
+	FSINFO_FEATURE_NAME(HAS_ACL),
+	FSINFO_FEATURE_NAME(HAS_INODE_NUMBERS),
+};
+
+static void dump_fsinfo_generic_features(void *reply, unsigned int size)
+{
+	struct fsinfo_features *f = reply;
+	int i;
+
+	printf("\n\t");
+	for (i = 0; i < sizeof(f->features); i++)
+		printf("%02x", f->features[i]);
+	printf(" (nr=%u)\n", f->nr_features);
+	for (i = 0; i < FSINFO_FEAT__NR; i++)
+		if (f->features[i / 8] & (1 << (i % 8)))
+			printf("\t- %s\n", fsinfo_feature_names[i]);
+}
+
 static void print_time(struct fsinfo_timestamp_one *t, char stamp)
 {
 	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
@@ -271,6 +339,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),




* [PATCH 09/17] fsinfo: Allow fsinfo() to look up a mount object by ID [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (7 preceding siblings ...)
  2020-02-21 18:02 ` [PATCH 08/17] fsinfo: Provide a bitmap of supported features " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 10/17] fsinfo: Allow mount information to be queried " David Howells
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Allow the fsinfo() syscall to look up a mount object by ID rather than by
pathname.  This is necessary as there can be multiple mounts stacked up at
the same pathname and there's no way to look through them otherwise.

This is done by passing FSINFO_FLAGS_QUERY_MOUNT to fsinfo() in the
parameters and then passing the mount ID as a string to fsinfo() in place
of the filename:

	struct fsinfo_params params = {
		.flags	 = FSINFO_FLAGS_QUERY_MOUNT,
		.request = FSINFO_ATTR_IDS,
	};

	ret = fsinfo(AT_FDCWD, "21", &params, buffer, sizeof(buffer));
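
In this patch the kernel side copies the string in and hands it to
kstrtoul() with base 0, then rejects anything above INT_MAX.  A rough
userspace analogue of that validation might look like the sketch below.
It is only an approximation: sketch_parse_mnt_id() is a hypothetical helper,
and strtoul() differs from kstrtoul() in details (for instance, kstrtoul()
tolerates a single trailing newline, which this sketch rejects).

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a mount ID string with automatic base
 * detection.  Returns 0 and stores the ID on success, -EINVAL otherwise.
 */
static int sketch_parse_mnt_id(const char *s, int *mnt_id)
{
	unsigned long v;
	char *end;

	/* strtoul() would accept leading spaces and signs; don't allow them. */
	if (!s || !isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 0);
	if (errno || *end != '\0' || v > INT_MAX)
		return -EINVAL;

	*mnt_id = (int)v;
	return 0;
}
```

With base 0, "21" and "0x15" both name mount 21, whilst "21x", "-1" and
an empty string are refused.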

The caller is only permitted to query a mount object if the root directory
of that mount connects directly to the current chroot (if dfd ==
AT_FDCWD[*]) or, otherwise, to the directory specified by dfd.  Note that
lookup by mount ID is not available to the pathwalk of any other syscall.

[*] This needs to be something other than AT_FDCWD, perhaps AT_FDROOT.

[!] This probably needs an LSM hook.

[!] This might want to check the permissions on all the intervening dirs -
    but it would have to do that under RCU conditions.

[!] This might want to check a CAP_* flag.
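
To make the connectivity rule concrete, the mount half of the check can be
modelled on a toy tree.  This is a minimal sketch under simplifying
assumptions: struct sketch_mount is a hypothetical stand-in with only a
parent pointer, a root mount is its own parent, and the dentry walk and all
locking done by the real code are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy mount: a root mount points to itself. */
struct sketch_mount {
	struct sketch_mount *parent;
};

/* Walk up from to_check; succeed only if we meet ancestor on the way. */
static bool sketch_is_connected(const struct sketch_mount *ancestor,
				const struct sketch_mount *to_check)
{
	const struct sketch_mount *cursor = to_check;

	while (cursor != ancestor) {
		if (cursor->parent == cursor)
			return false;	/* reached a root without a match */
		cursor = cursor->parent;
	}
	return true;
}
```

A mount in a different tree, or one that sits above the chosen root, fails
the walk and so would be refused.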

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                 |   53 +++++++++++++++++++
 fs/internal.h               |    2 +
 fs/namespace.c              |  117 ++++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/fsinfo.h |    1 
 samples/vfs/test-fsinfo.c   |   11 +++-
 5 files changed, 179 insertions(+), 5 deletions(-)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index f423a4c5afd9..9712d340dd7d 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -496,6 +496,56 @@ static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
 	return ret;
 }
 
+/*
+ * Look up the root of a mount object.  This allows access to mount objects
+ * (and their attached superblocks) that can't be retrieved by path because
+ * they're entirely covered.
+ *
+ * We only permit access to a mount that is directly connected to either the
+ * dentry pointed to by dfd or our chroot (if dfd is AT_FDCWD).
+ */
+static int vfs_fsinfo_mount(int dfd, const char __user *filename,
+			    struct fsinfo_context *ctx)
+{
+	struct path path;
+	struct fd f = {};
+	char *name;
+	unsigned long mnt_id;
+	int ret;
+
+	if (!filename)
+		return -EINVAL;
+
+	name = strndup_user(filename, 32);
+	if (IS_ERR(name))
+		return PTR_ERR(name);
+	ret = kstrtoul(name, 0, &mnt_id);
+	if (ret < 0)
+		goto out_name;
+	if (mnt_id > INT_MAX) {
+		ret = -EINVAL;
+		goto out_name;
+	}
+
+	if (dfd != AT_FDCWD) {
+		ret = -EBADF;
+		f = fdget_raw(dfd);
+		if (!f.file)
+			goto out_name;
+	}
+
+	ret = lookup_mount_object(f.file ? &f.file->f_path : NULL,
+				  mnt_id, &path);
+	if (ret < 0)
+		goto out_fd;
+
+	ret = vfs_fsinfo(&path, ctx);
+	path_put(&path);
+out_fd:
+	fdput(f);
+out_name:
+	kfree(name);
+	return ret;
+}
+
 /**
  * sys_fsinfo - System call to get filesystem information
  * @dfd: Base directory to pathwalk from or fd referring to filesystem.
@@ -565,6 +615,9 @@ SYSCALL_DEFINE5(fsinfo,
 			return -EINVAL;
 		ret = vfs_fsinfo_fd(dfd, &ctx);
 		break;
+	case FSINFO_FLAGS_QUERY_MOUNT:
+		ret = vfs_fsinfo_mount(dfd, pathname, &ctx);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..2ccd2b2eae88 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -91,6 +91,8 @@ extern int __mnt_want_write_file(struct file *);
 extern void __mnt_drop_write_file(struct file *);
 
 extern void dissolve_on_fput(struct vfsmount *);
+extern int lookup_mount_object(struct path *, int, struct path *);
+
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 668f797ae3bd..696fcc5010ca 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -63,7 +63,7 @@ static int __init set_mphash_entries(char *str)
 __setup("mphash_entries=", set_mphash_entries);
 
 static u64 event;
-static DEFINE_IDA(mnt_id_ida);
+static DEFINE_IDR(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
 
 static struct hlist_head *mount_hashtable __read_mostly;
@@ -104,17 +104,27 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
 
 static int mnt_alloc_id(struct mount *mnt)
 {
-	int res = ida_alloc(&mnt_id_ida, GFP_KERNEL);
+	int res;
 
+	/* Allocate an ID, but don't set the pointer back to the mount until
+	 * later, as once we do that, we have to follow RCU protocols to get
+	 * rid of the mount struct.
+	 */
+	res = idr_alloc(&mnt_id_ida, NULL, 0, INT_MAX, GFP_KERNEL);
 	if (res < 0)
 		return res;
 	mnt->mnt_id = res;
 	return 0;
 }
 
+static void mnt_publish_id(struct mount *mnt)
+{
+	idr_replace(&mnt_id_ida, mnt, mnt->mnt_id);
+}
+
 static void mnt_free_id(struct mount *mnt)
 {
-	ida_free(&mnt_id_ida, mnt->mnt_id);
+	idr_remove(&mnt_id_ida, mnt->mnt_id);
 }
 
 /*
@@ -958,6 +968,7 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
 	unlock_mount_hash();
+	mnt_publish_id(mnt);
 	return &mnt->mnt;
 }
 EXPORT_SYMBOL(vfs_create_mount);
@@ -1051,6 +1062,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
 	unlock_mount_hash();
+	mnt_publish_id(mnt);
 
 	if ((flag & CL_SLAVE) ||
 	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
@@ -3997,3 +4009,102 @@ const struct proc_ns_operations mntns_operations = {
 	.install	= mntns_install,
 	.owner		= mntns_owner,
 };
+
+/*
+ * See if one path point connects directly to another by ancestral relationship
+ * across mountpoints.  Must call with the RCU read lock held.
+ */
+static bool are_paths_connected(struct path *ancestor, struct path *to_check)
+{
+	struct mount *mnt, *parent;
+	struct path cursor;
+	unsigned seq;
+	bool connected;
+
+	seq = 0;
+restart:
+	cursor = *to_check;
+
+	read_seqbegin_or_lock(&rename_lock, &seq);
+	while (cursor.mnt != ancestor->mnt) {
+		mnt = real_mount(cursor.mnt);
+		parent = READ_ONCE(mnt->mnt_parent);
+		if (mnt == parent)
+			goto failed;
+		cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
+		cursor.mnt = &parent->mnt;
+	}
+
+	while (cursor.dentry != ancestor->dentry) {
+		if (cursor.dentry == cursor.mnt->mnt_root ||
+		    IS_ROOT(cursor.dentry))
+			goto failed;
+		cursor.dentry = READ_ONCE(cursor.dentry->d_parent);
+	}
+
+	connected = true;
+out:
+	done_seqretry(&rename_lock, seq);
+	return connected;
+
+failed:
+	if (need_seqretry(&rename_lock, seq)) {
+		seq = 1;
+		goto restart;
+	}
+	connected = false;
+	goto out;
+}
+
+/**
+ * lookup_mount_object - Look up a vfsmount object by ID
+ * @root: The mount root must connect backwards to this point (or chroot if NULL).
+ * @mnt_id: The ID of the mount.
+ * @_mntpt: Where to return the resulting mountpoint path.
+ *
+ * Look up the root of the mount with the corresponding ID.  This is only
+ * permitted if that mount connects directly to the specified root/chroot.
+ */
+int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
+{
+	struct mount *mnt;
+	struct path stop, mntpt = {};
+	int ret = -EPERM;
+
+	if (!root)
+		get_fs_root(current->fs, &stop);
+	else
+		stop = *root;
+
+	rcu_read_lock();
+	lock_mount_hash();
+	mnt = idr_find(&mnt_id_ida, mnt_id);
+	if (!mnt)
+		goto out_unlock_mh;
+	if (mnt->mnt.mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
+		goto out_unlock_mh;
+	if (mnt_get_count(mnt) == 0)
+		goto out_unlock_mh;
+	mnt_add_count(mnt, 1);
+	mntpt.mnt = &mnt->mnt;
+	mntpt.dentry = dget(mnt->mnt.mnt_root);
+	unlock_mount_hash();
+
+	if (are_paths_connected(&stop, &mntpt)) {
+		*_mntpt = mntpt;
+		mntpt.mnt = NULL;
+		mntpt.dentry = NULL;
+		ret = 0;
+	}
+
+out_unlock:
+	rcu_read_unlock();
+	if (!root)
+		path_put(&stop);
+	path_put(&mntpt);
+	return ret;
+
+out_unlock_mh:
+	unlock_mount_hash();
+	goto out_unlock;
+}
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index d7f24da36f0e..3ce7810d96b4 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -39,6 +39,7 @@ struct fsinfo_params {
 #define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
 #define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
 #define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
+#define FSINFO_FLAGS_QUERY_MOUNT 0x0002	/* - mount object (path=>mount_id, dirfd=>subtree) */
 	__u32	request;	/* ID of requested attribute */
 	__u32	Nth;		/* Instance of it (some may have multiple) */
 	__u32	Mth;		/* Subinstance of Nth instance */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 7f49c2125ed3..546bf4f530d0 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -555,16 +555,22 @@ int main(int argc, char **argv)
 	bool meta = false;
 	int raw = 0, opt, Nth, Mth;
 
-	while ((opt = getopt(argc, argv, "adlmr"))) {
+	while ((opt = getopt(argc, argv, "Madlmr"))) {
 		switch (opt) {
+		case 'M':
+			params.at_flags = 0;
+			params.flags = FSINFO_FLAGS_QUERY_MOUNT;
+			continue;
 		case 'a':
 			params.at_flags |= AT_NO_AUTOMOUNT;
+			params.flags |= FSINFO_FLAGS_QUERY_PATH;
 			continue;
 		case 'd':
 			debug = true;
 			continue;
 		case 'l':
 			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+			params.flags |= FSINFO_FLAGS_QUERY_PATH;
 			continue;
 		case 'm':
 			meta = true;
@@ -580,7 +586,8 @@ int main(int argc, char **argv)
 	argv += optind;
 
 	if (argc != 1) {
-		printf("Format: test-fsinfo [-alr] <file>\n");
+		printf("Format: test-fsinfo [-adlr] <file>\n");
+		printf("Format: test-fsinfo [-dr] -M <mnt_id>\n");
 		exit(2);
 	}
 




* [PATCH 10/17] fsinfo: Allow mount information to be queried [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (8 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 09/17] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-03-04 14:58   ` Miklos Szeredi
  2020-03-04 16:10   ` Miklos Szeredi
  2020-02-21 18:03 ` [PATCH 11/17] fsinfo: sample: Mount listing program " David Howells
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Allow mount information, including information about the topology tree, to
be queried with the fsinfo() system call.  Setting FSINFO_FLAGS_QUERY_MOUNT
allows overlapping mounts to be queried by indicating that the syscall
should interpret the pathname as a number indicating the mount ID.

To this end, four fsinfo() attributes are provided:

 (1) FSINFO_ATTR_MOUNT_INFO.

     This is a structure providing information about a mount, including:

	- Mounted superblock ID.
	- Mount ID (can be used with FSINFO_FLAGS_QUERY_MOUNT).
	- Parent mount ID.
	- Mount attributes (eg. R/O, NOEXEC).
	- A change counter.

     Note that the parent mount ID is overridden to the ID of the queried
     mount if the parent lies outside of the chroot or dfd tree.

 (2) FSINFO_ATTR_MOUNT_DEVNAME.

     This is a string providing the device name associated with the mount.

     Note that the device name may be a path that lies outside of the root.

 (3) FSINFO_ATTR_MOUNT_POINT.

     This is a string indicating the name of the mountpoint within the
     parent mount, limited to the parent's mounted root and the chroot.

 (4) FSINFO_ATTR_MOUNT_CHILDREN.

     This produces an array of structures, one for each child and capped
     with one for the argument mount (checked after listing all the
     children).  Each element contains the mount ID and the change counter
     of the respective mount object.
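
As a rough illustration of how userspace might consume item (4), the sketch
below splits an FSINFO_ATTR_MOUNT_CHILDREN reply into the child records and
the trailing record for the queried mount, then compares the trailing change
counter with a value sampled earlier.  The struct is a local mirror of
struct fsinfo_mount_child from this patch; split_children() and
mount_changed() are hypothetical helpers that are not part of the series,
and the buffer is assumed to have been filled in by fsinfo() already.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct fsinfo_mount_child as laid out in this patch. */
struct mount_child {
	uint32_t mnt_id;		/* Mount identifier */
	uint32_t change_counter;	/* Number of changes applied to mount */
};

/*
 * Hypothetical helper: split a reply of @size bytes into the child records
 * and the final record, which describes the queried mount itself.
 * Returns the number of children, or -1 if the reply is malformed.
 */
int split_children(const void *reply, size_t size,
		   const struct mount_child **_self)
{
	const struct mount_child *rec = reply;
	size_t n = size / sizeof(*rec);

	if (n == 0 || size % sizeof(*rec) != 0)
		return -1;
	*_self = &rec[n - 1];	/* Last element is the argument mount */
	return (int)(n - 1);
}

/* Hypothetical helper: has the queried mount changed since @old_counter? */
int mount_changed(const struct mount_child *self, uint32_t old_counter)
{
	return self->change_counter != old_counter;
}
```

If the trailing counter differs from the one seen before the children were
listed, the list may be stale and a caller would presumably want to fetch it
again.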

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/d_path.c                 |    2 
 fs/fsinfo.c                 |    5 +
 fs/internal.h               |   10 ++
 fs/namespace.c              |  179 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   34 ++++++++
 samples/vfs/test-fsinfo.c   |   27 ++++++
 6 files changed, 256 insertions(+), 1 deletion(-)

diff --git a/fs/d_path.c b/fs/d_path.c
index 0f1fc1743302..4c203f64e45e 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -229,7 +229,7 @@ static int prepend_unreachable(char **buffer, int *buflen)
 	return prepend(buffer, buflen, "(unreachable)", 13);
 }
 
-static void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
+void get_fs_root_rcu(struct fs_struct *fs, struct path *root)
 {
 	unsigned seq;
 
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 9712d340dd7d..e3377842a2c1 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -229,6 +229,11 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+
+	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_DEVNAME,	fsinfo_generic_mount_devname),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
+	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
 	{}
 };
 
diff --git a/fs/internal.h b/fs/internal.h
index 2ccd2b2eae88..6804cf54846d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -15,6 +15,7 @@ struct mount;
 struct shrink_control;
 struct fs_context;
 struct user_namespace;
+struct fsinfo_context;
 
 /*
  * block_dev.c
@@ -47,6 +48,11 @@ extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
  */
 extern void __init chrdev_init(void);
 
+/*
+ * d_path.c
+ */
+extern void get_fs_root_rcu(struct fs_struct *fs, struct path *root);
+
 /*
  * fs_context.c
  */
@@ -92,6 +98,10 @@ extern void __mnt_drop_write_file(struct file *);
 
 extern void dissolve_on_fput(struct vfsmount *);
 extern int lookup_mount_object(struct path *, int, struct path *);
+extern int fsinfo_generic_mount_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_devname(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_point(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_context *);
 
 /*
  * fs_struct.c
diff --git a/fs/namespace.c b/fs/namespace.c
index 696fcc5010ca..fc22aea18e2d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -30,6 +30,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
+#include <linux/fsinfo.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -4108,3 +4109,181 @@ int lookup_mount_object(struct path *root, int mnt_id, struct path *_mntpt)
 	unlock_mount_hash();
 	goto out_unlock;
 }
+
+#ifdef CONFIG_FSINFO
+/*
+ * Retrieve information about the nominated mount.
+ */
+int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_mount_info *p = ctx->buffer;
+	struct super_block *sb;
+	struct mount *m;
+	struct path root;
+	unsigned int flags;
+
+	if (!path->mnt)
+		return -ENODATA;
+
+	m = real_mount(path->mnt);
+	sb = m->mnt.mnt_sb;
+
+	p->f_sb_id		= sb->s_unique_id;
+	p->mnt_id		= m->mnt_id;
+	p->parent_id		= m->mnt_parent->mnt_id;
+	p->change_counter	= atomic_read(&m->mnt_change_counter);
+
+	get_fs_root(current->fs, &root);
+	if (path->mnt == root.mnt) {
+		p->parent_id = p->mnt_id;
+	} else {
+		rcu_read_lock();
+		if (!are_paths_connected(&root, path))
+			p->parent_id = p->mnt_id;
+		rcu_read_unlock();
+	}
+	if (IS_MNT_SHARED(m))
+		p->group_id = m->mnt_group_id;
+	if (IS_MNT_SLAVE(m)) {
+		int master = m->mnt_master->mnt_group_id;
+		int dom = get_dominating_id(m, &root);
+		p->master_id = master;
+		if (dom && dom != master)
+			p->from_id = dom;
+	}
+	path_put(&root);
+
+	flags = READ_ONCE(m->mnt.mnt_flags);
+	if (flags & MNT_READONLY)
+		p->attr |= MOUNT_ATTR_RDONLY;
+	if (flags & MNT_NOSUID)
+		p->attr |= MOUNT_ATTR_NOSUID;
+	if (flags & MNT_NODEV)
+		p->attr |= MOUNT_ATTR_NODEV;
+	if (flags & MNT_NOEXEC)
+		p->attr |= MOUNT_ATTR_NOEXEC;
+	if (flags & MNT_NODIRATIME)
+		p->attr |= MOUNT_ATTR_NODIRATIME;
+
+	if (flags & MNT_NOATIME)
+		p->attr |= MOUNT_ATTR_NOATIME;
+	else if (flags & MNT_RELATIME)
+		p->attr |= MOUNT_ATTR_RELATIME;
+	else
+		p->attr |= MOUNT_ATTR_STRICTATIME;
+	return sizeof(*p);
+}
+
+int fsinfo_generic_mount_devname(struct path *path, struct fsinfo_context *ctx)
+{
+	if (!path->mnt)
+		return -ENODATA;
+
+	return fsinfo_string(real_mount(path->mnt)->mnt_devname, ctx);
+}
+
+/*
+ * Return the path of this mount relative to its parent and clipped to
+ * the current chroot.
+ */
+int fsinfo_generic_mount_point(struct path *path, struct fsinfo_context *ctx)
+{
+	struct mountpoint *mp;
+	struct mount *m, *parent;
+	struct path mountpoint, root;
+	size_t len;
+	void *p;
+
+	if (!path->mnt)
+		return -ENODATA;
+
+	rcu_read_lock();
+
+	m = real_mount(path->mnt);
+	parent = m->mnt_parent;
+	if (parent == m)
+		goto skip;
+	mp = READ_ONCE(m->mnt_mp);
+	if (mp)
+		goto found;
+skip:
+	rcu_read_unlock();
+	return -ENODATA;
+
+found:
+	mountpoint.mnt = &parent->mnt;
+	mountpoint.dentry = READ_ONCE(mp->m_dentry);
+
+	get_fs_root_rcu(current->fs, &root);
+	if (path->mnt == root.mnt) {
+		rcu_read_unlock();
+		len = snprintf(ctx->buffer, ctx->buf_size, "/");
+	} else {
+		if (root.mnt != &parent->mnt) {
+			root.mnt = &parent->mnt;
+			root.dentry = parent->mnt.mnt_root;
+		}
+
+		p = __d_path(&mountpoint, &root, ctx->buffer, ctx->buf_size);
+		rcu_read_unlock();
+
+		if (IS_ERR(p))
+			return PTR_ERR(p);
+		if (!p)
+			return -EPERM;
+
+		len = (ctx->buffer + ctx->buf_size) - p;
+		memmove(ctx->buffer, p, len);
+	}
+	return len;
+}
+
+/*
+ * Store a mount record into the fsinfo buffer.
+ */
+static void store_mount_fsinfo(struct fsinfo_context *ctx,
+			       struct fsinfo_mount_child *child)
+{
+	unsigned int usage = ctx->usage;
+	unsigned int total = sizeof(*child);
+
+	if (ctx->usage >= INT_MAX)
+		return;
+	ctx->usage = usage + total;
+	if (ctx->buffer && ctx->usage <= ctx->buf_size)
+		memcpy(ctx->buffer + usage, child, total);
+}
+
+/*
+ * Return information about the submounts relative to path.
+ */
+int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_mount_child record;
+	struct mount *m, *child;
+
+	if (!path->mnt)
+		return -ENODATA;
+
+	m = real_mount(path->mnt);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
+		if (child->mnt_parent != m)
+			continue;
+		record.mnt_id = child->mnt_id;
+		record.change_counter = atomic_read(&child->mnt_change_counter);
+		store_mount_fsinfo(ctx, &record);
+	}
+	rcu_read_unlock();
+
+	/* End the list with a copy of the parameter mount's details so that
+	 * userspace can quickly check for changes.
+	 */
+	record.mnt_id = m->mnt_id;
+	record.change_counter = atomic_read(&m->mnt_change_counter);
+	store_mount_fsinfo(ctx, &record);
+	return ctx->usage;
+}
+
+#endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 3ce7810d96b4..29940d110ce3 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -27,6 +27,11 @@
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
 
+#define FSINFO_ATTR_MOUNT_INFO		0x200	/* Mount object information */
+#define FSINFO_ATTR_MOUNT_DEVNAME	0x201	/* Mount object device name (string) */
+#define FSINFO_ATTR_MOUNT_POINT		0x202	/* Relative path of mount in parent (string) */
+#define FSINFO_ATTR_MOUNT_CHILDREN	0x203	/* Children of this mount (list) */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -69,6 +74,7 @@ struct fsinfo_attribute_info {
 	unsigned int		size;		/* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
 };
 
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
 
@@ -82,6 +88,34 @@ struct fsinfo_u128 {
 #endif
 };
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_MOUNT_INFO).
+ */
+struct fsinfo_mount_info {
+	__u64		f_sb_id;	/* Superblock ID */
+	__u32		mnt_id;		/* Mount identifier (use with FSINFO_FLAGS_QUERY_MOUNT) */
+	__u32		parent_id;	/* Parent mount identifier */
+	__u32		group_id;	/* Mount group ID */
+	__u32		master_id;	/* Slave master group ID */
+	__u32		from_id;	/* Slave propagated from ID */
+	__u32		attr;		/* MOUNT_ATTR_* flags */
+	__u32		change_counter;	/* Number of changes applied. */
+	__u32		__reserved[1];
+};
+
+#define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
+
+/*
+ * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
+ * - An extra element is placed on the end representing the parent mount.
+ */
+struct fsinfo_mount_child {
+	__u32		mnt_id;		/* Mount identifier (use with FSINFO_FLAGS_QUERY_MOUNT) */
+	__u32		change_counter;	/* Number of changes applied to mount. */
+};
+
+#define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
+
 /*
  * Information struct for fsinfo(FSINFO_ATTR_STATFS).
  * - This gives extended filesystem information.
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 546bf4f530d0..f761ded6a52c 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -282,6 +282,26 @@ static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
 	       f->uuid[14], f->uuid[15]);
 }
 
+static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
+{
+	struct fsinfo_mount_info *f = reply;
+
+	printf("\n");
+	printf("\tsb_id   : %llx\n", (unsigned long long)f->f_sb_id);
+	printf("\tmnt_id  : %x\n", f->mnt_id);
+	printf("\tparent  : %x\n", f->parent_id);
+	printf("\tgroup   : %x\n", f->group_id);
+	printf("\tattr    : %x\n", f->attr);
+	printf("\tchanges : %x\n", f->change_counter);
+}
+
+static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
+{
+	struct fsinfo_mount_child *f = reply;
+
+	printf("%8x %8x\n", f->mnt_id, f->change_counter);
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -309,6 +329,8 @@ static void dump_string(void *reply, unsigned int size)
 
 #define dump_fsinfo_generic_volume_id		dump_string
 #define dump_fsinfo_generic_volume_name		dump_string
+#define dump_fsinfo_generic_mount_devname	dump_string
+#define dump_fsinfo_generic_mount_point		dump_string
 
 /*
  *
@@ -344,6 +366,11 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	fsinfo_generic_volume_name),
+
+	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
+	FSINFO_STRING	(FSINFO_ATTR_MOUNT_DEVNAME,	fsinfo_generic_mount_devname),
+	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_child),
+	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
 	{}
 };
 




* [PATCH 11/17] fsinfo: sample: Mount listing program [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (9 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 10/17] fsinfo: Allow mount information to be queried " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 12/17] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Implement a program to demonstrate mount listing using the new fsinfo()
syscall, for example:

# ./test-mntinfo -M 21
MOUNT                                 MOUNT ID   CHANGE#    TYPE & DEVICE
------------------------------------- ---------- ---------- ---------------
21                                            21          8 sysfs 0:15
 \_ kernel/security                           24          0 securityfs 0:8
 \_ fs/cgroup                                 28         16 tmpfs 0:19
 |   \_ unified                               29          0 cgroup2 0:1a
 |   \_ systemd                               30          0 cgroup 0:1b
 |   \_ freezer                               34          0 cgroup 0:1f
 |   \_ cpu,cpuacct                           35          0 cgroup 0:20
 |   \_ devices                               36          0 cgroup 0:21
 |   \_ memory                                37          0 cgroup 0:22
 |   \_ cpuset                                38          0 cgroup 0:23
 |   \_ net_cls,net_prio                      39          0 cgroup 0:24
 |   \_ hugetlb                               40          0 cgroup 0:25
 |   \_ rdma                                  41          0 cgroup 0:26
 |   \_ blkio                                 42          0 cgroup 0:27
 |   \_ perf_event                            43          0 cgroup 0:28
 \_ fs/pstore                                 31          0 pstore 0:1c
 \_ firmware/efi/efivars                      32          0 efivarfs 0:1d
 \_ fs/bpf                                    33          0 bpf 0:1e
 \_ kernel/config                             92          0 configfs 0:10
 \_ fs/selinux                                44          0 selinuxfs 0:12
 \_ kernel/debug                              48          0 debugfs 0:7
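
The listing program has to fetch variable-length replies (the child list and
the mount point strings) without knowing their size in advance.  The sketch
below shows one way to structure that grow-and-retry loop.  It assumes, as
the sample does, that the query returns the full reply size even when the
buffer is too small.  query_fn, query_alloc() and fake_query() are
hypothetical names used only for illustration; fake_query() stands in for
fsinfo() so that the loop can be exercised without the syscall.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for a call such as fsinfo(): fills up to @buf_size
 * bytes of @buf and returns the full reply size, or -1 on error.
 */
typedef ssize_t (*query_fn)(void *buf, size_t buf_size, void *arg);

/*
 * Call @query with a growing buffer until the reply fits.  Returns a
 * malloc'd buffer and sets *_size, or returns NULL on failure.
 */
void *query_alloc(query_fn query, void *arg, size_t *_size)
{
	size_t buf_size = 4096;
	void *buf = NULL;

	for (;;) {
		void *tmp = realloc(buf, buf_size);
		ssize_t ret;

		if (!tmp) {
			free(buf);
			return NULL;
		}
		buf = tmp;

		ret = query(buf, buf_size, arg);
		if (ret < 0) {
			free(buf);
			return NULL;
		}
		if ((size_t)ret <= buf_size) {
			*_size = (size_t)ret;
			return buf;
		}
		/* Too small: round the required size up to a 4096 multiple. */
		buf_size = ((size_t)ret + 4096 - 1) & ~(size_t)(4096 - 1);
	}
}

/*
 * Fake query for demonstration: pretends the reply is *(size_t *)arg bytes
 * of 'x' and copies as much as fits.
 */
ssize_t fake_query(void *buf, size_t buf_size, void *arg)
{
	size_t want = *(size_t *)arg;

	memset(buf, 'x', want < buf_size ? want : buf_size);
	return (ssize_t)want;
}
```

Using realloc() rather than a fresh malloc() on each pass means the previous,
undersized buffer is not leaked when the loop retries.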

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/vfs/Makefile       |    2 
 samples/vfs/test-mntinfo.c |  243 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 245 insertions(+)
 create mode 100644 samples/vfs/test-mntinfo.c

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 9159ad1d7fc5..19be60ab950e 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,12 +4,14 @@
 hostprogs := \
 	test-fsinfo \
 	test-fsmount \
+	test-mntinfo \
 	test-statx
 
 always-y := $(hostprogs)
 
 HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
 HOSTLDLIBS_test-fsinfo += -static -lm
+HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
 
 HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-mntinfo.c b/samples/vfs/test-mntinfo.c
new file mode 100644
index 000000000000..f4d90d0671c5
--- /dev/null
+++ b/samples/vfs/test-mntinfo.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <linux/fcntl.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
+	       void *buffer, size_t buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
+}
+
+static char tree_buf[4096];
+static char bar_buf[4096];
+
+/*
+ * Get an fsinfo attribute in a statically allocated buffer.
+ */
+static void get_attr(unsigned int mnt_id, unsigned int attr,
+		     void *buf, size_t buf_size)
+{
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
+		.request	= attr,
+	};
+	char file[32];
+	long ret;
+
+	sprintf(file, "%u", mnt_id);
+
+	memset(buf, 0xbd, buf_size);
+
+	ret = fsinfo(AT_FDCWD, file, &params, buf, buf_size);
+	if (ret == -1) {
+		fprintf(stderr, "mount-%s: %m\n", file);
+		exit(1);
+	}
+}
+
+/*
+ * Get an fsinfo attribute in a dynamically allocated buffer.
+ */
+static void *get_attr_alloc(unsigned int mnt_id, unsigned int attr,
+			    unsigned int Nth, size_t *_size)
+{
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
+		.request	= attr,
+		.Nth		= Nth,
+	};
+	size_t buf_size = 4096;
+	char file[32];
+	void *r = NULL;
+	long ret;
+
+	sprintf(file, "%u", mnt_id);
+
+	for (;;) {
+		r = realloc(r, buf_size);
+		if (!r) {
+			perror("malloc");
+			exit(1);
+		}
+		memset(r, 0xbd, buf_size);
+
+		ret = fsinfo(AT_FDCWD, file, &params, r, buf_size);
+		if (ret == -1) {
+			fprintf(stderr, "mount-%s: %x,%x,%x %m\n",
+				file, params.request, params.Nth, params.Mth);
+			exit(1);
+		}
+
+		if (ret <= buf_size) {
+			*_size = ret;
+			break;
+		}
+		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
+	}
+
+	return r;
+}
+
+/*
+ * Display a mount and then recurse through its children.
+ */
+static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
+{
+	struct fsinfo_mount_child *children;
+	struct fsinfo_mount_info info;
+	struct fsinfo_ids ids;
+	unsigned int d;
+	size_t ch_size, p_size;
+	char dev[64];
+	int i, n, s;
+
+	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, &info, sizeof(info));
+	get_attr(mnt_id, FSINFO_ATTR_IDS, &ids, sizeof(ids));
+	if (depth > 0)
+		printf("%s", tree_buf);
+
+	s = strlen(path);
+	printf("%s", !s ? "\"\"" : path);
+	if (!s)
+		s += 2;
+	s += depth;
+	if (s < 38)
+		s = 38 - s;
+	else
+		s = 1;
+	printf("%*.*s", s, s, "");
+
+	sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
+	printf("%10u %8x %2x %5s %s",
+	       info.mnt_id, info.change_counter,
+	       info.attr,
+	       dev, ids.f_fs_name);
+	putchar('\n');
+
+	children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
+	n = ch_size / sizeof(children[0]) - 1;
+
+	bar_buf[depth + 1] = '|';
+	if (depth > 0) {
+		tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
+		tree_buf[depth - 4 + 2] = ' ';
+	}
+
+	tree_buf[depth + 0] = ' ';
+	tree_buf[depth + 1] = '\\';
+	tree_buf[depth + 2] = '_';
+	tree_buf[depth + 3] = ' ';
+	tree_buf[depth + 4] = 0;
+	d = depth + 4;
+
+	for (i = 0; i < n; i++) {
+		if (i == n - 1)
+			bar_buf[depth + 1] = ' ';
+		path = get_attr_alloc(children[i].mnt_id, FSINFO_ATTR_MOUNT_POINT,
+				      0, &p_size);
+		display_mount(children[i].mnt_id, d, path + 1);
+		free(path);
+	}
+
+	free(children);
+	if (depth > 0) {
+		tree_buf[depth - 4 + 1] = '\\';
+		tree_buf[depth - 4 + 2] = '_';
+	}
+	tree_buf[depth] = 0;
+}
+
+/*
+ * Find the ID of whatever is at the nominated path.
+ */
+static unsigned int lookup_mnt_by_path(const char *path)
+{
+	struct fsinfo_mount_info mnt;
+	struct fsinfo_params params = {
+		.flags		= FSINFO_FLAGS_QUERY_PATH,
+		.request	= FSINFO_ATTR_MOUNT_INFO,
+	};
+
+	if (fsinfo(AT_FDCWD, path, &params, &mnt, sizeof(mnt)) == -1) {
+		perror(path);
+		exit(1);
+	}
+
+	return mnt.mnt_id;
+}
+
+/*
+ * Display the mount tree below a path or mount ID (the root by default).
+ */
+int main(int argc, char **argv)
+{
+	unsigned int mnt_id;
+	char *path;
+	bool use_mnt_id = false;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "M"))) {
+		switch (opt) {
+		case 'M':
+			use_mnt_id = true;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	switch (argc) {
+	case 0:
+		mnt_id = lookup_mnt_by_path("/");
+		path = "ROOT";
+		break;
+	case 1:
+		path = argv[0];
+		if (use_mnt_id) {
+			mnt_id = strtoul(argv[0], NULL, 0);
+			break;
+		}
+
+		mnt_id = lookup_mnt_by_path(argv[0]);
+		break;
+	default:
+		printf("Format: test-mntinfo\n");
+		printf("Format: test-mntinfo <path>\n");
+		printf("Format: test-mntinfo -M <mnt_id>\n");
+		exit(2);
+	}
+
+	printf("MOUNT                                 MOUNT ID   CHANGE#  AT DEV   TYPE\n");
+	printf("------------------------------------- ---------- -------- -- ----- --------\n");
+	display_mount(mnt_id, 0, path);
+	return 0;
+}




* [PATCH 12/17] fsinfo: Allow the mount topology propogation flags to be retrieved [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (10 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 11/17] fsinfo: sample: Mount listing program " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 13/17] fsinfo: Query superblock unique ID and notification counter " David Howells
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Allow the mount topology propagation flags to be retrieved as part of the
FSINFO_ATTR_MOUNT_INFO attributes.
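
For illustration only, the sketch below shows how a caller might turn the
new propagation field into text.  The flag values are copied from this
patch's change to include/uapi/linux/mount.h; describe_propagation() is a
hypothetical helper, not something the series provides.  Since
MOUNT_PROPAGATION_PRIVATE is zero, it is reported only when no other bit is
set.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values as defined by this patch in include/uapi/linux/mount.h. */
#ifndef MOUNT_PROPAGATION_SHARED
#define MOUNT_PROPAGATION_UNBINDABLE	0x00000001
#define MOUNT_PROPAGATION_SLAVE		0x00000002
#define MOUNT_PROPAGATION_PRIVATE	0x00000000
#define MOUNT_PROPAGATION_SHARED	0x00000004
#endif

/*
 * Hypothetical helper: render the propagation field of
 * struct fsinfo_mount_info as a comma-separated list in @buf.
 */
const char *describe_propagation(uint32_t prop, char *buf, size_t len)
{
	static const struct {
		uint32_t bit;
		const char *name;
	} names[] = {
		{ MOUNT_PROPAGATION_SHARED,	"shared" },
		{ MOUNT_PROPAGATION_SLAVE,	"slave" },
		{ MOUNT_PROPAGATION_UNBINDABLE,	"unbindable" },
	};
	size_t used = 0, i;

	if (len == 0)
		return buf;
	buf[0] = 0;

	if (prop == MOUNT_PROPAGATION_PRIVATE) {
		snprintf(buf, len, "private");
		return buf;
	}

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if (!(prop & names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* Output truncated; stop here */
		used += (size_t)n;
	}
	return buf;
}
```

A mount can be both shared and a slave at the same time, which is why the
field is treated as a bitmask rather than an enumeration.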

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namespace.c              |    7 ++++++-
 include/uapi/linux/fsinfo.h |    2 +-
 include/uapi/linux/mount.h  |   10 +++++++++-
 samples/vfs/test-fsinfo.c   |    1 +
 samples/vfs/test-mntinfo.c  |    8 ++++----
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fc22aea18e2d..bbfd6cd5c501 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4142,15 +4142,20 @@ int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
 			p->parent_id = p->mnt_id;
 		rcu_read_unlock();
 	}
-	if (IS_MNT_SHARED(m))
+	if (IS_MNT_SHARED(m)) {
 		p->group_id = m->mnt_group_id;
+		p->propagation |= MOUNT_PROPAGATION_SHARED;
+	}
 	if (IS_MNT_SLAVE(m)) {
 		int master = m->mnt_master->mnt_group_id;
 		int dom = get_dominating_id(m, &root);
 		p->master_id = master;
 		if (dom && dom != master)
 			p->from_id = dom;
+		p->propagation |= MOUNT_PROPAGATION_SLAVE;
 	}
+	if (IS_MNT_UNBINDABLE(m))
+		p->propagation |= MOUNT_PROPAGATION_UNBINDABLE;
 	path_put(&root);
 
 	flags = READ_ONCE(m->mnt.mnt_flags);
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 29940d110ce3..119c371697be 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -100,7 +100,7 @@ struct fsinfo_mount_info {
 	__u32		from_id;	/* Slave propagated from ID */
 	__u32		attr;		/* MOUNT_ATTR_* flags */
 	__u32		change_counter;	/* Number of changes applied. */
-	__u32		__reserved[1];
+	__u32		propagation;	/* MOUNT_PROPAGATION_* flags */
 };
 
 #define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..39e50fe9d8d9 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -105,7 +105,7 @@ enum fsconfig_command {
 #define FSMOUNT_CLOEXEC		0x00000001
 
 /*
- * Mount attributes.
+ * Mount object attributes (these are separate to filesystem attributes).
  */
 #define MOUNT_ATTR_RDONLY	0x00000001 /* Mount read-only */
 #define MOUNT_ATTR_NOSUID	0x00000002 /* Ignore suid and sgid bits */
@@ -117,4 +117,12 @@ enum fsconfig_command {
 #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
 #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
 
+/*
+ * Mount object propagation attributes.
+ */
+#define MOUNT_PROPAGATION_UNBINDABLE	0x00000001 /* Mount is unbindable */
+#define MOUNT_PROPAGATION_SLAVE		0x00000002 /* Mount is slave */
+#define MOUNT_PROPAGATION_PRIVATE	0x00000000 /* Mount is private (ie. not shared) */
+#define MOUNT_PROPAGATION_SHARED	0x00000004 /* Mount is shared */
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index f761ded6a52c..6a61f3426982 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -291,6 +291,7 @@ static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
 	printf("\tmnt_id  : %x\n", f->mnt_id);
 	printf("\tparent  : %x\n", f->parent_id);
 	printf("\tgroup   : %x\n", f->group_id);
+	printf("\tpropag  : %x\n", f->propagation);
 	printf("\tattr    : %x\n", f->attr);
 	printf("\tchanges : %x\n", f->change_counter);
 }
diff --git a/samples/vfs/test-mntinfo.c b/samples/vfs/test-mntinfo.c
index f4d90d0671c5..5a3d6b917447 100644
--- a/samples/vfs/test-mntinfo.c
+++ b/samples/vfs/test-mntinfo.c
@@ -135,9 +135,9 @@ static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
 	printf("%*.*s", s, s, "");
 
 	sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
-	printf("%10u %8x %2x %5s %s",
+	printf("%10u %8x %2x %x %5s %s",
 	       info.mnt_id, info.change_counter,
-	       info.attr,
+	       info.attr, info.propagation,
 	       dev, ids.f_fs_name);
 	putchar('\n');
 
@@ -236,8 +236,8 @@ int main(int argc, char **argv)
 		exit(2);
 	}
 
-	printf("MOUNT                                 MOUNT ID   CHANGE#  AT DEV   TYPE\n");
-	printf("------------------------------------- ---------- -------- -- ----- --------\n");
+	printf("MOUNT                                 MOUNT ID   CHANGE#  AT P DEV   TYPE\n");
+	printf("------------------------------------- ---------- -------- -- - ----- --------\n");
 	display_mount(mnt_id, 0, path);
 	return 0;
 }




* [PATCH 13/17] fsinfo: Query superblock unique ID and notification counter [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (11 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 12/17] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 14/17] fsinfo: Add API documentation " David Howells
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Provide an fsinfo attribute to query the superblock unique ID and
notification counter.  The unique ID is placed in notification events and
the counter is provided so that the changed superblock can be determined in
the event of a notification buffer overrun.  This is accessed with:

	struct fsinfo_params params = {
		.request = FSINFO_ATTR_SB_NOTIFICATIONS,
	};

and returns a structure that looks like:

	struct fsinfo_sb_notifications {
		__u64	watch_id;
		__u32	notify_counter;
		__u32	__reserved[1];
	};

Where watch_id is a number uniquely identifying the superblock in
notification records and notify_counter is incremented for each
superblock notification posted.
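
A minimal sketch of how a watcher might use these two fields after an
overrun follows.  It assumes the watcher saved a sample of the structure
earlier and counted the notifications it actually received for that
superblock since then.  The struct is a local mirror of
struct fsinfo_sb_notifications; sb_events_between() and sb_needs_rescan()
are hypothetical helpers and are not part of this patch.

```c
#include <stdint.h>

/* Local mirror of struct fsinfo_sb_notifications from this patch. */
struct sb_notifications {
	uint64_t watch_id;		/* Watch ID for superblock */
	uint32_t notify_counter;	/* Number of notifications */
	uint32_t reserved[1];
};

/*
 * Number of notifications posted between two samples of the same
 * superblock.  Unsigned subtraction copes with the counter wrapping.
 */
uint32_t sb_events_between(const struct sb_notifications *then,
			   const struct sb_notifications *now)
{
	return now->notify_counter - then->notify_counter;
}

/*
 * Hypothetical policy: the superblock needs rescanning if it is not the
 * one sampled earlier, or if more notifications were posted than the
 * watcher actually @received.
 */
int sb_needs_rescan(const struct sb_notifications *then,
		    const struct sb_notifications *now,
		    uint32_t received)
{
	if (then->watch_id != now->watch_id)
		return 1;	/* Different superblock: old state is stale */
	return sb_events_between(then, now) != received;
}
```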

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                      |   11 +++++++++++
 include/uapi/linux/fsinfo.h      |   12 ++++++++++++
 include/uapi/linux/watch_queue.h |    2 +-
 samples/vfs/test-fsinfo.c        |   10 ++++++++++
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index e3377842a2c1..4334249339f9 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -217,6 +217,16 @@ static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ct
 	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
 }
 
+static int fsinfo_generic_sb_notifications(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_sb_notifications *p = ctx->buffer;
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->watch_id		= sb->s_unique_id;
+	p->notify_counter	= atomic_read(&sb->s_notify_counter);
+	return sizeof(*p);
+}
+
 static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
@@ -226,6 +236,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		fsinfo_generic_features),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SB_NOTIFICATIONS,	fsinfo_generic_sb_notifications),
 
 	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
 	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 119c371697be..2f9280d16293 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -23,6 +23,7 @@
 #define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
 #define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
 #define FSINFO_ATTR_FEATURES		0x08	/* Filesystem features (bits) */
+#define FSINFO_ATTR_SB_NOTIFICATIONS	0x09	/* sb_notify() information */
 
 #define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
 #define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
@@ -286,4 +287,15 @@ struct fsinfo_volume_uuid {
 
 #define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SB_NOTIFICATIONS).
+ */
+struct fsinfo_sb_notifications {
+	__u64		watch_id;	/* Watch ID for superblock. */
+	__u32		notify_counter;	/* Number of notifications. */
+	__u32		__reserved[1];
+};
+
+#define FSINFO_ATTR_SB_NOTIFICATIONS__STRUCT struct fsinfo_sb_notifications
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index e9c37b1ae68d..9ac2ea6f4a75 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -151,7 +151,7 @@ enum superblock_notification_type {
  */
 struct superblock_notification {
 	struct watch_notification watch; /* WATCH_TYPE_SB_NOTIFY */
-	__u64	sb_id;			/* 64-bit superblock ID */
+	__u64	sb_id;			/* 64-bit superblock ID [FSINFO_ATTR_SB_NOTIFICATIONS] */
 };
 
 struct superblock_error_notification {
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 6a61f3426982..247fae5bbb74 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -303,6 +303,15 @@ static void dump_fsinfo_generic_mount_child(void *reply, unsigned int size)
 	printf("%8x %8x\n", f->mnt_id, f->change_counter);
 }
 
+static void dump_fsinfo_generic_sb_notifications(void *reply, unsigned int size)
+{
+	struct fsinfo_sb_notifications *f = reply;
+
+	printf("\n");
+	printf("\twatch_id: %llx\n", (unsigned long long)f->watch_id);
+	printf("\tnotifs  : %llx\n", (unsigned long long)f->notify_counter);
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -367,6 +376,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
 	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	fsinfo_generic_volume_name),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SB_NOTIFICATIONS,	fsinfo_generic_sb_notifications),
 
 	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
 	FSINFO_STRING	(FSINFO_ATTR_MOUNT_DEVNAME,	fsinfo_generic_mount_devname),




* [PATCH 14/17] fsinfo: Add API documentation [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (12 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 13/17] fsinfo: Query superblock unique ID and notification counter " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 15/17] fsinfo: Add support for AFS " David Howells
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add API documentation for fsinfo.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/fsinfo.rst |  491 ++++++++++++++++++++++++++++++++++
 1 file changed, 491 insertions(+)
 create mode 100644 Documentation/filesystems/fsinfo.rst

diff --git a/Documentation/filesystems/fsinfo.rst b/Documentation/filesystems/fsinfo.rst
new file mode 100644
index 000000000000..6283293d3bce
--- /dev/null
+++ b/Documentation/filesystems/fsinfo.rst
@@ -0,0 +1,491 @@
+============================
+Filesystem Information Query
+============================
+
+The fsinfo() system call allows the querying of filesystem and filesystem
+security information beyond what stat(), statx() and statfs() can obtain.  It
+does not require a file to be opened as does ioctl().
+
+fsinfo() may be called with a path, with an open file descriptor or with a mount
+object identifier.
+
+The fsinfo() system call must be enabled in the kernel configuration with:
+
+	"File systems"/"Enable the fsinfo() system call" (CONFIG_FSINFO)
+
+This document has the following sections:
+
+.. contents:: :local:
+
+
+Overview
+========
+
+The fsinfo() system call retrieves one of a number of attributes, the IDs of
+which can be found in include/uapi/linux/fsinfo.h::
+
+	FSINFO_ATTR_STATFS	- statfs()-style state
+	FSINFO_ATTR_IDS		- Filesystem IDs
+	FSINFO_ATTR_LIMITS	- Filesystem limits
+	...
+	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about an attribute
+	FSINFO_ATTR_FSINFO_ATTRIBUTES - List of available attributes
+	...
+	FSINFO_ATTR_MOUNT_INFO	- Information about the mount topology
+	...
+
+Each attribute can have zero or more values, which can be of one of the
+following types:
+
+ * ``VStruct``.  This is a structure with a version-dependent length.  New
+   versions of the kernel may append more fields, though they are not
+   permitted to remove or replace old ones.
+
+   Older applications, expecting an older version of the field, can ask for a
+   shorter struct and will only get the fields they requested; newer
+   applications running on an older kernel will get the extra fields they
+   requested filled with zeros.  Either way, the system call returns the size
+   of the internal struct, regardless of how much data it returned.
+
+   This allows for struct-type fields to be extended in future.
+
+ * ``String``.  This is a variable-length string of up to 4096 characters (no
+   NUL character is included).  The returned string will be truncated if the
+   output buffer is too small.  The total size of the string is returned,
+   regardless of any truncation.
+
+ * ``Opaque``.  This is a variable-length blob of indeterminate structure.  It
+   may be up to INT_MAX bytes in size.
+
+ * ``List``.  This is a variable-length list of fixed-size structures.  The
+   element size may not vary over time, so the element format must be designed
+   with care.  The maximum length is INT_MAX bytes, though this depends on the
+   kernel being able to allocate an internal buffer large enough.
+
+Value type is an inherent property of an attribute and all the values of an
+attribute must be of that type.  Each attribute can have a single value, a
+sequence of values or a sequence-of-sequences of values.
+
+
+Filesystem API
+==============
+
+If the filesystem wishes to provide a list of queryable attributes, it should
+set the table pointer in the superblock::
+
+	const struct fsinfo_attribute *fsinfo_attributes;
+
+terminating it with a blank entry.  Each entry is a ``struct fsinfo_attribute``
+and these can be created with a set of helper macros::
+
+	FSINFO_VSTRUCT(A,G)
+	FSINFO_VSTRUCT_N(A,G)
+	FSINFO_VSTRUCT_NM(A,G)
+	FSINFO_STRING(A,G)
+	FSINFO_STRING_N(A,G)
+	FSINFO_STRING_NM(A,G)
+	FSINFO_OPAQUE(A,G)
+	FSINFO_LIST(A,G)
+	FSINFO_LIST_N(A,G)
+
+The names of the macros are a combination of type (vstruct, string, opaque and
+list) and an optional qualifier, if the attribute has N values or N lots of M
+values.  ``A`` is the name of the attribute and ``G`` is a function to get a
+value for that attribute.
+
+For vstruct- and list-type attributes, it is expected that there is a macro
+defined with the name ``A##__STRUCT`` that indicates the structure or element
+type.
+
+The get function needs to match the following type::
+
+	int (*get)(struct path *path, struct fsinfo_context *ctx);
+
+where "path" indicates the object to be queried and ctx is a context describing
+the parameters and the output buffer.  The function should return the total
+size of the data it would like to produce or an error.
+
+The context struct looks like::
+
+	struct fsinfo_context {
+		__u32		requested_attr;
+		__u32		Nth;
+		__u32		Mth;
+		bool		want_size_only;
+		unsigned int	buf_size;
+		unsigned int	usage;
+		void		*buffer;
+		...
+	};
+
+The fields relevant to the filesystem are as follows:
+
+ * ``requested_attr``
+
+   Which attribute is being requested.  EOPNOTSUPP should be returned if the
+   attribute is not supported by the filesystem or the LSM.
+
+ * ``Nth`` and ``Mth``
+
+   Which value of an attribute is being requested.
+
+   For a single-value attribute Nth and Mth will both be 0.
+
+   For a "1D" attribute, Nth will indicate which value and Mth will always
+   be 0.  Take, for example, FSINFO_ATTR_AFS_SERVER_NAME - for a network
+   filesystem, the superblock will be backed by a number of servers.  This will
+   return the name of the Nth server.  ENODATA will be returned if Nth goes
+   beyond the end of the array.
+
+   For a "2D" attribute, Mth will indicate the index in the Nth set of values.
+   Take, for example, an attribute for a network filesystem that returns
+   server addresses - each server may have one or more addresses.  This could
+   return the Mth address of the Nth server.  ENODATA should be returned if the
+   Nth set doesn't exist or the Mth element of the Nth set doesn't exist.
+
+ * ``want_size_only``
+
+   Is set to true if the caller only wants the size of the value so that the
+   get function doesn't have to make expensive calculations or calls to
+   retrieve the value.
+
+ * ``buf_size``
+
+   This indicates the current size of the buffer.  For the list type and the
+   opaque type this will be increased if the current buffer won't hold the
+   value and the filesystem will be called again.
+
+ * ``usage``
+
+   This indicates how much of the buffer has been used so far for a list or
+   opaque type attribute.  This is updated by the fsinfo_note_param*()
+   functions.
+
+ * ``buffer``
+
+   This points to the output buffer.  For struct- and string-type attributes it
+   will always be big enough; for list- and opaque-type, it will be buf_size in
+   size and will be resized if the returned size is larger than this.
+
+To simplify filesystem code, there will always be at least a minimal buffer
+available if the ->fsinfo() method gets called - and the filesystem should
+always write what it can into the buffer.  It's possible that the fsinfo()
+system call will then throw the contents away and just return the length.
+
+
+Helper Functions
+================
+
+The API includes a number of helper functions:
+
+ * ``void fsinfo_set_feature(struct fsinfo_features *ft,
+			     enum fsinfo_feature feature);``
+
+   This function sets a feature flag.
+
+ * ``void fsinfo_clear_feature(struct fsinfo_features *ft,
+			       enum fsinfo_feature feature);``
+
+   This function clears a feature flag.
+
+ * ``void fsinfo_set_unix_features(struct fsinfo_features *ft);``
+
+   Set feature flags appropriate to the features of a standard UNIX filesystem,
+   such as having numeric UIDs and GIDs; allowing the creation of directories,
+   symbolic links, hard links, device files, FIFO and socket files; permitting
+   sparse files; and having access, change and modification times.
+
+
+Attribute Summary
+=================
+
+To summarise the attributes that are defined::
+
+  Symbolic name				Type
+  =====================================	===============
+  FSINFO_ATTR_STATFS			vstruct
+  FSINFO_ATTR_IDS			vstruct
+  FSINFO_ATTR_LIMITS			vstruct
+  FSINFO_ATTR_SUPPORTS			vstruct
+  FSINFO_ATTR_FEATURES			vstruct
+  FSINFO_ATTR_TIMESTAMP_INFO		vstruct
+  FSINFO_ATTR_VOLUME_ID			string
+  FSINFO_ATTR_VOLUME_UUID		vstruct
+  FSINFO_ATTR_VOLUME_NAME		string
+  FSINFO_ATTR_NAME_ENCODING		string
+  FSINFO_ATTR_NAME_CODEPAGE		string
+  FSINFO_ATTR_FSINFO			vstruct
+  FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO	vstruct
+  FSINFO_ATTR_FSINFO_ATTRIBUTES		list
+  FSINFO_ATTR_MOUNT_INFO		vstruct
+  FSINFO_ATTR_MOUNT_DEVNAME		string
+  FSINFO_ATTR_MOUNT_POINT		string
+  FSINFO_ATTR_MOUNT_CHILDREN		list
+  FSINFO_ATTR_AFS_CELL_NAME		string
+  FSINFO_ATTR_AFS_SERVER_NAME		N × string
+  FSINFO_ATTR_AFS_SERVER_ADDRESSES	N × list
+
+
+Attribute Catalogue
+===================
+
+A number of the attributes convey information about a filesystem superblock:
+
+ *  ``FSINFO_ATTR_STATFS``
+
+    This struct-type attribute gives most of the equivalent data to statfs(),
+    but with all the fields as unconditional 64-bit or 128-bit integers.  Note
+    that static data like IDs that don't change are retrieved with
+    FSINFO_ATTR_IDS instead.
+
+    Further, superblock flags (such as MS_RDONLY) are not exposed by this
+    attribute; rather the parameters must be listed and the attributes picked
+    out from that.
+
+ *  ``FSINFO_ATTR_IDS``
+
+    This struct-type attribute conveys various identifiers used by the target
+    filesystem.  This includes the filesystem name, the NFS filesystem ID, the
+    superblock ID used in notifications, the filesystem magic type number and
+    the primary device ID.
+
+ *  ``FSINFO_ATTR_LIMITS``
+
+    This struct-type attribute conveys the limits on various aspects of a
+    filesystem, such as maximum file, symlink and xattr sizes, maximum filename
+    and xattr name length, maximum number of symlinks, maximum device major and
+    minor numbers and maximum UID, GID and project ID numbers.
+
+ *  ``FSINFO_ATTR_SUPPORTS``
+
+    This struct-type attribute conveys information about the support the
+    filesystem has for various UAPI features of a filesystem.  This includes
+    information about which bits are supported in various masks employed by the
+    statx system call, what FS_IOC_* flags are supported by ioctls and what
+    DOS/Windows file attribute flags are supported.
+
+ *  ``FSINFO_ATTR_TIMESTAMP_INFO``
+
+    This struct-type attribute conveys information about the resolution and
+    range of the timestamps available in a filesystem.  The resolutions are
+    given as a mantissa and exponent (resolution = mantissa * 10^exponent
+    seconds), where the exponent can be negative to indicate a sub-second
+    resolution (-9 being nanoseconds, for example).
+
+ *  ``FSINFO_ATTR_VOLUME_ID``
+
+    This is a string-type attribute that conveys the superblock identifier for
+    the volume.  By default it will be filled in from the contents of s_id from
+    the superblock.  For a block-based filesystem, for example, this might be
+    the name of the primary block device.
+
+ *  ``FSINFO_ATTR_VOLUME_UUID``
+
+    This is a struct-type attribute that conveys the UUID identifier for the
+    volume.  By default it will be filled in from the contents of s_uuid from
+    the superblock.  If this doesn't exist, it will be entirely zeros.
+
+ *  ``FSINFO_ATTR_VOLUME_NAME``
+
+    This is a string-type attribute that conveys the name of the volume.  By
+    default it will return EOPNOTSUPP.  For a disk-based filesystem, it might
+    convey the partition label; for a network-based filesystem, it might convey
+    the name of the remote volume.
+
+ *  ``FSINFO_ATTR_FEATURES``
+
+    This is a special attribute, being a set of single-bit feature flags,
+    formatted as a struct-type attribute.  The meanings of the feature bits are
+    listed below - see the "Feature Bit Catalogue" section.  The feature bits
+    are grouped numerically into bytes, such that features 0-7 are in byte 0,
+    8-15 are in byte 1, 16-23 in byte 2 and so on.
+
+    Any feature bit that's not supported by the kernel will be set to false if
+    asked for.  The highest supported feature can be obtained from attribute
+    "FSINFO_ATTR_FSINFO".
+
+
+Some attributes give information about fsinfo itself:
+
+ *  ``FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO``
+
+    This struct-type attribute gives metadata about the attribute with the ID
+    specified by the Nth parameter, including its type, default size and
+    element size.
+
+ *  ``FSINFO_ATTR_FSINFO_ATTRIBUTES``
+
+    This list-type attribute gives a list of the attribute IDs available at the
+    point of reference.  FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO can then be used to
+    query each attribute.
+
+ *  ``FSINFO_ATTR_FSINFO``
+
+    This struct-type attribute gives information about the fsinfo() system call
+    itself, including the maximum number of feature bits supported.
+
+
+Then there are filesystem-specific attributes, e.g.:
+
+ *  ``FSINFO_ATTR_AFS_CELL_NAME``
+
+    This is a string-type attribute that retrieves the AFS cell name of the
+    target object.
+
+ *  ``FSINFO_ATTR_AFS_SERVER_NAME``
+
+    This is a string-type attribute that conveys the name of the Nth server
+    backing a network-filesystem superblock.
+
+ *  ``FSINFO_ATTR_AFS_SERVER_ADDRESSES``
+
+    This is a list-type attribute that conveys the addresses of the Nth
+    server, as named by FSINFO_ATTR_AFS_SERVER_NAME.
+
+
+Feature Bit Catalogue
+=====================
+
+The feature bits convey single true/false assertions about a specific instance
+of a filesystem (ie. a specific superblock).  They are accessed using the
+"FSINFO_ATTR_FEATURES" attribute:
+
+ *  ``FSINFO_FEAT_IS_KERNEL_FS``
+ *  ``FSINFO_FEAT_IS_BLOCK_FS``
+ *  ``FSINFO_FEAT_IS_FLASH_FS``
+ *  ``FSINFO_FEAT_IS_NETWORK_FS``
+ *  ``FSINFO_FEAT_IS_AUTOMOUNTER_FS``
+ *  ``FSINFO_FEAT_IS_MEMORY_FS``
+
+    These indicate what kind of filesystem the target is: kernel API (proc),
+    block-based (ext4), flash/nvm-based (jffs2), remote over the network (NFS),
+    local quasi-filesystem that acts as a tray of mountpoints (autofs), plain
+    in-memory filesystem (shmem).
+
+ *  ``FSINFO_FEAT_AUTOMOUNTS``
+
+    This indicates if a filesystem may have objects that are automount points.
+
+ *  ``FSINFO_FEAT_ADV_LOCKS``
+ *  ``FSINFO_FEAT_MAND_LOCKS``
+ *  ``FSINFO_FEAT_LEASES``
+
+    These indicate if a filesystem supports advisory locks, mandatory locks or
+    leases.
+
+ *  ``FSINFO_FEAT_UIDS``
+ *  ``FSINFO_FEAT_GIDS``
+ *  ``FSINFO_FEAT_PROJIDS``
+
+    These indicate if a filesystem supports/stores/transports numeric user IDs,
+    group IDs or project IDs.  The "FSINFO_ATTR_LIMITS" attribute can be used
+    to find out the upper limits on the ID values.
+
+ *  ``FSINFO_FEAT_STRING_USER_IDS``
+
+    This indicates if a filesystem supports/stores/transports string user
+    identifiers.
+
+ *  ``FSINFO_FEAT_GUID_USER_IDS``
+
+    This indicates if a filesystem supports/stores/transports Windows GUIDs as
+    user identifiers (eg. ntfs).
+
+ *  ``FSINFO_FEAT_WINDOWS_ATTRS``
+
+    This indicates if a filesystem supports Windows FILE_* attribute bits
+    (eg. cifs, jfs).  The "FSINFO_ATTR_SUPPORTS" attribute can be used to find
+    out which windows file attributes are supported by the filesystem.
+
+ *  ``FSINFO_FEAT_USER_QUOTAS``
+ *  ``FSINFO_FEAT_GROUP_QUOTAS``
+ *  ``FSINFO_FEAT_PROJECT_QUOTAS``
+
+    These indicate if a filesystem supports quotas for users, groups or
+    projects.
+
+ *  ``FSINFO_FEAT_XATTRS``
+
+    This indicates if a filesystem supports extended attributes.  The
+    "FSINFO_ATTR_LIMITS" attribute can be used to find out the upper limits on
+    the supported name and body lengths.
+
+ *  ``FSINFO_FEAT_JOURNAL``
+ *  ``FSINFO_FEAT_DATA_IS_JOURNALLED``
+
+    These indicate whether the filesystem has a journal and whether data
+    changes are logged to it.
+
+ *  ``FSINFO_FEAT_O_SYNC``
+ *  ``FSINFO_FEAT_O_DIRECT``
+
+    These indicate whether the filesystem supports the O_SYNC and O_DIRECT
+    flags.
+
+ *  ``FSINFO_FEAT_VOLUME_ID``
+ *  ``FSINFO_FEAT_VOLUME_UUID``
+ *  ``FSINFO_FEAT_VOLUME_NAME``
+ *  ``FSINFO_FEAT_VOLUME_FSID``
+
+    These indicate whether ID, UUID, name and FSID identifiers actually exist
+    in the filesystem and thus might be considered persistent.
+
+ *  ``FSINFO_FEAT_IVER_ALL_CHANGE``
+ *  ``FSINFO_FEAT_IVER_DATA_CHANGE``
+ *  ``FSINFO_FEAT_IVER_MONO_INCR``
+
+    These indicate whether i_version in the inode is supported and, if so, what
+    mode it operates in.  The first two indicate if it's changed for any data
+    or metadata change, or whether it's only changed for any data changes; the
+    last indicates whether or not it's monotonically increasing for each such
+    change.
+
+ *  ``FSINFO_FEAT_HARD_LINKS``
+ *  ``FSINFO_FEAT_HARD_LINKS_1DIR``
+
+    These indicate whether the filesystem can have hard links made in it, and
+    whether they can be made between directories or only within the same
+    directory.
+
+ *  ``FSINFO_FEAT_DIRECTORIES``
+ *  ``FSINFO_FEAT_SYMLINKS``
+ *  ``FSINFO_FEAT_DEVICE_FILES``
+ *  ``FSINFO_FEAT_UNIX_SPECIALS``
+
+    These indicate whether directories; symbolic links; device files; or pipes
+    and sockets can be made within the filesystem.
+
+ *  ``FSINFO_FEAT_RESOURCE_FORKS``
+
+    This indicates if the filesystem supports resource forks.
+
+ *  ``FSINFO_FEAT_NAME_CASE_INDEP``
+ *  ``FSINFO_FEAT_NAME_NON_UTF8``
+ *  ``FSINFO_FEAT_NAME_HAS_CODEPAGE``
+
+    These indicate if the filesystem supports case-independent file names,
+    whether the filenames are non-utf8 (see the "FSINFO_ATTR_NAME_ENCODING"
+    attribute) and whether a codepage is in use to transliterate them (see
+    the "FSINFO_ATTR_NAME_CODEPAGE" attribute).
+
+ *  ``FSINFO_FEAT_SPARSE``
+
+    This indicates if a filesystem supports sparse files.
+
+ *  ``FSINFO_FEAT_NOT_PERSISTENT``
+
+    This indicates if a filesystem is not persistent.
+
+ *  ``FSINFO_FEAT_NO_UNIX_MODE``
+
+    This indicates if a filesystem doesn't support UNIX mode bits (though they
+    may be manufactured from other bits, such as Windows file attribute flags).
+
+ *  ``FSINFO_FEAT_HAS_ATIME``
+ *  ``FSINFO_FEAT_HAS_BTIME``
+ *  ``FSINFO_FEAT_HAS_CTIME``
+ *  ``FSINFO_FEAT_HAS_MTIME``
+
+    These indicate which timestamps a filesystem supports (access, birth,
+    change, modify).  The range and resolutions can be queried with the
+    "FSINFO_ATTR_TIMESTAMP_INFO" attribute.




* [PATCH 15/17] fsinfo: Add support for AFS [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (13 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 14/17] fsinfo: Add API documentation " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:03 ` [PATCH 16/17] fsinfo: Add example support for Ext4 " David Howells
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add fsinfo support to the AFS filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/internal.h           |    1 
 fs/afs/super.c              |  218 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fsinfo.h |   15 +++
 samples/vfs/test-fsinfo.c   |   51 ++++++++++
 4 files changed, 283 insertions(+), 2 deletions(-)

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 1d81fc4c3058..b4b2a8a18e9f 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -248,6 +248,7 @@ struct afs_super_info {
 	struct afs_volume	*volume;	/* volume record */
 	enum afs_flock_mode	flock_mode:8;	/* File locking emulation mode */
 	bool			dyn_root;	/* True if dynamic root */
+	bool			autocell;	/* True if autocell */
 };
 
 static inline struct afs_super_info *AFS_FS_S(struct super_block *sb)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index dda7a9a66848..969248a192a2 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -26,9 +26,13 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/magic.h>
+#include <linux/fsinfo.h>
 #include <net/net_namespace.h>
 #include "internal.h"
 
+#ifdef CONFIG_FSINFO
+static int afs_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
 static void afs_i_init_once(void *foo);
 static void afs_kill_super(struct super_block *sb);
 static struct inode *afs_alloc_inode(struct super_block *sb);
@@ -54,6 +58,9 @@ int afs_net_id;
 
 static const struct super_operations afs_super_ops = {
 	.statfs		= afs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= afs_fsinfo,
+#endif
 	.alloc_inode	= afs_alloc_inode,
 	.drop_inode	= afs_drop_inode,
 	.destroy_inode	= afs_destroy_inode,
@@ -193,7 +200,7 @@ static int afs_show_options(struct seq_file *m, struct dentry *root)
 
 	if (as->dyn_root)
 		seq_puts(m, ",dyn");
-	if (test_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(d_inode(root))->flags))
+	if (as->autocell)
 		seq_puts(m, ",autocell");
 	switch (as->flock_mode) {
 	case afs_flock_mode_unset:	break;
@@ -458,7 +465,7 @@ static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	if (ctx->autocell || as->dyn_root)
+	if (as->autocell || as->dyn_root)
 		set_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags);
 
 	ret = -ENOMEM;
@@ -498,6 +505,8 @@ static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
 			as->cell = afs_get_cell(ctx->cell);
 			as->volume = __afs_get_volume(ctx->volume);
 		}
+		if (ctx->autocell)
+			as->autocell = true;
 	}
 	return as;
 }
@@ -760,3 +769,208 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return ret;
 }
+
+#ifdef CONFIG_FSINFO
+static const struct fsinfo_timestamp_info afs_timestamp_info = {
+	.atime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+static int afs_fsinfo_get_timestamp(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_timestamp_info *tsinfo = ctx->buffer;
+	*tsinfo = afs_timestamp_info;
+	return sizeof(*tsinfo);
+}
+
+static int afs_fsinfo_get_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_limits *lim = ctx->buffer;
+
+	lim->max_file_size.hi	= 0;
+	lim->max_file_size.lo	= MAX_LFS_FILESIZE;
+	/* Inode numbers can be 96-bit on YFS, but that's hard to determine. */
+	lim->max_ino.hi		= 0;
+	lim->max_ino.lo		= UINT_MAX;
+	lim->max_hard_links	= UINT_MAX;
+	lim->max_uid		= UINT_MAX;
+	lim->max_gid		= UINT_MAX;
+	lim->max_filename_len	= AFSNAMEMAX - 1;
+	lim->max_symlink_len	= AFSPATHMAX - 1;
+	return sizeof(*lim);
+}
+
+static int afs_fsinfo_get_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_supports *p = ctx->buffer;
+
+	p->stx_mask = (STATX_TYPE | STATX_MODE |
+		       STATX_NLINK |
+		       STATX_UID | STATX_GID |
+		       STATX_MTIME | STATX_INO |
+		       STATX_SIZE);
+	p->stx_attributes = STATX_ATTR_AUTOMOUNT;
+	return sizeof(*p);
+}
+
+static int afs_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+
+	fsinfo_set_feature(p, FSINFO_FEAT_IS_NETWORK_FS);
+	fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	fsinfo_set_feature(p, FSINFO_FEAT_ADV_LOCKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_UIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_GIDS);
+	fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_ID);
+	fsinfo_set_feature(p, FSINFO_FEAT_VOLUME_NAME);
+	fsinfo_set_feature(p, FSINFO_FEAT_IVER_MONO_INCR);
+	fsinfo_set_feature(p, FSINFO_FEAT_SYMLINKS);
+	fsinfo_set_feature(p, FSINFO_FEAT_HARD_LINKS_1DIR);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_MTIME);
+	fsinfo_set_feature(p, FSINFO_FEAT_HAS_INODE_NUMBERS);
+	return sizeof(*p);
+}
+
+static int afs_dyn_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_features *p = ctx->buffer;
+
+	fsinfo_set_feature(p, FSINFO_FEAT_IS_AUTOMOUNTER_FS);
+	fsinfo_set_feature(p, FSINFO_FEAT_AUTOMOUNTS);
+	return sizeof(*p);
+}
+
+static int afs_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_volume *volume = as->volume;
+
+	memcpy(ctx->buffer, volume->name, volume->name_len);
+	return volume->name_len;
+}
+
+static int afs_fsinfo_get_cell_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_cell *cell = as->cell;
+
+	memcpy(ctx->buffer, cell->name, cell->name_len);
+	return cell->name_len;
+}
+
+static int afs_fsinfo_get_server_name(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_server_list *slist;
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_volume *volume = as->volume;
+	struct afs_server *server;
+	int ret = -ENODATA;
+
+	read_lock(&volume->servers_lock);
+	slist = volume->servers;
+	if (slist) {
+		if (ctx->Nth < slist->nr_servers) {
+			server = slist->servers[ctx->Nth].server;
+			ret = sprintf(ctx->buffer, "%pU", &server->uuid);
+		}
+	}
+
+	read_unlock(&volume->servers_lock);
+	return ret;
+}
+
+static int afs_fsinfo_get_server_address(struct path *path, struct fsinfo_context *ctx)
+{
+	struct fsinfo_afs_server_address *p = ctx->buffer;
+	struct afs_server_list *slist;
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	struct afs_addr_list *alist;
+	struct afs_volume *volume = as->volume;
+	struct afs_server *server;
+	struct afs_net *net = afs_d2net(path->dentry);
+	unsigned int i;
+	int ret = -ENODATA;
+
+	read_lock(&volume->servers_lock);
+	slist = afs_get_serverlist(volume->servers);
+	read_unlock(&volume->servers_lock);
+
+	if (ctx->Nth >= slist->nr_servers)
+		goto put_slist;
+	server = slist->servers[ctx->Nth].server;
+
+	read_lock(&server->fs_lock);
+	alist = afs_get_addrlist(rcu_dereference_protected(
+					 server->addresses,
+					 lockdep_is_held(&server->fs_lock)));
+	read_unlock(&server->fs_lock);
+	if (!alist)
+		goto put_slist;
+
+	ret = alist->nr_addrs * sizeof(*p);
+	if (ret <= ctx->buf_size) {
+		for (i = 0; i < alist->nr_addrs; i++)
+			memcpy(&p[i].address, &alist->addrs[i],
+			       sizeof(struct sockaddr_rxrpc));
+	}
+
+	afs_put_addrlist(alist);
+put_slist:
+	afs_put_serverlist(net, slist);
+	return ret;
+}
+
+static const struct fsinfo_attribute afs_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	afs_fsinfo_get_timestamp),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		afs_fsinfo_get_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		afs_fsinfo_get_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		afs_fsinfo_get_features),
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	afs_fsinfo_get_volume_name),
+	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	afs_fsinfo_get_cell_name),
+	FSINFO_STRING_N	(FSINFO_ATTR_AFS_SERVER_NAME,	afs_fsinfo_get_server_name),
+	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_get_server_address),
+	{}
+};
+
+static const struct fsinfo_attribute afs_dyn_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT(FSINFO_ATTR_TIMESTAMP_INFO,	afs_fsinfo_get_timestamp),
+	FSINFO_VSTRUCT(FSINFO_ATTR_FEATURES,		afs_dyn_fsinfo_get_features),
+	{}
+};
+
+static int afs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	struct afs_super_info *as = AFS_FS_S(path->dentry->d_sb);
+	int ret;
+
+	if (as->dyn_root)
+		ret = fsinfo_get_attribute(path, ctx, afs_dyn_fsinfo_attributes);
+	else
+		ret = fsinfo_get_attribute(path, ctx, afs_fsinfo_attributes);
+	return ret;
+}
+
+#endif /* CONFIG_FSINFO */
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 2f9280d16293..a587b6f9847c 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -33,6 +33,10 @@
 #define FSINFO_ATTR_MOUNT_POINT		0x202	/* Relative path of mount in parent (string) */
 #define FSINFO_ATTR_MOUNT_CHILDREN	0x203	/* Children of this mount (list) */
 
+#define FSINFO_ATTR_AFS_CELL_NAME	0x300	/* AFS cell name (string) */
+#define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
+#define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302	/* List of addresses of the Nth server */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -298,4 +302,15 @@ struct fsinfo_sb_notifications {
 
 #define FSINFO_ATTR_SB_NOTIFICATIONS__STRUCT struct fsinfo_sb_notifications
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_AFS_SERVER_ADDRESSES).
+ *
+ * Get the addresses of the Nth server for a network filesystem.
+ */
+struct fsinfo_afs_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+#define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 247fae5bbb74..f0dc90fdd49d 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -23,6 +23,7 @@
 #include <linux/socket.h>
 #include <sys/stat.h>
 #include <arpa/inet.h>
+#include <linux/rxrpc.h>
 
 #ifndef __NR_fsinfo
 #define __NR_fsinfo -1
@@ -312,6 +313,50 @@ static void dump_fsinfo_generic_sb_notifications(void *reply, unsigned int size)
 	printf("\tnotifs  : %llx\n", (unsigned long long)f->notify_counter);
 }
 
+static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
+{
+	struct fsinfo_afs_server_address *f = reply;
+	struct sockaddr_storage *ss = (struct sockaddr_storage *)&f->address;
+	struct sockaddr_rxrpc *srx;
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	char proto[32] = "?", buf[1024];
+
+	if (ss->ss_family == AF_RXRPC) {
+		srx = (struct sockaddr_rxrpc *)ss;
+		printf("%5u ", srx->srx_service);
+		switch (srx->transport_type) {
+		case SOCK_DGRAM:
+			sprintf(proto, "udp");
+			break;
+		case SOCK_STREAM:
+			sprintf(proto, "tcp");
+			break;
+		default:
+			sprintf(proto, "%3u", srx->transport_type);
+			break;
+		}
+		ss = (struct sockaddr_storage *)&srx->transport;
+	}
+
+	switch (ss->ss_family) {
+	case AF_INET:
+		sin = (struct sockaddr_in *)ss;
+		if (!inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u/%s %s\n", ntohs(sin->sin_port), proto, buf);
+		return;
+	case AF_INET6:
+		sin6 = (struct sockaddr_in6 *)ss;
+		if (!inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u/%s %s\n", ntohs(sin6->sin6_port), proto, buf);
+		return;
+	}
+
+	printf("family=%u\n", ss->ss_family);
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -341,6 +386,8 @@ static void dump_string(void *reply, unsigned int size)
 #define dump_fsinfo_generic_volume_name		dump_string
 #define dump_fsinfo_generic_mount_devname	dump_string
 #define dump_fsinfo_generic_mount_point		dump_string
+#define dump_afs_cell_name			dump_string
+#define dump_afs_server_name			dump_string
 
 /*
  *
@@ -382,6 +429,10 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_MOUNT_DEVNAME,	fsinfo_generic_mount_devname),
 	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_child),
 	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
+
+	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	afs_cell_name),
+	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	afs_server_name),
+	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
 	{}
 };
 



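
For reference, the list-attribute convention that afs_fsinfo_get_server_address()
appears to follow in the patch above (return the number of bytes the full list
needs, and copy only when the supplied buffer is large enough) can be sketched
in userspace roughly as below.  This is an illustration only: struct addr_entry
and fill_list() are hypothetical names, not part of the patch set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one fixed-size list element. */
struct addr_entry {
	unsigned char bytes[16];
};

/*
 * Return the number of bytes needed to hold all n entries.  The entries are
 * copied only if the whole list fits; otherwise the buffer is left untouched
 * so that the caller can retry with a buffer of the returned size.
 */
static size_t fill_list(void *buf, size_t buf_size,
			const struct addr_entry *src, size_t n)
{
	size_t need = n * sizeof(*src);

	if (need <= buf_size)
		memcpy(buf, src, need);
	return need;
}
```

A caller would compare the return value with its buffer size and, if the
return value is larger, reallocate and call again.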

* [PATCH 16/17] fsinfo: Add example support for Ext4 [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (14 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 15/17] fsinfo: Add support for AFS " David Howells
@ 2020-02-21 18:03 ` " David Howells
  2020-02-21 18:04 ` [PATCH 17/17] fsinfo: Add example support for NFS " David Howells
  2020-02-21 20:21 ` [PATCH 00/17] VFS: Filesystem information and notifications " James Bottomley
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:03 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add the ability to list some Ext4 volume timestamps as an example.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-ext4@vger.kernel.org
---

 fs/ext4/Makefile            |    1 +
 fs/ext4/ext4.h              |    6 ++++++
 fs/ext4/fsinfo.c            |   45 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c             |    3 +++
 include/uapi/linux/fsinfo.h |   16 +++++++++++++++
 samples/vfs/test-fsinfo.c   |   35 +++++++++++++++++++++++++++++++++
 6 files changed, 106 insertions(+)
 create mode 100644 fs/ext4/fsinfo.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 4ccb3c9189d8..71d5b460c7c7 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -16,3 +16,4 @@ ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
 ext4-inode-test-objs			+= inode-test.o
 obj-$(CONFIG_EXT4_KUNIT_TESTS)		+= ext4-inode-test.o
 ext4-$(CONFIG_FS_VERITY)		+= verity.o
+ext4-$(CONFIG_FSINFO)			+= fsinfo.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a2ee2428ecc..461968a87cd6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -42,6 +42,7 @@
 
 #include <linux/fscrypt.h>
 #include <linux/fsverity.h>
+#include <linux/fsinfo.h>
 
 #include <linux/compiler.h>
 
@@ -3166,6 +3167,11 @@ extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
 extern int ext4_find_inline_data_nolock(struct inode *inode);
diff --git a/fs/ext4/fsinfo.c b/fs/ext4/fsinfo.c
new file mode 100644
index 000000000000..785f82a74dc9
--- /dev/null
+++ b/fs/ext4/fsinfo.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for ext4
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/mount.h>
+#include "ext4.h"
+
+static int ext4_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+
+	memcpy(ctx->buffer, es->s_volume_name, sizeof(es->s_volume_name));
+	return strlen(ctx->buffer);
+}
+
+static int ext4_fsinfo_get_timestamps(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+	struct fsinfo_ext4_timestamps *ts = ctx->buffer;
+
+#define Z(R,S) R = le32_to_cpu(S) | (((u64)S##_hi) << 32)
+	Z(ts->mkfs_time,	es->s_mkfs_time);
+	Z(ts->mount_time,	es->s_mtime);
+	Z(ts->write_time,	es->s_wtime);
+	Z(ts->last_check_time,	es->s_lastcheck);
+	Z(ts->first_error_time,	es->s_first_error_time);
+	Z(ts->last_error_time,	es->s_last_error_time);
+	return sizeof(*ts);
+}
+
+static const struct fsinfo_attribute ext4_fsinfo_attributes[] = {
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	ext4_fsinfo_get_volume_name),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_get_timestamps),
+	{}
+};
+
+int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_get_attribute(path, ctx, ext4_fsinfo_attributes);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8434217549b3..02b4df073c4b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1477,6 +1477,9 @@ static const struct super_operations ext4_sops = {
 	.freeze_fs	= ext4_freeze,
 	.unfreeze_fs	= ext4_unfreeze,
 	.statfs		= ext4_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= ext4_fsinfo,
+#endif
 	.remount_fs	= ext4_remount,
 	.show_options	= ext4_show_options,
 #ifdef CONFIG_QUOTA
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index a587b6f9847c..6a8a7a8e4910 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -37,6 +37,8 @@
 #define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302	/* List of addresses of the Nth server */
 
+#define FSINFO_ATTR_EXT4_TIMESTAMPS	0x400	/* Ext4 superblock timestamps */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -313,4 +315,18 @@ struct fsinfo_afs_server_address {
 
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_EXT4_TIMESTAMPS).
+ */
+struct fsinfo_ext4_timestamps {
+	__u64		mkfs_time;
+	__u64		mount_time;
+	__u64		write_time;
+	__u64		last_check_time;
+	__u64		first_error_time;
+	__u64		last_error_time;
+};
+
+#define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index f0dc90fdd49d..df8d2449fc22 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -357,6 +357,40 @@ static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
 	printf("family=%u\n", ss->ss_family);
 }
 
+static char *dump_ext4_time(char *buffer, time_t tim)
+{
+	struct tm tm;
+	int len;
+
+	if (tim == 0)
+		return "-";
+
+	if (!localtime_r(&tim, &tm)) {
+		perror("localtime_r");
+		exit(1);
+	}
+	len = strftime(buffer, 100, "%F %T", &tm);
+	if (len == 0) {
+		perror("strftime");
+		exit(1);
+	}
+	return buffer;
+}
+
+static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
+{
+	struct fsinfo_ext4_timestamps *r = reply;
+	char buffer[100];
+
+	printf("\n");
+	printf("\tmkfs    : %s\n", dump_ext4_time(buffer, r->mkfs_time));
+	printf("\tmount   : %s\n", dump_ext4_time(buffer, r->mount_time));
+	printf("\twrite   : %s\n", dump_ext4_time(buffer, r->write_time));
+	printf("\tfsck    : %s\n", dump_ext4_time(buffer, r->last_check_time));
+	printf("\t1st-err : %s\n", dump_ext4_time(buffer, r->first_error_time));
+	printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -433,6 +467,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	afs_cell_name),
 	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	afs_server_name),
 	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_timestamps),
 	{}
 };
 



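
As a small illustration of what the Z() macro in fs/ext4/fsinfo.c above is
doing: each ext4 superblock timestamp is stored as a 32-bit low part plus an
8-bit "_hi" extension, and the two are merged into one 64-bit value.  The
sketch below assumes the low part has already been converted from its on-disk
little-endian form to CPU byte order; ext4_time_combine() is a hypothetical
helper name, not something in the patch.

```c
#include <stdint.h>

/*
 * Merge the low 32 bits of a timestamp with its 8-bit high extension,
 * giving a 40-bit seconds count in a 64-bit integer.
 */
static uint64_t ext4_time_combine(uint32_t lo, uint8_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

With hi == 0 the result is just the classic 32-bit value; a non-zero hi
extends the range past 2106.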

* [PATCH 17/17] fsinfo: Add example support for NFS [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (15 preceding siblings ...)
  2020-02-21 18:03 ` [PATCH 16/17] fsinfo: Add example support for Ext4 " David Howells
@ 2020-02-21 18:04 ` " David Howells
  2020-02-21 20:21 ` [PATCH 00/17] VFS: Filesystem information and notifications " James Bottomley
  17 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-21 18:04 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Add the ability to list NFS server addresses and hostname, timestamp
information and capabilities as an example.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-nfs@vger.kernel.org
---

 fs/nfs/Makefile              |    1 
 fs/nfs/fsinfo.c              |  230 ++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/internal.h            |    6 +
 fs/nfs/nfs4super.c           |    3 +
 fs/nfs/super.c               |    3 +
 include/uapi/linux/fsinfo.h  |   29 +++++
 include/uapi/linux/windows.h |   35 ++++++
 samples/vfs/test-fsinfo.c    |   40 +++++++
 8 files changed, 347 insertions(+)
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/uapi/linux/windows.h

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 2433c3e03cfa..20fbc9596833 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -13,6 +13,7 @@ nfs-y 			:= client.o dir.o file.o getroot.o inode.o super.o \
 nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o
 nfs-$(CONFIG_SYSCTL)	+= sysctl.o
 nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
+nfs-$(CONFIG_FSINFO)	+= fsinfo.o
 
 obj-$(CONFIG_NFS_V2) += nfsv2.o
 nfsv2-y := nfs2super.o proc.o nfs2xdr.o
diff --git a/fs/nfs/fsinfo.c b/fs/nfs/fsinfo.c
new file mode 100644
index 000000000000..a0299ec27efd
--- /dev/null
+++ b/fs/nfs/fsinfo.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for NFS
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/nfs_fs.h>
+#include <linux/windows.h>
+#include "internal.h"
+
+static const struct fsinfo_timestamp_info nfs_timestamp_info = {
+	.atime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.mtime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.ctime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+	.btime = {
+		.minimum	= 0,
+		.maximum	= UINT_MAX,
+		.gran_mantissa	= 1,
+		.gran_exponent	= 0,
+	},
+};
+
+static int nfs_fsinfo_get_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_timestamp_info *r = ctx->buffer;
+	unsigned long long nsec;
+	unsigned int rem, mant;
+	int exp = -9;
+
+	*r = nfs_timestamp_info;
+
+	nsec = server->time_delta.tv_nsec;
+	nsec += server->time_delta.tv_sec * 1000000000ULL;
+	if (nsec == 0)
+		goto out;
+
+	do {
+		mant = nsec;
+		rem = do_div(nsec, 10);
+		if (rem)
+			break;
+		exp++;
+	} while (nsec);
+
+	r->atime.gran_mantissa = mant;
+	r->atime.gran_exponent = exp;
+	r->btime.gran_mantissa = mant;
+	r->btime.gran_exponent = exp;
+	r->ctime.gran_mantissa = mant;
+	r->ctime.gran_exponent = exp;
+	r->mtime.gran_mantissa = mant;
+	r->mtime.gran_exponent = exp;
+
+out:
+	return sizeof(*r);
+}
+
+static int nfs_fsinfo_get_info(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+	struct fsinfo_nfs_info *r = ctx->buffer;
+
+	r->version		= clp->rpc_ops->version;
+	r->minor_version	= clp->cl_minorversion;
+	r->transport_proto	= clp->cl_proto;
+	return sizeof(*r);
+}
+
+static int nfs_fsinfo_get_server_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+
+	return fsinfo_string(clp->cl_hostname, ctx);
+}
+
+static int nfs_fsinfo_get_server_addresses(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+	struct fsinfo_nfs_server_address *addr = ctx->buffer;
+	int ret;
+
+	ret = 1 * sizeof(*addr);
+	if (ret <= ctx->buf_size)
+		memcpy(&addr[0].address, &clp->cl_addr, clp->cl_addrlen);
+	return ret;
+
+}
+
+static int nfs_fsinfo_get_gssapi_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	const struct nfs_client *clp = server->nfs_client;
+
+	return fsinfo_string(clp->cl_acceptor, ctx);
+}
+
+static int nfs_fsinfo_get_limits(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_limits *lim = ctx->buffer;
+
+	lim->max_file_size.hi	= 0;
+	lim->max_file_size.lo	= server->maxfilesize;
+	lim->max_ino.hi		= 0;
+	lim->max_ino.lo		= U64_MAX;
+	lim->max_hard_links	= UINT_MAX;
+	lim->max_uid		= UINT_MAX;
+	lim->max_gid		= UINT_MAX;
+	lim->max_filename_len	= NAME_MAX - 1;
+	lim->max_symlink_len	= PATH_MAX - 1;
+	return sizeof(*lim);
+}
+
+static int nfs_fsinfo_get_supports(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_supports *sup = ctx->buffer;
+
+	/* Don't set STATX_INO as i_ino is fabricated and may not be unique. */
+
+	if (server->caps & NFS_CAP_MODE)
+		sup->stx_mask |= STATX_TYPE | STATX_MODE;
+	if (server->caps & NFS_CAP_OWNER)
+		sup->stx_mask |= STATX_UID;
+	if (server->caps & NFS_CAP_OWNER_GROUP)
+		sup->stx_mask |= STATX_GID;
+	if (server->caps & NFS_CAP_ATIME)
+		sup->stx_mask |= STATX_ATIME;
+	if (server->caps & NFS_CAP_CTIME)
+		sup->stx_mask |= STATX_CTIME;
+	if (server->caps & NFS_CAP_MTIME)
+		sup->stx_mask |= STATX_MTIME;
+	if (server->attr_bitmask[0] & FATTR4_WORD0_SIZE)
+		sup->stx_mask |= STATX_SIZE;
+	if (server->attr_bitmask[1] & FATTR4_WORD1_NUMLINKS)
+		sup->stx_mask |= STATX_NLINK;
+
+	if (server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE)
+		sup->win_file_attrs |= ATTR_ARCHIVE;
+	if (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN)
+		sup->win_file_attrs |= ATTR_HIDDEN;
+	if (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM)
+		sup->win_file_attrs |= ATTR_SYSTEM;
+
+	sup->stx_attributes = STATX_ATTR_AUTOMOUNT;
+	return sizeof(*sup);
+}
+
+static int nfs_fsinfo_get_features(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct nfs_server *server = NFS_SB(path->dentry->d_sb);
+	struct fsinfo_features *ft = ctx->buffer;
+
+	fsinfo_set_feature(ft, FSINFO_FEAT_IS_NETWORK_FS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_AUTOMOUNTS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_O_SYNC);
+	fsinfo_set_feature(ft, FSINFO_FEAT_O_DIRECT);
+	fsinfo_set_feature(ft, FSINFO_FEAT_ADV_LOCKS);
+	fsinfo_set_feature(ft, FSINFO_FEAT_DEVICE_FILES);
+	fsinfo_set_feature(ft, FSINFO_FEAT_UNIX_SPECIALS);
+	if (server->nfs_client->rpc_ops->version == 4) {
+		fsinfo_set_feature(ft, FSINFO_FEAT_LEASES);
+		fsinfo_set_feature(ft, FSINFO_FEAT_IVER_ALL_CHANGE);
+	}
+
+	if (server->caps & NFS_CAP_OWNER)
+		fsinfo_set_feature(ft, FSINFO_FEAT_UIDS);
+	if (server->caps & NFS_CAP_OWNER_GROUP)
+		fsinfo_set_feature(ft, FSINFO_FEAT_GIDS);
+	if (!(server->caps & NFS_CAP_MODE))
+		fsinfo_set_feature(ft, FSINFO_FEAT_NO_UNIX_MODE);
+	if (server->caps & NFS_CAP_ACLS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_ACL);
+	if (server->caps & NFS_CAP_SYMLINKS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_SYMLINKS);
+	if (server->caps & NFS_CAP_HARDLINKS)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HARD_LINKS);
+	if (server->caps & NFS_CAP_ATIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_ATIME);
+	if (server->caps & NFS_CAP_CTIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_CTIME);
+	if (server->caps & NFS_CAP_MTIME)
+		fsinfo_set_feature(ft, FSINFO_FEAT_HAS_MTIME);
+
+	if (server->attr_bitmask[0] & FATTR4_WORD0_CASE_INSENSITIVE)
+		fsinfo_set_feature(ft, FSINFO_FEAT_NAME_CASE_INDEP);
+	if ((server->attr_bitmask[0] & FATTR4_WORD0_ARCHIVE) ||
+	    (server->attr_bitmask[0] & FATTR4_WORD0_HIDDEN) ||
+	    (server->attr_bitmask[1] & FATTR4_WORD1_SYSTEM))
+		fsinfo_set_feature(ft, FSINFO_FEAT_WINDOWS_ATTRS);
+
+	return sizeof(*ft);
+}
+
+static const struct fsinfo_attribute nfs_fsinfo_attributes[] = {
+	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	nfs_fsinfo_get_timestamp_info),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		nfs_fsinfo_get_limits),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		nfs_fsinfo_get_supports),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_FEATURES,		nfs_fsinfo_get_features),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_NFS_INFO,		nfs_fsinfo_get_info),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_SERVER_NAME,	nfs_fsinfo_get_server_name),
+	FSINFO_LIST	(FSINFO_ATTR_NFS_SERVER_ADDRESSES, nfs_fsinfo_get_server_addresses),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_GSSAPI_NAME,	nfs_fsinfo_get_gssapi_name),
+	{}
+};
+
+int nfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_get_attribute(path, ctx, nfs_fsinfo_attributes);
+}
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f80c47d5ff27..59e407066b45 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -10,6 +10,7 @@
 #include <linux/sunrpc/addr.h>
 #include <linux/nfs_page.h>
 #include <linux/wait_bit.h>
+#include <linux/fsinfo.h>
 
 #define NFS_SB_MASK (SB_RDONLY|SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
 
@@ -247,6 +248,11 @@ extern const struct svc_version nfs4_callback_version4;
 /* fs_context.c */
 extern struct file_system_type nfs_fs_type;
 
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int nfs_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
 /* pagelist.c */
 extern int __init nfs_init_nfspagecache(void);
 extern void nfs_destroy_nfspagecache(void);
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 1475f932d7da..cd38da87cbd3 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -26,6 +26,9 @@ static const struct super_operations nfs4_sops = {
 	.write_inode	= nfs4_write_inode,
 	.drop_inode	= nfs_drop_inode,
 	.statfs		= nfs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= nfs_fsinfo,
+#endif
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
 	.show_options	= nfs_show_options,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index dada09b391c6..27ac751d3789 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -76,6 +76,9 @@ const struct super_operations nfs_sops = {
 	.write_inode	= nfs_write_inode,
 	.drop_inode	= nfs_drop_inode,
 	.statfs		= nfs_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= nfs_fsinfo,
+#endif
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
 	.show_options	= nfs_show_options,
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 6a8a7a8e4910..d5c4fe681333 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -39,6 +39,11 @@
 
 #define FSINFO_ATTR_EXT4_TIMESTAMPS	0x400	/* Ext4 superblock timestamps */
 
+#define FSINFO_ATTR_NFS_INFO		0x500	/* Information about an NFS mount */
+#define FSINFO_ATTR_NFS_SERVER_NAME	0x501	/* Name of the server (string) */
+#define FSINFO_ATTR_NFS_SERVER_ADDRESSES 0x502	/* List of addresses of the server */
+#define FSINFO_ATTR_NFS_GSSAPI_NAME	0x503	/* GSSAPI acceptor name */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -329,4 +334,28 @@ struct fsinfo_ext4_timestamps {
 
 #define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_NFS_INFO).
+ *
+ * Get information about an NFS mount.
+ */
+struct fsinfo_nfs_info {
+	__u32		version;
+	__u32		minor_version;
+	__u32		transport_proto;
+};
+
+#define FSINFO_ATTR_NFS_INFO__STRUCT struct fsinfo_nfs_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_NFS_SERVER_ADDRESSES).
+ *
+ * Get the addresses of the server for an NFS mount.
+ */
+struct fsinfo_nfs_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+#define FSINFO_ATTR_NFS_SERVER_ADDRESSES__STRUCT struct fsinfo_nfs_server_address
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/include/uapi/linux/windows.h b/include/uapi/linux/windows.h
new file mode 100644
index 000000000000..17efb9a40529
--- /dev/null
+++ b/include/uapi/linux/windows.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Common windows attributes
+ */
+#ifndef _UAPI_LINUX_WINDOWS_H
+#define _UAPI_LINUX_WINDOWS_H
+
+/*
+ * File Attribute flags
+ */
+#define ATTR_READONLY		0x0001
+#define ATTR_HIDDEN		0x0002
+#define ATTR_SYSTEM		0x0004
+#define ATTR_VOLUME		0x0008
+#define ATTR_DIRECTORY		0x0010
+#define ATTR_ARCHIVE		0x0020
+#define ATTR_DEVICE		0x0040
+#define ATTR_NORMAL		0x0080
+#define ATTR_TEMPORARY		0x0100
+#define ATTR_SPARSE		0x0200
+#define ATTR_REPARSE		0x0400
+#define ATTR_COMPRESSED		0x0800
+#define ATTR_OFFLINE		0x1000	/* ie file not immediately available -
+					   on offline storage */
+#define ATTR_NOT_CONTENT_INDEXED 0x2000
+#define ATTR_ENCRYPTED		0x4000
+#define ATTR_POSIX_SEMANTICS	0x01000000
+#define ATTR_BACKUP_SEMANTICS	0x02000000
+#define ATTR_DELETE_ON_CLOSE	0x04000000
+#define ATTR_SEQUENTIAL_SCAN	0x08000000
+#define ATTR_RANDOM_ACCESS	0x10000000
+#define ATTR_NO_BUFFERING	0x20000000
+#define ATTR_WRITE_THROUGH	0x80000000
+
+#endif /* _UAPI_LINUX_WINDOWS_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index df8d2449fc22..87239f0b6a50 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -391,6 +391,40 @@ static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
 	printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
 }
 
+static void dump_nfs_fsinfo_info(void *reply, unsigned int size)
+{
+	struct fsinfo_nfs_info *r = reply;
+
+	printf("ver=%u.%u proto=%u\n", r->version, r->minor_version, r->transport_proto);
+}
+
+static void dump_nfs_fsinfo_server_addresses(void *reply, unsigned int size)
+{
+	struct fsinfo_nfs_server_address *r = reply;
+	struct sockaddr_storage *ss = (struct sockaddr_storage *)&r->address;
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	char buf[1024];
+
+	switch (ss->ss_family) {
+	case AF_INET:
+		sin = (struct sockaddr_in *)ss;
+		if (!inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u %s\n", ntohs(sin->sin_port), buf);
+		return;
+	case AF_INET6:
+		sin6 = (struct sockaddr_in6 *)ss;
+		if (!inet_ntop(AF_INET6, &sin6->sin6_addr, buf, sizeof(buf)))
+			break;
+		printf("%5u %s\n", ntohs(sin6->sin6_port), buf);
+		return;
+	default:
+		printf("family=%u\n", ss->ss_family);
+		return;
+	}
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -422,6 +456,8 @@ static void dump_string(void *reply, unsigned int size)
 #define dump_fsinfo_generic_mount_point		dump_string
 #define dump_afs_cell_name			dump_string
 #define dump_afs_server_name			dump_string
+#define dump_nfs_fsinfo_server_name		dump_string
+#define dump_nfs_fsinfo_gssapi_name		dump_string
 
 /*
  *
@@ -468,6 +504,10 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	afs_server_name),
 	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
 	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_timestamps),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_NFS_INFO,		nfs_fsinfo_info),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_SERVER_NAME,	nfs_fsinfo_server_name),
+	FSINFO_LIST	(FSINFO_ATTR_NFS_SERVER_ADDRESSES, nfs_fsinfo_server_addresses),
+	FSINFO_STRING	(FSINFO_ATTR_NFS_GSSAPI_NAME,	nfs_fsinfo_gssapi_name),
 	{}
 };
 



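
The granularity calculation in nfs_fsinfo_get_timestamp_info() above turns a
time delta in nanoseconds into a decimal mantissa and exponent (in seconds) by
stripping trailing zeros.  A userspace sketch of the same idea, using plain
division in place of do_div(), might look like this; struct gran and
ns_to_granularity() are hypothetical names used only for illustration.

```c
#include <stdint.h>

struct gran {
	unsigned int mantissa;
	int exponent;		/* power of ten, in seconds */
};

/*
 * Express nsec nanoseconds as mantissa * 10^exponent seconds with the
 * smallest possible mantissa.  A zero delta keeps the default of 1 second,
 * as the static table in the patch does.
 */
static struct gran ns_to_granularity(uint64_t nsec)
{
	struct gran g = { 1, 0 };
	int exp = -9;

	if (nsec == 0)
		return g;

	/* Each trailing decimal zero moves the exponent up by one. */
	while (nsec % 10 == 0) {
		nsec /= 10;
		exp++;
	}

	g.mantissa = (unsigned int)nsec;
	g.exponent = exp;
	return g;
}
```

For instance, a delta of 1000000ns (1ms) comes out as mantissa 1, exponent -3.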

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
                   ` (16 preceding siblings ...)
  2020-02-21 18:04 ` [PATCH 17/17] fsinfo: Add example support for NFS " David Howells
@ 2020-02-21 20:21 ` " James Bottomley
  2020-02-24 10:24   ` Miklos Szeredi
  17 siblings, 1 reply; 117+ messages in thread
From: James Bottomley @ 2020-02-21 20:21 UTC (permalink / raw)
  To: David Howells, viro
  Cc: raven, mszeredi, christian, jannh, darrick.wong, linux-api,
	linux-fsdevel, linux-kernel

On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
[...]
> ============================
> FILESYSTEM INFORMATION QUERY
> ============================
> 
> The fsinfo() system call allows information about the filesystem at a
> particular path point to be queried as a set of attributes, some of
> which may have more than one value.
> 
> Attribute values are of four basic types:
> 
>  (1) Version dependent-length structure (size defined by type).
> 
>  (2) Variable-length string (up to 4096, including NUL).
> 
>  (3) List of structures (up to INT_MAX size).
> 
>  (4) Opaque blob (up to INT_MAX size).
> 
> Attributes can have multiple values either as a sequence of values or
> a sequence-of-sequences of values and all the values of a particular
> attribute must be of the same type.
> 
> Note that the values of an attribute *are* allowed to vary between
> dentries within a single superblock, depending on the specific dentry
> that you're looking at, but all the values of an attribute have to be
> of the same type.
> 
> I've tried to make the interface as light as possible, so
> integer/enum attribute selector rather than string and the core does
> all the allocation and extensibility support work rather than leaving
> that to the filesystems. That means that for the first two attribute
> types, the filesystem will always see a sufficiently-sized buffer
> allocated.  Further, this removes the possibility of the filesystem
> gaining access to the userspace buffer.
> 
> 
> fsinfo() allows a variety of information to be retrieved about a
> filesystem and the mount topology:
> 
>  (1) General superblock attributes:
> 
>      - Filesystem identifiers (UUID, volume label, device numbers, ...)
>      - The limits on a filesystem's capabilities
>      - Information on supported statx fields and attributes and IOC flags.
>      - A variety of single-bit flags indicating supported capabilities.
>      - Timestamp resolution and range.
>      - The amount of space/free space in a filesystem (as statfs()).
>      - Superblock notification counter.
> 
>  (2) Filesystem-specific superblock attributes:
> 
>      - Superblock-level timestamps.
>      - Cell name.
>      - Server names and addresses.
>      - Filesystem-specific information.
> 
>  (3) VFS information:
> 
>      - Mount topology information.
>      - Mount attributes.
>      - Mount notification counter.
> 
>  (4) Information about what the fsinfo() syscall itself supports,
>      including the type and struct/element size of attributes.
> 
> The system is extensible:
> 
>  (1) New attributes can be added.  There is no requirement that a
>      filesystem implement every attribute.  Note that the core VFS keeps
>      a table of types and sizes so it can handle future extensibility
>      rather than delegating this to the filesystems.
> 
>  (2) Version length-dependent structure attributes can be made larger
>      and have additional information tacked on the end, provided it
>      keeps the layout of the existing fields.  If an older process asks
>      for a shorter structure, it will only be given the bits it asks
>      for.  If a newer process asks for a longer structure on an older
>      kernel, the extra space will be set to 0.  In all cases, the size
>      of the data actually available is returned.
> 
>      In essence, the size of a structure is that structure's version: a
>      smaller size is an earlier version and a later version includes
>      everything that the earlier version did.
> 
>  (3) New single-bit capability flags can be added.  This is a
>      structure-typed attribute and, as such, (2) applies.  Any bits you
>      wanted but the kernel doesn't support are automatically set to 0.
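
To make sure I'm reading rule (2) correctly, here is how I'd sketch the
size-as-version copy in plain C.  This is only my interpretation of the
description quoted above, not code from the series; copy_versioned() and its
parameter names are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a structure of which the producer knows "have" bytes into a buffer
 * of "want" bytes supplied by the consumer.
 *
 *  - want < have: older consumer, gets only the leading part it asked for.
 *  - want > have: newer consumer, the tail it asked for is zero-filled.
 *
 * The size actually available from the producer is returned in all cases.
 */
static size_t copy_versioned(void *dst, size_t want,
			     const void *src, size_t have)
{
	size_t n = want < have ? want : have;

	memcpy(dst, src, n);
	if (want > have)
		memset((char *)dst + n, 0, want - have);
	return have;
}
```

The return value lets the caller tell which of the fields it asked for were
actually filled in rather than zeroed.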
> 
> fsinfo() may be called like the following, for example:
> 
> 	struct fsinfo_params params = {
> 		.at_flags	= AT_SYMLINK_NOFOLLOW,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> 		.Nth		= 2,
> 	};
> 	struct fsinfo_afs_server_address address;
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &address, sizeof(address));
> 
> The above example would query an AFS filesystem to retrieve the address
> list for the 3rd server, and:
> 
> 	struct fsinfo_params params = {
> 		.at_flags	= AT_SYMLINK_NOFOLLOW,
> 		.flags		= FSINFO_FLAGS_QUERY_PATH,
> 		.request	= FSINFO_ATTR_AFS_CELL_NAME,
> 	};
> 	char cell_name[256];
> 	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> 		     &cell_name, sizeof(cell_name));
> 
> would retrieve the name of an AFS cell as a string.
> 
> In future, I want to make fsinfo() capable of querying a context created
> by fsopen() or fspick(), e.g.:
> 
> 	fd = fsopen("ext4", 0);
> 	struct fsinfo_params params = {
> 		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
> 		.request	= FSINFO_ATTR_PARAMETERS,
> 	};
> 	char buffer[65536];
> 	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
> 
> even if that context doesn't currently have a superblock attached.  I
> would prefer this to contain length-prefixed strings so that there's
> no need to insert escaping, especially as any character, including
> '\', can be used as the separator in cifs and so that binary
> parameters can be returned (though that is a lesser issue).

Could I make a suggestion about how this should be done in a way that
doesn't actually require the fsinfo syscall at all: it could just be
done with fsconfig.  The idea is based on something I've wanted to do
for configfd but couldn't because otherwise it wouldn't substitute for
fsconfig, but Christian made me think it was actually essential to the
ability of the seccomp and other verifier tools in the critique of
configfd and I believe the same critique applies here.

Instead of making fsconfig functionally configure ... as in you pass
the attribute name, type and parameters down into the fs specific
handler and the handler does a string match and then verifies the
parameters and then acts on them, make it table configured, so what
each fstype does is register a table of attributes which can be got and
optionally set (with each attribute having a get and optional set
function).  We'd have multiple tables per fstype, so the generic VFS
can register a table of attributes it understands for every fstype
(things like name, uuid and the like) and then each fs type would
register a table of fs specific attributes following the same pattern. 
The system would examine the fs specific table before the generic one,
allowing overrides.  fsconfig would have the ability to both get and
set attributes, permitting retrieval as well as setting (which is how I
get rid of the fsinfo syscall), we'd have a global parameter, which
would retrieve the entire table by name and type so the whole thing is
introspectable because the upper layer knows a-priori all the
attributes which can be set for a given fs type and what type they are
(so we can make more of the parsing generic).  Any attribute which
doesn't have a set routine would be read only and all attributes would
have to have a get routine meaning everything is queryable.
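
To make the shape of that concrete, here is a rough userspace sketch of
the table idea.  Every name in it (struct fs_attr, attr_lookup() and so
on) is made up for illustration and is not existing kernel API; it only
models the lookup order (fs specific table first, then the generic
one), the mandatory get, the optional set and the introspection dump.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types and names throughout; nothing here is kernel API. */
enum attr_type { ATTR_STRING, ATTR_U64 };

struct fs_attr {
	const char *name;
	enum attr_type type;
	int (*get)(char *buf, size_t len);	/* mandatory: everything is queryable */
	int (*set)(const char *val);		/* NULL means the attribute is read-only */
};

static char label[32] = "scratch";

static int generic_get_name(char *buf, size_t len) { return snprintf(buf, len, "generic"); }
static int ext4_get_name(char *buf, size_t len)    { return snprintf(buf, len, "ext4"); }
static int get_label(char *buf, size_t len)        { return snprintf(buf, len, "%s", label); }
static int set_label(const char *val)
{
	snprintf(label, sizeof(label), "%s", val);
	return 0;
}

/* Tables end with a NULL name. */
static const struct fs_attr generic_attrs[] = {
	{ "name",  ATTR_STRING, generic_get_name, NULL },
	{ "label", ATTR_STRING, get_label, set_label },
	{ NULL, ATTR_STRING, NULL, NULL },
};

static const struct fs_attr ext4_attrs[] = {
	{ "name", ATTR_STRING, ext4_get_name, NULL },	/* overrides the generic entry */
	{ NULL, ATTR_STRING, NULL, NULL },
};

static const struct fs_attr *find_in(const struct fs_attr *t, const char *name)
{
	for (; t->name; t++)
		if (strcmp(t->name, name) == 0)
			return t;
	return NULL;
}

/* The fs specific table is searched first so that it can override. */
const struct fs_attr *attr_lookup(const struct fs_attr *fs_table, const char *name)
{
	const struct fs_attr *a = find_in(fs_table, name);

	return a ? a : find_in(generic_attrs, name);
}

int attr_get(const struct fs_attr *fs_table, const char *name, char *buf, size_t len)
{
	const struct fs_attr *a = attr_lookup(fs_table, name);

	return a ? a->get(buf, len) : -ENOENT;
}

int attr_set(const struct fs_attr *fs_table, const char *name, const char *val)
{
	const struct fs_attr *a = attr_lookup(fs_table, name);

	if (!a)
		return -ENOENT;
	return a->set ? a->set(val) : -EPERM;	/* no set routine: read-only */
}

/* Introspection: print every attribute visible for this fs type. */
void attr_dump(const struct fs_attr *fs_table)
{
	const struct fs_attr *a;

	for (a = fs_table; a->name; a++)
		printf("%s %s\n", a->name, a->set ? "rw" : "ro");
	for (a = generic_attrs; a->name; a++)
		if (!find_in(fs_table, a->name))
			printf("%s %s\n", a->name, a->set ? "rw" : "ro");
}

void example_attr_tables(void)
{
	char buf[32];

	assert(attr_get(ext4_attrs, "name", buf, sizeof(buf)) > 0);
	assert(strcmp(buf, "ext4") == 0);		/* fs specific entry wins */
	assert(attr_set(ext4_attrs, "name", "xfs") == -EPERM);
	assert(attr_set(ext4_attrs, "label", "data") == 0);
}
```

Because the upper layer can walk the tables itself, a verifier could in
principle see the full set of names and types without calling into the
filesystem, which is the property the critique was asking for.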

I think I know how to code this up in a way that would be fully
transparent to the existing syscalls.

James


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-21 20:21 ` [PATCH 00/17] VFS: Filesystem information and notifications " James Bottomley
@ 2020-02-24 10:24   ` Miklos Szeredi
  2020-02-24 14:55     ` James Bottomley
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-24 10:24 UTC (permalink / raw)
  To: James Bottomley
  Cc: David Howells, viro, Ian Kent, christian, Jann Horn,
	darrick.wong, Linux API, linux-fsdevel, lkml

On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
> [...]
> > ============================
> > FILESYSTEM INFORMATION QUERY
> > ============================
> >
> > The fsinfo() system call allows information about the filesystem at a
> > particular path point to be queried as a set of attributes, some of
> > which may have more than one value.
> >
> > Attribute values are of four basic types:
> >
> >  (1) Version dependent-length structure (size defined by type).
> >
> >  (2) Variable-length string (up to 4096, including NUL).
> >
> >  (3) List of structures (up to INT_MAX size).
> >
> >  (4) Opaque blob (up to INT_MAX size).
> >
> > Attributes can have multiple values either as a sequence of values or
> > a sequence-of-sequences of values and all the values of a particular
> > attribute must be of the same type.
> >
> > Note that the values of an attribute *are* allowed to vary between
> > dentries within a single superblock, depending on the specific dentry
> > that you're looking at, but all the values of an attribute have to be
> > of the same type.
> >
> > I've tried to make the interface as light as possible, so
> > integer/enum attribute selector rather than string and the core does
> > all the allocation and extensibility support work rather than leaving
> > that to the filesystems. That means that for the first two attribute
> > types, the filesystem will always see a sufficiently-sized buffer
> > allocated.  Further, this removes the possibility of the filesystem
> > gaining access to the userspace buffer.
> >
> >
> > fsinfo() allows a variety of information to be retrieved about a
> > filesystem and the mount topology:
> >
> >  (1) General superblock attributes:
> >
> >      - Filesystem identifiers (UUID, volume label, device numbers, ...)
> >      - The limits on a filesystem's capabilities
> >      - Information on supported statx fields and attributes and IOC
> >        flags.
> >      - A variety of single-bit flags indicating supported capabilities.
> >      - Timestamp resolution and range.
> >      - The amount of space/free space in a filesystem (as statfs()).
> >      - Superblock notification counter.
> >
> >  (2) Filesystem-specific superblock attributes:
> >
> >      - Superblock-level timestamps.
> >      - Cell name.
> >      - Server names and addresses.
> >      - Filesystem-specific information.
> >
> >  (3) VFS information:
> >
> >      - Mount topology information.
> >      - Mount attributes.
> >      - Mount notification counter.
> >
> >  (4) Information about what the fsinfo() syscall itself supports,
> >      including the type and struct/element size of attributes.
> >
> > The system is extensible:
> >
> >  (1) New attributes can be added.  There is no requirement that a
> >      filesystem implement every attribute.  Note that the core VFS
> >      keeps a table of types and sizes so it can handle future
> >      extensibility rather than delegating this to the filesystems.
> >
> >  (2) Version length-dependent structure attributes can be made larger
> >      and have additional information tacked on the end, provided it
> >      keeps the layout of the existing fields.  If an older process
> >      asks for a shorter structure, it will only be given the bits it
> >      asks for.  If a newer process asks for a longer structure on an
> >      older kernel, the extra space will be set to 0.  In all cases,
> >      the size of the data actually available is returned.
> >
> >      In essence, the size of a structure is that structure's version:
> >      a smaller size is an earlier version and a later version includes
> >      everything that the earlier version did.
> >
> >  (3) New single-bit capability flags can be added.  This is a
> >      structure-typed attribute and, as such, (2) applies.  Any bits
> >      you wanted but the kernel doesn't support are automatically set
> >      to 0.
> >
> > fsinfo() may be called like the following, for example:
> >
> >       struct fsinfo_params params = {
> >               .at_flags       = AT_SYMLINK_NOFOLLOW,
> >               .flags          = FSINFO_FLAGS_QUERY_PATH,
> >               .request        = FSINFO_ATTR_AFS_SERVER_ADDRESSES,
> >               .Nth            = 2,
> >       };
> >       struct fsinfo_server_address address;
> >       len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> >                    &address, sizeof(address));
> >
> > The above example would query an AFS filesystem to retrieve the
> > address list for the 3rd server, and:
> >
> >       struct fsinfo_params params = {
> >               .at_flags       = AT_SYMLINK_NOFOLLOW,
> >               .flags          = FSINFO_FLAGS_QUERY_PATH,
> >               .request        = FSINFO_ATTR_AFS_CELL_NAME,
> >       };
> >       char cell_name[256];
> >       len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
> >                    &cell_name, sizeof(cell_name));
> >
> > would retrieve the name of an AFS cell as a string.
> >
> > In future, I want to make fsinfo() capable of querying a context
> > created by fsopen() or fspick(), e.g.:
> >
> >       fd = fsopen("ext4", 0);
> >       struct fsinfo_params params = {
> >               .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> >               .request        = FSINFO_ATTR_PARAMETERS,
> >       };
> >       char buffer[65536];
> >       fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));
> >
> > even if that context doesn't currently have a superblock attached.  I
> > would prefer this to contain length-prefixed strings so that there's
> > no need to insert escaping, especially as any character, including
> > '\', can be used as the separator in cifs and so that binary
> > parameters can be returned (though that is a lesser issue).
>
> Could I make a suggestion about how this should be done in a way that
> doesn't actually require the fsinfo syscall at all: it could just be
> done with fsconfig.  The idea is based on something I've wanted to do
> for configfd but couldn't because otherwise it wouldn't substitute for
> fsconfig, but Christian made me think it was actually essential to the
> ability of the seccomp and other verifier tools in the critique of
> configfd and I believe the same critique applies here.
>
> Instead of making fsconfig functionally configure ... as in you pass
> the attribute name, type and parameters down into the fs specific
> handler and the handler does a string match and then verifies the
> parameters and then acts on them, make it table configured, so what
> each fstype does is register a table of attributes which can be got and
> optionally set (with each attribute having a get and optional set
> function).  We'd have multiple tables per fstype, so the generic VFS
> can register a table of attributes it understands for every fstype
> (things like name, uuid and the like) and then each fs type would
> register a table of fs specific attributes following the same pattern.
> The system would examine the fs specific table before the generic one,
> allowing overrides.  fsconfig would have the ability to both get and
> set attributes, permitting retrieval as well as setting (which is how I
> get rid of the fsinfo syscall), we'd have a global parameter, which
> would retrieve the entire table by name and type so the whole thing is
> introspectable because the upper layer knows a-priori all the
> attributes which can be set for a given fs type and what type they are
> (so we can make more of the parsing generic).  Any attribute which
> doesn't have a set routine would be read only and all attributes would
> have to have a get routine meaning everything is queryable.

And that makes me wonder: would a
"/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible for
this?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-24 10:24   ` Miklos Szeredi
@ 2020-02-24 14:55     ` James Bottomley
  2020-02-24 15:28       ` Miklos Szeredi
  0 siblings, 1 reply; 117+ messages in thread
From: James Bottomley @ 2020-02-24 14:55 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, viro, Ian Kent, christian, Jann Horn,
	darrick.wong, Linux API, linux-fsdevel, lkml

On Mon, 2020-02-24 at 11:24 +0100, Miklos Szeredi wrote:
> On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
[...]
> > Could I make a suggestion about how this should be done in a way
> > that doesn't actually require the fsinfo syscall at all: it could
> > just be done with fsconfig.  The idea is based on something I've
> > wanted to do for configfd but couldn't because otherwise it
> > wouldn't substitute for fsconfig, but Christian made me think it
> > was actually essential to the ability of the seccomp and other
> > verifier tools in the critique of configfd and I believe the same
> > critique applies here.
> > 
> > Instead of making fsconfig functionally configure ... as in you
> > pass the attribute name, type and parameters down into the fs
> > specific handler and the handler does a string match and then
> > verifies the parameters and then acts on them, make it table
> > configured, so what each fstype does is register a table of
> > attributes which can be got and optionally set (with each attribute
> > having a get and optional set function).  We'd have multiple tables
> > per fstype, so the generic VFS can register a table of attributes
> > it understands for every fstype (things like name, uuid and the
> > like) and then each fs type would register a table of fs specific
> > attributes following the same pattern. The system would examine the
> > fs specific table before the generic one, allowing
> > overrides.  fsconfig would have the ability to both get and
> > set attributes, permitting retrieval as well as setting (which is
> > how I get rid of the fsinfo syscall), we'd have a global parameter,
> > which would retrieve the entire table by name and type so the whole
> > thing is introspectable because the upper layer knows a-priori all
> > the attributes which can be set for a given fs type and what type
> > they are (so we can make more of the parsing generic).  Any
> > attribute which doesn't have a set routine would be read only and
> > all attributes would have to have a get routine meaning everything
> > is queryable.
> 
> And that makes me wonder: would a
> "/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible
> for this?

Once it's table driven, certainly a sysfs directory becomes possible. 
The problem with ST_DEV is filesystems like btrfs and xfs that may have
multiple devices.  The current fsinfo takes a fspick'd directory fd so
the input to the query is a path, which gets messy in sysfs, although I
could see something like /sys/class/fs/mount/<path>/$OPTION working.

James


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-24 14:55     ` James Bottomley
@ 2020-02-24 15:28       ` Miklos Szeredi
  2020-02-25 12:13         ` Steven Whitehouse
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-24 15:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Miklos Szeredi, David Howells, viro, Ian Kent, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> Once it's table driven, certainly a sysfs directory becomes possible.
> The problem with ST_DEV is filesystems like btrfs and xfs that may have
> multiple devices.

For XFS there's always a single sb->s_dev though, that's what st_dev
will be set to on all files.

Btrfs subvolume is sort of a lightweight superblock, so basically all
such st_dev's are aliases of the same master superblock.  So lookup of
all subvolume st_dev's could result in referencing the same underlying
struct super_block (just like /proc/$PID will reference the same
underlying task group regardless of which of the task group member's
PID is used).

Having this info in sysfs would spare us a number of issues that a set
of new syscalls would bring.  The question is, would that be enough,
or is there a reason that sysfs can't be used to present the various
filesystem related information that fsinfo is supposed to present?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-24 15:28       ` Miklos Szeredi
@ 2020-02-25 12:13         ` Steven Whitehouse
  2020-02-25 15:28           ` James Bottomley
  0 siblings, 1 reply; 117+ messages in thread
From: Steven Whitehouse @ 2020-02-25 12:13 UTC (permalink / raw)
  To: Miklos Szeredi, James Bottomley
  Cc: Miklos Szeredi, David Howells, viro, Ian Kent, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

Hi,

On 24/02/2020 15:28, Miklos Szeredi wrote:
> On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>
>> Once it's table driven, certainly a sysfs directory becomes possible.
>> The problem with ST_DEV is filesystems like btrfs and xfs that may have
>> multiple devices.
> For XFS there's always  a single sb->s_dev though, that's what st_dev
> will be set to on all files.
>
> Btrfs subvolume is sort of a lightweight superblock, so basically all
> such st_dev's are aliases of the same master superblock.  So lookup of
> all subvolume st_dev's could result in referencing the same underlying
> struct super_block (just like /proc/$PID will reference the same
> underlying task group regardless of which of the task group member's
> PID is used).
>
> Having this info in sysfs would spare us a number of issues that a set
> of new syscalls would bring.  The question is, would that be enough,
> or is there a reason that sysfs can't be used to present the various
> filesystem related information that fsinfo is supposed to present?
>
> Thanks,
> Miklos
>
We need a unique id for superblocks anyway. I had wondered about using 
s_dev some time back, but for the reasons mentioned earlier in this 
thread I think it might just land up being confusing and difficult to 
manage. While fake s_devs are created for sbs that don't have a device, 
I can't help thinking that something closer to ifindex, but for 
superblocks, is needed here. That would avoid the issue of which device 
number to use.

In fact we need that anyway for the notifications, since without that 
there is a race that can lead to missing remounts of the same device, in 
case a umount/mount pair is missed due to an overrun, and then fsinfo 
returns the same device as before, with potentially the same mount 
options too. So I think a unique id for a superblock is a generically 
useful feature, which would also allow for sensible sysfs directory 
naming, if required,

Steve.



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-25 12:13         ` Steven Whitehouse
@ 2020-02-25 15:28           ` James Bottomley
  2020-02-25 15:47             ` Steven Whitehouse
                               ` (3 more replies)
  0 siblings, 4 replies; 117+ messages in thread
From: James Bottomley @ 2020-02-25 15:28 UTC (permalink / raw)
  To: Steven Whitehouse, Miklos Szeredi
  Cc: Miklos Szeredi, David Howells, viro, Ian Kent, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
> Hi,
> 
> On 24/02/2020 15:28, Miklos Szeredi wrote:
> > On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > > Once it's table driven, certainly a sysfs directory becomes
> > > possible. The problem with ST_DEV is filesystems like btrfs and
> > > xfs that may have multiple devices.
> > 
> > For XFS there's always  a single sb->s_dev though, that's what
> > st_dev will be set to on all files.
> > 
> > Btrfs subvolume is sort of a lightweight superblock, so basically
> > all such st_dev's are aliases of the same master superblock.  So
> > lookup of all subvolume st_dev's could result in referencing the
> > same underlying struct super_block (just like /proc/$PID will
> > reference the same underlying task group regardless of which of the
> > task group member's PID is used).
> > 
> > Having this info in sysfs would spare us a number of issues that a
> > set of new syscalls would bring.  The question is, would that be
> > enough, or is there a reason that sysfs can't be used to present
> > the various filesystem related information that fsinfo is supposed
> > to present?
> > 
> > Thanks,
> > Miklos
> > 
> 
> We need a unique id for superblocks anyway. I had wondered about
> using s_dev some time back, but for the reasons mentioned earlier in
> this thread I think it might just land up being confusing and
> difficult to manage. While fake s_devs are created for sbs that don't
> have a device, I can't help thinking that something closer to
> ifindex, but for superblocks, is needed here. That would avoid the
> issue of which device number to use.
> 
> In fact we need that anyway for the notifications, since without
> that  there is a race that can lead to missing remounts of the same
> device, in  case a umount/mount pair is missed due to an overrun, and
> then fsinfo returns the same device as before, with potentially the
> same mount options too. So I think a unique id for a superblock is a
> generically useful feature, which would also allow for sensible sysfs
> directory naming, if required,

But would this be informative and useful for the user?  I'm sure we can
find a persistent id for a persistent superblock, but what about tmpfs
... that's going to have to change with every reboot.  It's going to be
remarkably inconvenient if I want to get fsinfo on /run to have to keep
finding what the id is.

The other thing a file descriptor does that sysfs doesn't is that it
solves the information leak: if I'm in a mount namespace that has no
access to certain mounts, I can't fspick them and thus I can't see the
information.  By default, with sysfs I can.

James


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-25 15:28           ` James Bottomley
@ 2020-02-25 15:47             ` Steven Whitehouse
  2020-02-26  9:11             ` Miklos Szeredi
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 117+ messages in thread
From: Steven Whitehouse @ 2020-02-25 15:47 UTC (permalink / raw)
  To: James Bottomley, Miklos Szeredi
  Cc: Miklos Szeredi, David Howells, viro, Ian Kent, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

Hi,

On 25/02/2020 15:28, James Bottomley wrote:
> On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
>> Hi,
>>
>> On 24/02/2020 15:28, Miklos Szeredi wrote:
>>> On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
>>> <James.Bottomley@hansenpartnership.com> wrote:
>>>
>>>> Once it's table driven, certainly a sysfs directory becomes
>>>> possible. The problem with ST_DEV is filesystems like btrfs and
>>>> xfs that may have multiple devices.
>>> For XFS there's always  a single sb->s_dev though, that's what
>>> st_dev will be set to on all files.
>>>
>>> Btrfs subvolume is sort of a lightweight superblock, so basically
>>> all such st_dev's are aliases of the same master superblock.  So
>>> lookup of all subvolume st_dev's could result in referencing the
>>> same underlying struct super_block (just like /proc/$PID will
>>> reference the same underlying task group regardless of which of the
>>> task group member's PID is used).
>>>
>>> Having this info in sysfs would spare us a number of issues that a
>>> set of new syscalls would bring.  The question is, would that be
>>> enough, or is there a reason that sysfs can't be used to present
>>> the various filesystem related information that fsinfo is supposed
>>> to present?
>>>
>>> Thanks,
>>> Miklos
>>>
>> We need a unique id for superblocks anyway. I had wondered about
>> using s_dev some time back, but for the reasons mentioned earlier in
>> this thread I think it might just land up being confusing and
>> difficult to manage. While fake s_devs are created for sbs that don't
>> have a device, I can't help thinking that something closer to
>> ifindex, but for superblocks, is needed here. That would avoid the
>> issue of which device number to use.
>>
>> In fact we need that anyway for the notifications, since without
>> that  there is a race that can lead to missing remounts of the same
>> device, in  case a umount/mount pair is missed due to an overrun, and
>> then fsinfo returns the same device as before, with potentially the
>> same mount options too. So I think a unique id for a superblock is a
>> generically useful feature, which would also allow for sensible sysfs
>> directory naming, if required,
> But would this be informative and useful for the user?  I'm sure we can
> find a persistent id for a persistent superblock, but what about tmpfs
> ... that's going to have to change with every reboot.  It's going to be
> remarkably inconvenient if I want to get fsinfo on /run to have to keep
> finding what the id is.

That is a different question though, or at least it might be... the idea 
of the superblock id is to uniquely identify a particular superblock. 
The mount notification should give you the association between that 
superblock and any devices (assuming those are applicable), or you can 
use fsinfo if you were not listening to the notifications at the time of 
the mount to get the same information.

If someone unmounts /run and remounts it, then the superblock id would 
change, but otherwise it would stay the same, so you know that it is the 
same mount that is being described in future notifications. One of the 
main aims here being to combine the fsinfo information with the 
notifications in a race free manner.

There are a number of ways one might want to specify a filesystem: by 
device, by uuid, by volume label and so forth but we can't use any of 
those very easily as a unique id. Someone might remove a drive and 
replace it with a different one (so same device, but different content) 
or they might have two filesystems with the same uuid if they've just 
done a dd copy to a new device. For the mount notifications we need 
something that doesn't suffer from these issues, but which can also be 
very easily associated with what in most cases are more convenient ways 
to specify a particular filesystem.
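
As a rough illustration of the property I mean, here is a small
userspace model.  The struct and function names are invented for the
example and do not correspond to anything in the patch set; the only
point is that the id comes from a counter that is never reused, so a
cached id stops matching once the superblock has been replaced, even
when the device number is the same.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; names are illustrative, not kernel API. */
struct sb_model {
	uint32_t dev;		/* device number, may be reused after umount */
	uint64_t unique_id;	/* never reused while the system is up */
};

static uint64_t next_sb_id = 1;

/* "Mount": create a superblock on a device and hand out a fresh id. */
struct sb_model sb_mount(uint32_t dev)
{
	struct sb_model sb = { dev, next_sb_id++ };

	return sb;
}

/*
 * A watcher caches what it last saw.  Comparing ids detects a remount
 * even if the umount/mount notifications were lost to an overrun.
 */
int sb_was_replaced(const struct sb_model *cached, const struct sb_model *now)
{
	return cached->unique_id != now->unique_id;
}

void example_sb_id(void)
{
	struct sb_model before = sb_mount(0x801);
	struct sb_model after = sb_mount(0x801);	/* same device, remounted */

	assert(before.dev == after.dev);		/* the device cannot tell them apart */
	assert(sb_was_replaced(&before, &after));	/* the id can */
}
```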


>
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.
>
> James
>
Yes, that's true, and I wasn't advocating for the sysfs method over
fspick here, just pointing out that a unique superblock id would be a 
generically useful thing to have,

Steve.



^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information [ver #17]
  2020-02-21 18:02 ` [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
@ 2020-02-26  2:29   ` Aleksa Sarai
  2020-02-28 14:44   ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: Aleksa Sarai @ 2020-02-26  2:29 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, mszeredi, christian, jannh, darrick.wong, linux-api,
	linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 64914 bytes --]

On 2020-02-21, David Howells <dhowells@redhat.com> wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
> 
> ===============
> NEW SYSTEM CALL
> ===============
> 
> The new system call looks like:
> 
> 	int ret = fsinfo(int dfd,
> 			 const char *filename,
> 			 const struct fsinfo_params *params,
> 			 void *buffer,
> 			 size_t buf_size);
> 
> The params parameter optionally points to a block of parameters:
> 
> 	struct fsinfo_params {
> 		__u32	at_flags;
> 		__u32	flags;
> 		__u32	request;
> 		__u32	Nth;
> 		__u32	Mth;
> 		__u64	__reserved[3];
> 	};
> 
> If params is NULL, it is assumed params->request should be
> FSINFO_ATTR_STATFS, params->Nth should be 0, params->Mth should be 0,
> params->at_flags should be 0 and params->flags should be 0.
> 
> If params is given, all of params->__reserved[] must be 0.

I would suggest that rather than having a reserved field for future
extensions, you make use of copy_struct_from_user() and have extensible
structs:

	int ret = fsinfo(int dfd,
			 const char *filename,
			 struct fsinfo_params *params,
			 size_t params_usize,
			 void *buffer,
			 size_t buf_usize);

 	struct fsinfo_params {
 		__u64	flags;
 		__u32	at_flags;
 		__u32	request;
 		__u32	Nth;
 		__u32	Mth;
 	};

I dropped the "const" on fsinfo_params because the planned CHECK_FIELDS
feature for extensible-struct syscalls requires writing to the struct. I
also switched the flags field to u64 because CHECK_FIELDS is intended to
use (1<<63) for all syscalls (this has the nice benefit of removing the
need for a padding field entirely).
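
For anyone not familiar with the extensible-struct rules, this is a
userspace model of what copy_struct_from_user() does with the two
sizes.  It is only a sketch: model_copy_struct() and the _v1/_v2 struct
names are made up here, and the real helper also has to deal with
faults on the user pointer, which plain memcpy() cannot model.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* The layout suggested above, treated as version 1 (24 bytes, no padding). */
struct fsinfo_params_v1 {
	uint64_t flags;
	uint32_t at_flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/* A hypothetical later version that tacks one field on the end. */
struct fsinfo_params_v2 {
	struct fsinfo_params_v1 v1;
	uint64_t new_field;
};

/*
 * Model of the size rules:
 *  - usize < ksize: old caller; the fields it does not know about read as 0.
 *  - usize > ksize: new caller; accepted only if the unknown tail is all
 *    zero, otherwise -E2BIG.
 */
int model_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	if (usize < ksize) {
		memset((unsigned char *)dst + size, 0, ksize - size);
	} else {
		for (i = 0; i < usize - size; i++)
			if (tail[i])
				return -E2BIG;
	}
	memcpy(dst, src, size);
	return 0;
}

void example_copy_struct(void)
{
	struct fsinfo_params_v1 k;
	struct fsinfo_params_v2 u;

	memset(&u, 0, sizeof(u));
	u.v1.request = 7;
	assert(model_copy_struct(&k, sizeof(k), &u, sizeof(u)) == 0);

	u.new_field = 1;	/* asks for something this "kernel" lacks */
	assert(model_copy_struct(&k, sizeof(k), &u, sizeof(u)) == -E2BIG);
}
```

With this scheme the struct size plays the role that __reserved[] plays
in the posted version, without fixing the amount of room for growth up
front.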

> dfd, filename and params->at_flags indicate the file to query.  There is no
> equivalent of lstat() as that can be emulated with fsinfo() by setting
> AT_SYMLINK_NOFOLLOW in params->at_flags.

Minor gripe -- can we make the default be AT_SYMLINK_NOFOLLOW and you
need to explicitly pass AT_SYMLINK_FOLLOW? Accidentally following
symlinks is a constant source of security bugs.

> There is also no equivalent of fstat() as that can be emulated by
> passing a NULL filename to fsinfo() with the fd of interest in dfd.

Presumably you also need to pass AT_EMPTY_PATH?
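
For comparison, this is how the fd-only case looks with the existing
stat family, where the flag is required and the path is an empty string
rather than NULL.  It is just a sketch of the existing convention, not
of fsinfo() itself; stat_by_fd() is a made-up wrapper name, and the
fallback #define is the Linux value in case _GNU_SOURCE is defined too
late to take effect.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux value, used only if the header hid it */
#endif

/* Query the object the fd refers to: empty path plus AT_EMPTY_PATH. */
int stat_by_fd(int fd, struct stat *st)
{
	return fstatat(fd, "", st, AT_EMPTY_PATH);
}

void example_stat_by_fd(void)
{
	struct stat a, b;
	int fd = open("/", O_RDONLY);

	assert(fd >= 0);
	assert(stat_by_fd(fd, &a) == 0);
	assert(fstat(fd, &b) == 0);
	assert(a.st_dev == b.st_dev && a.st_ino == b.st_ino);
	close(fd);
}
```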

> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	FSINFO_ATTR_STATFS		- statfs-style info
> 	FSINFO_ATTR_IDS			- Filesystem IDs
> 	FSINFO_ATTR_LIMITS		- Filesystem limits
> 	FSINFO_ATTR_SUPPORTS		- What's supported in statx(), IOC flags
> 	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
> 	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
> 	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
> 	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
> 	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
> 	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs
> 
> Some attributes (such as the servers backing a network filesystem) can have
> multiple values.  These can be enumerated by setting params->Nth and
> params->Mth to 0, 1, ... until ENODATA is returned.
> 
> buffer and buf_size point to the reply buffer.  The buffer is filled up to
> the specified size, even if this means truncating the reply.  The full size
> of the reply is returned.  In future versions, this will allow extra fields
> to be tacked on to the end of the reply, but anyone not expecting them will
> only get the subset they're expecting.  If either buffer or buf_size is 0,
> no copy will take place and the data size will be returned.

Sounds good, though I think we should zero-fill the tail end of the
buffer (if the buffer is larger than the in-kernel one). This is
basically what a theoretical copy_struct_to_user() would do. It will
also ensure that CHECK_FIELDS will act consistently on a syscall that
has two extensible struct arguments.
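
Roughly, the behaviour I have in mind for the reply side looks like the
model below.  copy_struct_to_user() does not exist, so this is only a
userspace sketch with an invented name (model_copy_reply()) showing the
three cases: truncation, zero-filled tail, and a pure size query.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy a reply of reply_size bytes into a caller buffer of buf_size bytes.
 * The full reply size is always returned so the caller can see truncation.
 */
size_t model_copy_reply(void *buf, size_t buf_size,
			const void *reply, size_t reply_size)
{
	size_t n = reply_size < buf_size ? reply_size : buf_size;

	if (!buf || !buf_size)
		return reply_size;		/* size query only, nothing copied */

	memcpy(buf, reply, n);			/* truncates if the buffer is short */
	memset((unsigned char *)buf + n, 0, buf_size - n);	/* zero the tail */
	return reply_size;
}

void example_reply_copy(void)
{
	const unsigned char reply[4] = { 1, 2, 3, 4 };
	unsigned char big[8];

	memset(big, 0xff, sizeof(big));
	assert(model_copy_reply(big, sizeof(big), reply, sizeof(reply)) == 4);
	assert(big[3] == 4 && big[4] == 0 && big[7] == 0);
}
```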

> At the moment, this will only work on x86_64 and i386 as it requires the
> system call to be wired up.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
> 
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 
>  arch/arm/tools/syscall.tbl                  |    1 
>  arch/arm64/include/asm/unistd.h             |    2 
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
>  fs/Kconfig                                  |    7 
>  fs/Makefile                                 |    1 
>  fs/fsinfo.c                                 |  566 +++++++++++++++++++++++++
>  include/linux/fs.h                          |    4 
>  include/linux/fsinfo.h                      |   72 +++
>  include/linux/syscalls.h                    |    4 
>  include/uapi/asm-generic/unistd.h           |    4 
>  include/uapi/linux/fsinfo.h                 |  187 ++++++++
>  kernel/sys_ni.c                             |    1 
>  samples/vfs/Makefile                        |    5 
>  samples/vfs/test-fsinfo.c                   |  607 +++++++++++++++++++++++++++
>  28 files changed, 1474 insertions(+), 2 deletions(-)
>  create mode 100644 fs/fsinfo.c
>  create mode 100644 include/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/fsinfo.h
>  create mode 100644 samples/vfs/test-fsinfo.c
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 7c0115af9010..4d0b07dde12d 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -479,3 +479,4 @@
>  548	common	pidfd_getfd			sys_pidfd_getfd
>  549	common	watch_mount			sys_watch_mount
>  550	common	watch_sb			sys_watch_sb
> +551	common	fsinfo				sys_fsinfo
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index f256f009a89f..fdda8382b420 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -453,3 +453,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index bc0f923e0e04..388eeb71cff0 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		441
> +#define __NR_compat_syscalls		442
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index a4dafc659647..2316e60e031a 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -360,3 +360,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 893fb4151547..efc2723ca91f 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -439,3 +439,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 54aaf0d40c64..745c0f462fce 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -445,3 +445,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index fd34dd0efed0..499f83562a8c 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -378,3 +378,4 @@
>  438	n32	pidfd_getfd			sys_pidfd_getfd
>  439	n32	watch_mount			sys_watch_mount
>  440	n32	watch_sb			sys_watch_sb
> +441	n32	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index db0f4c0a0a0b..b3188bc3ab3c 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -354,3 +354,4 @@
>  438	n64	pidfd_getfd			sys_pidfd_getfd
>  439	n64	watch_mount			sys_watch_mount
>  440	n64	watch_sb			sys_watch_sb
> +441	n64	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index ce2e1326de8f..1a3e8ed5e538 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -427,3 +427,4 @@
>  438	o32	pidfd_getfd			sys_pidfd_getfd
>  439	o32	watch_mount			sys_watch_mount
>  440	o32	watch_sb			sys_watch_sb
> +441	o32	fsinfo				sys_fsinfo
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 6e4a7c08b64b..2572c215d861 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -437,3 +437,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 08943f3b8206..39d7ac7e918c 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -521,3 +521,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index b3b8529d2b74..ae4cefd3dd1b 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -442,3 +442,4 @@
>  438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount		sys_watch_mount			sys_watch_mount
>  440	common	watch_sb		sys_watch_sb			sys_watch_sb
> +441  common	fsinfo			sys_fsinfo			sys_fsinfo
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 89307a20657c..05945b9aee4b 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -442,3 +442,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 4ff841a00450..b71b34d4b45c 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -485,3 +485,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index e2731d295f88..e118ba9aca4c 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -444,3 +444,4 @@
>  438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
>  439	i386	watch_mount		sys_watch_mount			__ia32_sys_watch_mount
>  440	i386	watch_sb		sys_watch_sb			__ia32_sys_watch_sb
> +441	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f4391176102c..067f247471d0 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -361,6 +361,7 @@
>  438	common	pidfd_getfd		__x64_sys_pidfd_getfd
>  439	common	watch_mount		__x64_sys_watch_mount
>  440	common	watch_sb		__x64_sys_watch_sb
> +441	common	fsinfo			__x64_sys_fsinfo
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 8e7d731ed6cf..e1ec25099d10 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -410,3 +410,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	watch_mount			sys_watch_mount
>  440	common	watch_sb			sys_watch_sb
> +441	common	fsinfo				sys_fsinfo
> diff --git a/fs/Kconfig b/fs/Kconfig
> index fef1365c23a5..01d0d436b3cd 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
>  	  Enable this to perform validation of the parameter description for a
>  	  filesystem when it is registered.
>  
> +config FSINFO
> +	bool "Enable the fsinfo() system call"
> +	help
> +	  Enable the file system information querying system call to allow
> +	  comprehensive information to be retrieved about a filesystem,
> +	  superblock or mount object.
> +
>  if BLOCK
>  
>  config FS_IOMAP
> diff --git a/fs/Makefile b/fs/Makefile
> index 4477757780d0..b6bf2424c7f7 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
>  obj-$(CONFIG_SYSCTL)		+= drop_caches.o
>  
>  obj-$(CONFIG_FHANDLE)		+= fhandle.o
> +obj-$(CONFIG_FSINFO)		+= fsinfo.o
>  obj-y				+= iomap/
>  
>  obj-y				+= quota/
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> new file mode 100644
> index 000000000000..5d3ba3c3a7ad
> --- /dev/null
> +++ b/fs/fsinfo.c
> @@ -0,0 +1,566 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query.
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#include <linux/syscalls.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/statfs.h>
> +#include <linux/security.h>
> +#include <linux/uaccess.h>
> +#include <linux/fsinfo.h>
> +#include <uapi/linux/mount.h>
> +#include "internal.h"
> +
> +/**
> + * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
> + * @s: The string to store (may be NULL)
> + * @ctx: The parameter context
> + */
> +int fsinfo_string(const char *s, struct fsinfo_context *ctx)
> +{
> +	unsigned int len;
> +	char *p = ctx->buffer;
> +	int ret = 0;
> +
> +	if (s) {
> +		len = min_t(size_t, strlen(s), ctx->buf_size - 1);
> +		if (!ctx->want_size_only) {
> +			memcpy(p, s, len);
> +			p[len] = 0;
> +		}
> +		ret = len;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(fsinfo_string);
> +
> +/*
> + * Get basic filesystem stats from statfs.
> + */
> +static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_statfs *p = ctx->buffer;
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0)
> +		return ret;
> +
> +	p->f_blocks.lo	= buf.f_blocks;
> +	p->f_bfree.lo	= buf.f_bfree;
> +	p->f_bavail.lo	= buf.f_bavail;
> +	p->f_files.lo	= buf.f_files;
> +	p->f_ffree.lo	= buf.f_ffree;
> +	p->f_favail.lo	= buf.f_ffree;
> +	p->f_bsize	= buf.f_bsize;
> +	p->f_frsize	= buf.f_frsize;
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_ids *p = ctx->buffer;
> +	struct super_block *sb;
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0 && ret != -ENOSYS)
> +		return ret;
> +	if (ret == 0)
> +		memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
> +
> +	sb = path->dentry->d_sb;
> +	p->f_fstype	= sb->s_magic;
> +	p->f_dev_major	= MAJOR(sb->s_dev);
> +	p->f_dev_minor	= MINOR(sb->s_dev);
> +	p->f_sb_id	= sb->s_unique_id;
> +	strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
> +	return sizeof(*p);
> +}
> +
> +int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_limits *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	p->max_file_size.hi	= 0;
> +	p->max_file_size.lo	= sb->s_maxbytes;
> +	p->max_ino.hi		= 0;
> +	p->max_ino.lo		= UINT_MAX;
> +	p->max_hard_links	= sb->s_max_links;
> +	p->max_uid		= UINT_MAX;
> +	p->max_gid		= UINT_MAX;
> +	p->max_projid		= UINT_MAX;
> +	p->max_filename_len	= NAME_MAX;
> +	p->max_symlink_len	= PATH_MAX;
> +	p->max_xattr_name_len	= XATTR_NAME_MAX;
> +	p->max_xattr_body_len	= XATTR_SIZE_MAX;
> +	p->max_dev_major	= 0xffffff;
> +	p->max_dev_minor	= 0xff;
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_limits);
> +
> +int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_supports *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	p->stx_mask = STATX_BASIC_STATS;
> +	if (sb->s_d_op && sb->s_d_op->d_automount)
> +		p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_supports);
> +
> +static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
> +	.atime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.mtime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.ctime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.btime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +};
> +
> +int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_timestamp_info *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +	s8 exponent;
> +
> +	*p = fsinfo_default_timestamp_info;
> +
> +	if (sb->s_time_gran < 1000000000) {
> +		if (sb->s_time_gran < 1000)
> +			exponent = -9;
> +		else if (sb->s_time_gran < 1000000)
> +			exponent = -6;
> +		else
> +			exponent = -3;
> +
> +		p->atime.gran_exponent = exponent;
> +		p->mtime.gran_exponent = exponent;
> +		p->ctime.gran_exponent = exponent;
> +		p->btime.gran_exponent = exponent;
> +	}
> +
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
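
Unless I'm misreading this, gran_mantissa is always left at 1 here, so
only the decade survives. A small userspace model of the mapping as I
understand it (the model_* names are mine, and I'm assuming the
granularity is meant to be read as mantissa * 10^exponent seconds):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the exponent selection in the quoted function (input in ns). */
static int8_t model_gran_exponent(uint32_t time_gran_ns)
{
	if (time_gran_ns >= 1000000000)
		return 0;	/* default entry is left untouched */
	if (time_gran_ns < 1000)
		return -9;
	if (time_gran_ns < 1000000)
		return -6;
	return -3;
}

/*
 * Granularity in nanoseconds implied by mantissa * 10^exponent seconds
 * (assumed interpretation), for exponents in the range [-9, 0].
 */
static uint64_t model_reported_gran_ns(uint16_t mantissa, int8_t exponent)
{
	uint64_t ns = mantissa;
	int i;

	assert(exponent >= -9 && exponent <= 0);
	for (i = 0; i < exponent + 9; i++)
		ns *= 10;
	return ns;
}
```

If that reading is right, an s_time_gran of 100 (100ns) would be
reported as 1ns. Might be worth deriving the mantissa from s_time_gran
as well, or documenting that the generic value is only a lower bound.
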
> +
> +static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_volume_uuid *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	memcpy(p, &sb->s_uuid, sizeof(*p));
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
> +{
> +	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
> +}
> +
> +static const struct fsinfo_attribute fsinfo_common_attributes[] = {
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
> +
> +	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
> +	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
> +	{}
> +};
> +
> +/*
> + * Determine an attribute's minimum buffer size and, if the buffer is large
> + * enough, get the attribute value.
> + */
> +static int fsinfo_get_this_attribute(struct path *path,
> +				     struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attr)
> +{
> +	int buf_size;
> +
> +	if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
> +		return -ENODATA;
> +	if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
> +		return -ENODATA;
> +
> +	switch (attr->type) {
> +	case FSINFO_TYPE_VSTRUCT:
> +		ctx->clear_tail = true;
> +		buf_size = attr->size;
> +		break;
> +	case FSINFO_TYPE_STRING:
> +	case FSINFO_TYPE_OPAQUE:
> +	case FSINFO_TYPE_LIST:
> +		buf_size = 4096;
> +		break;
> +	default:
> +		return -ENOPKG;
> +	}
> +
> +	if (ctx->buf_size < buf_size)
> +		return buf_size;
> +
> +	return attr->get(path, ctx);
> +}
> +
> +static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attr)
> +{
> +	__u32 *p = ctx->buffer;
> +	unsigned int i;
> +
> +	if (ctx->usage >= ctx->buf_size ||
> +	    ctx->buf_size - ctx->usage < sizeof(__u32)) {
> +		ctx->usage += sizeof(__u32);
> +		return;
> +	}
> +
> +	for (i = 0; i < ctx->usage / sizeof(__u32); i++)
> +		if (p[i] == attr->attr_id)
> +			return;
> +
> +	p[i] = attr->attr_id;
> +	ctx->usage += sizeof(__u32);
> +}
> +
> +static int fsinfo_list_attributes(struct path *path,
> +				  struct fsinfo_context *ctx,
> +				  const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +
> +	for (a = attributes; a->get; a++)
> +		fsinfo_attributes_insert(ctx, a);
> +	return -EOPNOTSUPP; /* We want to go through all the lists */
> +}
> +
> +static int fsinfo_get_attribute_info(struct path *path,
> +				     struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +	struct fsinfo_attribute_info *p = ctx->buffer;
> +
> +	if (!ctx->buf_size)
> +		return sizeof(*p);
> +
> +	for (a = attributes; a->get; a++) {
> +		if (a->attr_id == ctx->Nth) {
> +			p->attr_id	= a->attr_id;
> +			p->type		= a->type;
> +			p->flags	= a->flags;
> +			p->size		= a->size;
> +			p->size		= a->size;
> +			return sizeof(*p);
> +		}
> +	}
> +	return -EOPNOTSUPP; /* We want to go through all the lists */
> +}
> +
> +/**
> + * fsinfo_get_attribute - Look up and handle an attribute
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + * @attributes: List of attributes to search.
> + *
> + * Look through a list of attributes for one that matches the requested
> + * attribute then call the handler for it.
> + */
> +int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
> +			 const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +
> +	switch (ctx->requested_attr) {
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
> +		return fsinfo_get_attribute_info(path, ctx, attributes);
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
> +		return fsinfo_list_attributes(path, ctx, attributes);
> +	default:
> +		for (a = attributes; a->get; a++)
> +			if (a->attr_id == ctx->requested_attr)
> +				return fsinfo_get_this_attribute(path, ctx, a);
> +		return -EOPNOTSUPP;
> +	}
> +}
> +EXPORT_SYMBOL(fsinfo_get_attribute);
> +
> +/**
> + * generic_fsinfo - Handle an fsinfo attribute generically
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + */
> +static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
> +{
> +	int ret;
> +
> +	if (path->dentry->d_sb->s_op->fsinfo) {
> +		ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
> +		if (ret != -EOPNOTSUPP)
> +			return ret;
> +	}
> +	ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
> +	if (ret != -EOPNOTSUPP)
> +		return ret;
> +
> +	switch (ctx->requested_attr) {
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
> +		return -ENODATA;
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
> +		return ctx->usage;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +/**
> + * vfs_fsinfo - Retrieve filesystem information
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + *
> + * Get an attribute on a filesystem or an object within a filesystem.  The
> + * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
> + * if it's a multi-valued attribute, the particular value is selected by
> + * @ctx->Nth and then @ctx->Mth.
> + *
> + * For common attributes, a value may be fabricated if it is not supported by
> + * the filesystem.
> + *
> + * On success, the size of the attribute's value is returned (0 is a valid
> + * size).  A buffer will have been allocated and will be pointed to by
> + * @ctx->buffer.  The caller must free this with kvfree().
> + *
> + * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
> + * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
> + * doesn't exist for the specified object or -ENODATA if the attribute exists,
> + * but the Nth,Mth value does not exist.  -EMSGSIZE indicates that the value is
> + * unmanageable internally and -ENOPKG indicates other internal failure.
> + *
> + * Errors such as -EIO may also come from attempts to access media or servers
> + * to obtain the requested information if it's not immediately to hand.
> + *
> + * [*] Note that the caller may set @ctx->want_size_only if it only wants the
> + *     size of the value and not the data.  If this is set, a buffer may not be
> + *     allocated under some circumstances.  This is intended for size query by
> + *     userspace.
> + *
> + * [*] Note that @ctx->clear_tail will be returned set if the data should be
> + *     padded out with zeros when writing it to userspace.
> + */
> +static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct dentry *dentry = path->dentry;
> +	int ret;
> +
> +	ret = security_sb_statfs(dentry);
> +	if (ret)
> +		return ret;
> +
> +	/* Call the handler to find out the buffer size required. */
> +	ctx->buf_size = 0;
> +	ret = fsinfo_call(path, ctx);
> +	if (ret < 0 || ctx->want_size_only)
> +		return ret;
> +	ctx->buf_size = ret;
> +
> +	do {
> +		/* Allocate a buffer of the requested size. */
> +		if (ctx->buf_size > INT_MAX)
> +			return -EMSGSIZE;
> +		ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
> +		if (!ctx->buffer)
> +			return -ENOMEM;
> +
> +		ctx->usage = 0;
> +		ret = fsinfo_call(path, ctx);
> +		if (IS_ERR_VALUE((long)ret))
> +			return ret;
> +		if ((unsigned int)ret <= ctx->buf_size)
> +			return ret; /* It fitted */
> +
> +		/* We need to resize the buffer */
> +		ctx->buf_size = roundup(ret, PAGE_SIZE);
> +		kvfree(ctx->buffer);
> +		ctx->buffer = NULL;
> +	} while (!signal_pending(current));
> +
> +	return -ERESTARTSYS;
> +}
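
Just to check my understanding of the sizing loop: probe with a
zero-sized buffer, allocate, call again, and grow to the next page
multiple if the value got bigger in the meantime. A userspace model
(everything named model_* is invented for illustration, calloc() stands
in for kvzalloc(), and the signal_pending() exit is left out):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical handler: returns the size needed, fills buf only if it fits. */
typedef int (*model_handler_fn)(void *buf, unsigned int buf_size);

/* Returns the value size with *bufp allocated (caller frees), or -errno. */
static int model_fetch(model_handler_fn handler, void **bufp)
{
	unsigned int buf_size;
	void *buf;
	int ret;

	*bufp = NULL;
	ret = handler(NULL, 0);			/* size probe */
	if (ret < 0)
		return ret;
	buf_size = (unsigned int)ret;

	for (;;) {
		buf = calloc(1, buf_size ? buf_size : 1);
		if (!buf)
			return -ENOMEM;
		ret = handler(buf, buf_size);
		if (ret < 0) {
			free(buf);
			return ret;
		}
		if ((unsigned int)ret <= buf_size) {
			*bufp = buf;		/* it fitted */
			return ret;
		}
		free(buf);			/* value grew: round up, retry */
		buf_size = ((unsigned int)ret + MODEL_PAGE_SIZE - 1) /
			   MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;
	}
}

/* Fake handler whose value grows from 10 to 5000 bytes after the probe. */
static unsigned int model_calls;

static int model_growing_handler(void *buf, unsigned int buf_size)
{
	unsigned int need = model_calls++ == 0 ? 10 : 5000;

	if (buf_size >= need)
		memset(buf, 0xaa, need);
	return (int)need;
}
```

In this model a value that grows between the probe and the fetch costs
one extra round trip, which seems fine as long as handlers always
report the full size when the buffer is too small.
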
> +
> +static int vfs_fsinfo_path(int dfd, const char __user *pathname,
> +			   unsigned int at_flags, struct fsinfo_context *ctx)
> +{
> +	struct path path;
> +	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
> +	int ret = -EINVAL;
> +
> +	if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
> +			  AT_EMPTY_PATH)) != 0)
> +		return -EINVAL;
> +
> +	if (at_flags & AT_SYMLINK_NOFOLLOW)
> +		lookup_flags &= ~LOOKUP_FOLLOW;
> +	if (at_flags & AT_NO_AUTOMOUNT)
> +		lookup_flags &= ~LOOKUP_AUTOMOUNT;
> +	if (at_flags & AT_EMPTY_PATH)
> +		lookup_flags |= LOOKUP_EMPTY;
> +
> +retry:
> +	ret = user_path_at(dfd, pathname, lookup_flags, &path);
> +	if (ret)
> +		goto out;
> +
> +	ret = vfs_fsinfo(&path, ctx);
> +	path_put(&path);
> +	if (retry_estale(ret, lookup_flags)) {
> +		lookup_flags |= LOOKUP_REVAL;
> +		goto retry;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
> +{
> +	struct fd f = fdget_raw(fd);
> +	int ret = -EBADF;
> +
> +	if (f.file) {
> +		ret = vfs_fsinfo(&f.file->f_path, ctx);
> +		fdput(f);
> +	}
> +	return ret;
> +}
> +
> +/**
> + * sys_fsinfo - System call to get filesystem information
> + * @dfd: Base directory to pathwalk from or fd referring to filesystem.
> + * @pathname: Filesystem to query or NULL.
> + * @_params: Parameters to define request (or NULL for enhanced statfs).
> + * @user_buffer: Result buffer.
> + * @user_buf_size: Size of result buffer.
> + *
> + * Get information on a filesystem.  The filesystem attribute to be queried is
> + * indicated by @_params->request, and some of the attributes can have multiple
> + * values, indexed by @_params->Nth and @_params->Mth.  If @_params is NULL,
> + * then the 0th fsinfo_attr_statfs attribute is queried.  If an attribute does
> + * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
> + * ENODATA is returned.
> + *
> + * On success, the size of the attribute's value is returned.  If
> + * @user_buf_size is 0 or @user_buffer is NULL, only the size is returned.  If
> + * the size of the value is larger than @user_buf_size, it will be truncated by
> + * the copy.  If the size of the value is smaller than @user_buf_size then the
> + * excess buffer space will be cleared.  The full size of the value will be
> + * returned, irrespective of how much data is actually placed in the buffer.
> + */
> +SYSCALL_DEFINE5(fsinfo,
> +		int, dfd, const char __user *, pathname,
> +		struct fsinfo_params __user *, params,
> +		void __user *, user_buffer, size_t, user_buf_size)
> +{
> +	struct fsinfo_context ctx;
> +	struct fsinfo_params user_params;
> +	unsigned int at_flags = 0, result_size;
> +	int ret;
> +
> +	if (!user_buffer && user_buf_size)
> +		return -EINVAL;
> +	if (user_buffer && !user_buf_size)
> +		return -EINVAL;
> +	if (user_buf_size > UINT_MAX)
> +		return -EOVERFLOW;
> +
> +	memset(&ctx, 0, sizeof(ctx));
> +	ctx.requested_attr = FSINFO_ATTR_STATFS;
> +	if (user_buf_size == 0)
> +		ctx.want_size_only = true;
> +
> +	if (params) {
> +		if (copy_from_user(&user_params, params, sizeof(user_params)))
> +			return -EFAULT;
> +		if (user_params.__reserved32[0] ||
> +		    user_params.__reserved[0] ||
> +		    user_params.__reserved[1] ||
> +		    user_params.__reserved[2] ||
> +		    user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
> +			return -EINVAL;
> +		at_flags = user_params.at_flags;
> +		ctx.flags = user_params.flags;
> +		ctx.requested_attr = user_params.request;
> +		ctx.Nth = user_params.Nth;
> +		ctx.Mth = user_params.Mth;
> +	}
> +
> +	switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
> +	case FSINFO_FLAGS_QUERY_PATH:
> +		ret = vfs_fsinfo_path(dfd, pathname, at_flags, &ctx);
> +		break;
> +	case FSINFO_FLAGS_QUERY_FD:
> +		if (pathname)
> +			return -EINVAL;
> +		ret = vfs_fsinfo_fd(dfd, &ctx);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (ret < 0)
> +		goto error;
> +
> +	result_size = min_t(size_t, ret, user_buf_size);
> +	if (result_size > 0 &&
> +	    copy_to_user(user_buffer, ctx.buffer, result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error;
> +	}
> +
> +	/* Clear any part of the buffer that we won't fill if we're putting a
> +	 * struct in there.  Strings, opaque objects and arrays are expected to
> +	 * be variable length.
> +	 */
> +	if (ctx.clear_tail &&
> +	    user_buf_size > result_size &&
> +	    clear_user(user_buffer + result_size, user_buf_size - result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error;
> +	}
> +
> +error:
> +	kvfree(ctx.buffer);
> +	return ret;
> +}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d5128d112384..d2476c0fc978 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -69,6 +69,7 @@ struct fsverity_info;
>  struct fsverity_operations;
>  struct fs_context;
>  struct fs_parameter_spec;
> +struct fsinfo_context;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1963,6 +1964,9 @@ struct super_operations {
>  	int (*thaw_super) (struct super_block *);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
> +#ifdef CONFIG_FSINFO
> +	int (*fsinfo)(struct path *, struct fsinfo_context *);
> +#endif
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
> new file mode 100644
> index 000000000000..943fbd6640f9
> --- /dev/null
> +++ b/include/linux/fsinfo.h
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#ifndef _LINUX_FSINFO_H
> +#define _LINUX_FSINFO_H
> +
> +#ifdef CONFIG_FSINFO
> +
> +#include <uapi/linux/fsinfo.h>
> +
> +struct path;
> +
> +#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
> +
> +struct fsinfo_context {
> +	__u32		flags;		/* [in] FSINFO_FLAGS_* */
> +	__u32		requested_attr;	/* [in] What is being asked for */
> +	__u32		Nth;		/* [in] Instance of it (some may have multiple) */
> +	__u32		Mth;		/* [in] Subinstance */
> +	bool		want_size_only;	/* [in] Just want to know the size, not the data */
> +	bool		clear_tail;	/* [out] T if tail of buffer should be cleared */
> +	unsigned int	usage;		/* [tmp] Amount of buffer used (if large) */
> +	unsigned int	buf_size;	/* [tmp] Size of ->buffer[] */
> +	void		*buffer;	/* [out] The reply buffer */
> +};
> +
> +/*
> + * A filesystem information attribute definition.
> + */
> +struct fsinfo_attribute {
> +	unsigned int		attr_id;	/* The ID of the attribute */
> +	enum fsinfo_value_type	type:8;		/* The type of the attribute's value(s) */
> +	unsigned int		flags:8;
> +	unsigned int		size:16;	/* - Value size (FSINFO_STRUCT/LIST) */
> +	int (*get)(struct path *path, struct fsinfo_context *params);
> +};
> +
> +#define __FSINFO(A, T, S, G, F) \
> +	{ .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
> +
> +#define _FSINFO(A, T, S, G)	__FSINFO(A, T, S, G, 0)
> +#define _FSINFO_N(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N)
> +#define _FSINFO_NM(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
> +
> +#define _FSINFO_VSTRUCT(A,S,G)	  _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +#define _FSINFO_VSTRUCT_N(A,S,G)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +
> +#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
> +#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G)
> +#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
> +#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
> +
> +extern int fsinfo_string(const char *, struct fsinfo_context *);
> +extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
> +extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
> +				const struct fsinfo_attribute *);
> +
> +#endif /* CONFIG_FSINFO */
> +
> +#endif /* _LINUX_FSINFO_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index c84440d57f52..936e2eb76c8f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -47,6 +47,7 @@ struct stat64;
>  struct statfs;
>  struct statfs64;
>  struct statx;
> +struct fsinfo_params;
>  struct __sysctl_args;
>  struct sysinfo;
>  struct timespec;
> @@ -1007,6 +1008,9 @@ asmlinkage long sys_watch_mount(int dfd, const char __user *path,
>  				unsigned int at_flags, int watch_fd, int watch_id);
>  asmlinkage long sys_watch_sb(int dfd, const char __user *path,
>  			     unsigned int at_flags, int watch_fd, int watch_id);
> +asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
> +			   struct fsinfo_params __user *params,
> +			   void __user *buffer, size_t buf_size);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 5bff318b7ffa..7d764f86d3f5 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  __SYSCALL(__NR_watch_mount, sys_watch_mount)
>  #define __NR_watch_sb 440
>  __SYSCALL(__NR_watch_sb, sys_watch_sb)
> +#define __NR_fsinfo 441
> +__SYSCALL(__NR_fsinfo, sys_fsinfo)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 441
> +#define __NR_syscalls 442
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> new file mode 100644
> index 000000000000..6eb02de8a631
> --- /dev/null
> +++ b/include/uapi/linux/fsinfo.h
> @@ -0,0 +1,187 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* fsinfo() definitions.
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#ifndef _UAPI_LINUX_FSINFO_H
> +#define _UAPI_LINUX_FSINFO_H
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +
> +/*
> + * The filesystem attributes that can be requested.  Note that some attributes
> + * may have multiple instances which can be switched in the parameter block.
> + */
> +#define FSINFO_ATTR_STATFS		0x00	/* statfs()-style state */
> +#define FSINFO_ATTR_IDS			0x01	/* Filesystem IDs */
> +#define FSINFO_ATTR_LIMITS		0x02	/* Filesystem limits */
> +#define FSINFO_ATTR_SUPPORTS		0x03	/* What's supported in statx, iocflags, ... */
> +#define FSINFO_ATTR_TIMESTAMP_INFO	0x04	/* Inode timestamp info */
> +#define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
> +#define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
> +#define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
> +
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
> +
> +/*
> + * Optional fsinfo() parameter structure.
> + *
> + * If this is not given, it is assumed that FSINFO_ATTR_STATFS instance 0,0 is
> + * desired.
> + */
> +struct fsinfo_params {
> +	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
> +	__u32	flags;		/* Flags controlling fsinfo() specifically */
> +#define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
> +#define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
> +#define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
> +	__u32	request;	/* ID of requested attribute */
> +	__u32	Nth;		/* Instance of it (some may have multiple) */
> +	__u32	Mth;		/* Subinstance of Nth instance */
> +	__u32	__reserved32[1]; /* Reserved params; all must be 0 */
> +	__u64	__reserved[3];
> +};
> +
> +enum fsinfo_value_type {
> +	FSINFO_TYPE_VSTRUCT	= 0,	/* Version-lengthed struct (up to 4096 bytes) */
> +	FSINFO_TYPE_STRING	= 1,	/* NUL-term var-length string (up to 4095 chars) */
> +	FSINFO_TYPE_OPAQUE	= 2,	/* Opaque blob (unlimited size) */
> +	FSINFO_TYPE_LIST	= 3,	/* List of ints/structs (unlimited size) */
> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
> + *
> + * This gives information about the attributes supported by fsinfo for the
> + * given path.
> + */
> +struct fsinfo_attribute_info {
> +	unsigned int		attr_id;	/* The ID of the attribute */
> +	enum fsinfo_value_type	type;		/* The type of the attribute's value(s) */
> +	unsigned int		flags;
> +#define FSINFO_FLAGS_N		0x01		/* - Attr has a set of values */
> +#define FSINFO_FLAGS_NM		0x02		/* - Attr has a set of sets of values */
> +	unsigned int		size;		/* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
> +};
> +
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
> +
> +struct fsinfo_u128 {
> +#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +	__u64	hi;
> +	__u64	lo;
> +#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
> +	__u64	lo;
> +	__u64	hi;
> +#endif
> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_STATFS).
> + * - This gives extended filesystem information.
> + */
> +struct fsinfo_statfs {
> +	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
> +	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
> +	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
> +	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
> +	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
> +	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */
> +	__u64	f_bsize;		/* Optimal block size */
> +	__u64	f_frsize;		/* Fragment size */
> +};
> +
> +#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_IDS).
> + *
> + * List of basic identifiers as is normally found in statfs().
> + */
> +struct fsinfo_ids {
> +	char	f_fs_name[15 + 1];	/* Filesystem name */
> +	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
> +	__u64	f_sb_id;		/* Internal superblock ID for watch_sb()/watch_mount() */
> +	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
> +	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
> +	__u32	f_dev_minor;
> +	__u32	__padding[1];
> +};
> +
> +#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
> + *
> + * List of supported filesystem limits.
> + */
> +struct fsinfo_limits {
> +	struct fsinfo_u128 max_file_size;	/* Maximum file size */
> +	struct fsinfo_u128 max_ino;		/* Maximum inode number */
> +	__u64	max_uid;			/* Maximum UID supported */
> +	__u64	max_gid;			/* Maximum GID supported */
> +	__u64	max_projid;			/* Maximum project ID supported */
> +	__u64	max_hard_links;			/* Maximum number of hard links on a file */
> +	__u64	max_xattr_body_len;		/* Maximum xattr content length */
> +	__u32	max_xattr_name_len;		/* Maximum xattr name length */
> +	__u32	max_filename_len;		/* Maximum filename length */
> +	__u32	max_symlink_len;		/* Maximum symlink content length */
> +	__u32	max_dev_major;			/* Maximum device major representable */
> +	__u32	max_dev_minor;			/* Maximum device minor representable */
> +	__u32	__padding[1];
> +};
> +
> +#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
> + *
> + * What's supported in various masks, such as statx() attribute and mask bits
> + * and IOC flags.
> + */
> +struct fsinfo_supports {
> +	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
> +	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
> +	__u32	fs_ioc_getflags;	/* What FS_IOC_GETFLAGS may return */
> +	__u32	fs_ioc_setflags_set;	/* What FS_IOC_SETFLAGS may set */
> +	__u32	fs_ioc_setflags_clear;	/* What FS_IOC_SETFLAGS may clear */
> +	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
> +	__u32	__padding[1];
> +};
> +
> +#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
> +
> +struct fsinfo_timestamp_one {
> +	__s64	minimum;	/* Minimum timestamp value in seconds */
> +	__s64	maximum;	/* Maximum timestamp value in seconds */
> +	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
> +	__s8	gran_exponent;
> +	__u8	__padding[5];
> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
> + */
> +struct fsinfo_timestamp_info {
> +	struct fsinfo_timestamp_one	atime;	/* Access time */
> +	struct fsinfo_timestamp_one	mtime;	/* Modification time */
> +	struct fsinfo_timestamp_one	ctime;	/* Change time */
> +	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
> +};
> +
> +#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
> + */
> +struct fsinfo_volume_uuid {
> +	__u8	uuid[16];
> +};
> +
> +#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
> +
> +#endif /* _UAPI_LINUX_FSINFO_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ce01f86e5db..519317f3904c 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
>  COND_SYSCALL(io_uring_setup);
>  COND_SYSCALL(io_uring_enter);
>  COND_SYSCALL(io_uring_register);
> +COND_SYSCALL(fsinfo);
>  
>  /* fs/xattr.c */
>  
> diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
> index 65acdde5c117..9159ad1d7fc5 100644
> --- a/samples/vfs/Makefile
> +++ b/samples/vfs/Makefile
> @@ -1,10 +1,15 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  # List of programs to build
> +
>  hostprogs := \
> +	test-fsinfo \
>  	test-fsmount \
>  	test-statx
>  
>  always-y := $(hostprogs)
>  
> +HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
> +HOSTLDLIBS_test-fsinfo += -static -lm
> +
>  HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
>  HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> new file mode 100644
> index 000000000000..22fe3c47ff42
> --- /dev/null
> +++ b/samples/vfs/test-fsinfo.c
> @@ -0,0 +1,607 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Test the fsinfo() system call
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#define _GNU_SOURCE
> +#define _ATFILE_SOURCE
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <math.h>
> +#include <fcntl.h>
> +#include <sys/syscall.h>
> +#include <linux/fsinfo.h>
> +#include <linux/socket.h>
> +#include <sys/stat.h>
> +#include <arpa/inet.h>
> +
> +#ifndef __NR_fsinfo
> +#define __NR_fsinfo -1
> +#endif
> +
> +static bool debug = 0;
> +
> +static __attribute__((unused))
> +ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
> +	       void *buffer, size_t buf_size)
> +{
> +	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
> +}
> +
> +struct fsinfo_attribute {
> +	unsigned int		attr_id;
> +	enum fsinfo_value_type	type;
> +	unsigned int		size;
> +	const char		*name;
> +	void (*dump)(void *reply, unsigned int size);
> +};
> +
> +static const struct fsinfo_attribute fsinfo_attributes[];
> +
> +static void dump_hex(unsigned int *data, int from, int to)
> +{
> +	unsigned offset, print_offset = 1, col = 0;
> +
> +	from /= 4;
> +	to = (to + 3) / 4;
> +
> +	for (offset = from; offset < to; offset++) {
> +		if (print_offset) {
> +			printf("%04x: ", offset * 4);
> +			print_offset = 0;
> +		}
> +		printf("%08x", data[offset]);
> +		col++;
> +		if ((col & 3) == 0) {
> +			printf("\n");
> +			print_offset = 1;
> +		} else {
> +			printf(" ");
> +		}
> +	}
> +
> +	if (!print_offset)
> +		printf("\n");
> +}
> +
> +static void dump_attribute_info(void *reply, unsigned int size)
> +{
> +	struct fsinfo_attribute_info *attr_info = reply;
> +	const struct fsinfo_attribute *attr;
> +	char type[32], val_size[32];
> +
> +	switch (attr_info->type) {
> +	case FSINFO_TYPE_VSTRUCT:	strcpy(type, "V-STRUCT");	break;
> +	case FSINFO_TYPE_STRING:	strcpy(type, "STRING");		break;
> +	case FSINFO_TYPE_OPAQUE:	strcpy(type, "OPAQUE");		break;
> +	case FSINFO_TYPE_LIST:		strcpy(type, "LIST");		break;
> +	default:
> +		sprintf(type, "type-%x", attr_info->type);
> +		break;
> +	}
> +
> +	if (attr_info->flags & FSINFO_FLAGS_N)
> +		strcat(type, " x N");
> +	else if (attr_info->flags & FSINFO_FLAGS_NM)
> +		strcat(type, " x NM");
> +
> +	for (attr = fsinfo_attributes; attr->name; attr++)
> +		if (attr->attr_id == attr_info->attr_id)
> +			break;
> +
> +	if (attr_info->size)
> +		sprintf(val_size, "%u", attr_info->size);
> +	else
> +		strcpy(val_size, "-");
> +
> +	printf("%8x %-12s %08x %5s %s\n",
> +	       attr_info->attr_id,
> +	       type,
> +	       attr_info->flags,
> +	       val_size,
> +	       attr->name ? attr->name : "");
> +}
> +
> +static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
> +{
> +	struct fsinfo_statfs *f = reply;
> +
> +	printf("\n");
> +	printf("\tblocks       : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_blocks.lo,
> +	       (unsigned long long)f->f_bfree.lo,
> +	       (unsigned long long)f->f_bavail.lo);
> +
> +	printf("\tfiles        : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_files.lo,
> +	       (unsigned long long)f->f_ffree.lo,
> +	       (unsigned long long)f->f_favail.lo);
> +	printf("\tbsize        : %llu\n", f->f_bsize);
> +	printf("\tfrsize       : %llu\n", f->f_frsize);
> +}
> +
> +static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
> +{
> +	struct fsinfo_ids *f = reply;
> +
> +	printf("\n");
> +	printf("\tdev          : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
> +	printf("\tfs           : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
> +	printf("\tfsid         : %llx\n", (unsigned long long)f->f_fsid);
> +	printf("\tsbid         : %llx\n", (unsigned long long)f->f_sb_id);
> +}
> +
> +static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
> +{
> +	struct fsinfo_limits *f = reply;
> +
> +	printf("\n");
> +	printf("\tmax file size: %llx%016llx\n",
> +	       (unsigned long long)f->max_file_size.hi,
> +	       (unsigned long long)f->max_file_size.lo);
> +	printf("\tmax ino      : %llx%016llx\n",
> +	       (unsigned long long)f->max_ino.hi,
> +	       (unsigned long long)f->max_ino.lo);
> +	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
> +	       (unsigned long long)f->max_uid,
> +	       (unsigned long long)f->max_gid,
> +	       (unsigned long long)f->max_projid);
> +	printf("\tmax dev      : maj=%x min=%x\n",
> +	       f->max_dev_major, f->max_dev_minor);
> +	printf("\tmax links    : %llx\n",
> +	       (unsigned long long)f->max_hard_links);
> +	printf("\tmax xattr    : n=%x b=%llx\n",
> +	       f->max_xattr_name_len,
> +	       (unsigned long long)f->max_xattr_body_len);
> +	printf("\tmax len      : file=%x sym=%x\n",
> +	       f->max_filename_len, f->max_symlink_len);
> +}
> +
> +static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
> +{
> +	struct fsinfo_supports *f = reply;
> +
> +	printf("\n");
> +	printf("\tstx_attr     : %llx\n", (unsigned long long)f->stx_attributes);
> +	printf("\tstx_mask     : %x\n", f->stx_mask);
> +	printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
> +	       f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
> +	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
> +}
> +
> +static void print_time(struct fsinfo_timestamp_one *t, char stamp)
> +{
> +	printf("\t%ctime       : gran=%gs range=%llx-%llx\n",
> +	       stamp,
> +	       t->gran_mantissa * pow(10., t->gran_exponent),
> +	       (long long)t->minimum,
> +	       (long long)t->maximum);
> +}
> +
> +static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
> +{
> +	struct fsinfo_timestamp_info *f = reply;
> +
> +	printf("\n");
> +	print_time(&f->atime, 'a');
> +	print_time(&f->mtime, 'm');
> +	print_time(&f->ctime, 'c');
> +	print_time(&f->btime, 'b');
> +}
> +
> +static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
> +{
> +	struct fsinfo_volume_uuid *f = reply;
> +
> +	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
> +	       "-%02x%02x%02x%02x%02x%02x\n",
> +	       f->uuid[ 0], f->uuid[ 1],
> +	       f->uuid[ 2], f->uuid[ 3],
> +	       f->uuid[ 4], f->uuid[ 5],
> +	       f->uuid[ 6], f->uuid[ 7],
> +	       f->uuid[ 8], f->uuid[ 9],
> +	       f->uuid[10], f->uuid[11],
> +	       f->uuid[12], f->uuid[13],
> +	       f->uuid[14], f->uuid[15]);
> +}
> +
> +static void dump_string(void *reply, unsigned int size)
> +{
> +	char *s = reply, *p;
> +
> +	p = s;
> +	if (size >= 4096) {
> +		size = 4096;
> +		p[4092] = '.';
> +		p[4093] = '.';
> +		p[4094] = '.';
> +		p[4095] = 0;
> +	} else {
> +		p[size] = 0;
> +	}
> +
> +	for (p = s; *p; p++) {
> +		if (!isprint(*p)) {
> +			printf("<non-printable>\n");
> +			return;
> +		}
> +	}
> +
> +	printf("%s\n", s);
> +}
> +
> +#define dump_fsinfo_generic_volume_id		dump_string
> +#define dump_fsinfo_generic_volume_name		dump_string
> +
> +/*
> + * Build the table of attributes that this program knows how to dump.
> + */
> +#define __FSINFO(A, T, S, U, G, F) \
> +	{ .attr_id = A, .type = T, .size = S, .name = #G, .dump = dump_##G }
> +
> +#define _FSINFO(A, T, S, U, G)	  __FSINFO(A, T, S, U, G, 0)
> +#define _FSINFO_N(A, T, S, U, G)  __FSINFO(A, T, S, U, G, FSINFO_FLAGS_N)
> +#define _FSINFO_NM(A, T, S, U, G) __FSINFO(A, T, S, U, G, FSINFO_FLAGS_NM)
> +
> +#define _FSINFO_VSTRUCT(A,S,G)	 _FSINFO    (A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
> +#define _FSINFO_VSTRUCT_N(A,S,G) _FSINFO_N  (A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
> +#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), 0, G)
> +
> +#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
> +#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, 0, G)
> +#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, 0, G)
> +#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, 0, G)
> +#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, 0, G)
> +#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, 0, sizeof(A##__STRUCT), G)
> +#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, 0, sizeof(A##__STRUCT), G)
> +
> +static const struct fsinfo_attribute fsinfo_attributes[] = {
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	fsinfo_generic_volume_name),
> +	{}
> +};
> +
> +static void dump_value(unsigned int attr_id,
> +		       const struct fsinfo_attribute *attr,
> +		       const struct fsinfo_attribute_info *attr_info,
> +		       void *reply, unsigned int size)
> +{
> +	if (!attr || !attr->dump) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
> +		printf("<short data %u/%u>\n", size, attr->size);
> +		return;
> +	}
> +
> +	attr->dump(reply, size);
> +}
> +
> +static void dump_list(unsigned int attr_id,
> +		      const struct fsinfo_attribute *attr,
> +		      const struct fsinfo_attribute_info *attr_info,
> +		      void *reply, unsigned int size)
> +{
> +	size_t elem_size = attr_info->size;
> +	unsigned int ix = 0;
> +
> +	printf("\n");
> +	if (!attr || !attr->dump) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
> +		printf("<short data %u/%u>\n", size, attr->size);
> +		return;
> +	}
> +
> +	while (size >= elem_size) {
> +		printf("\t[%02x] ", ix);
> +		attr->dump(reply, size);
> +		reply += elem_size;
> +		size -= elem_size;
> +		ix++;
> +	}
> +}
> +
> +/*
> + * Call fsinfo, expanding the buffer as necessary.
> + */
> +static ssize_t get_fsinfo(const char *file, const char *name,
> +			  struct fsinfo_params *params, void **_r)
> +{
> +	ssize_t ret;
> +	size_t buf_size = 4096;
> +	void *r;
> +
> +	for (;;) {
> +		r = malloc(buf_size);
> +		if (!r) {
> +			perror("malloc");
> +			exit(1);
> +		}
> +		memset(r, 0xbd, buf_size);
> +
> +		errno = 0;
> +		ret = fsinfo(AT_FDCWD, file, params, r, buf_size);
> +		if (ret == -1) {
> +			free(r);
> +			*_r = NULL;
> +			return ret;
> +		}
> +
> +		if (ret <= buf_size)
> +			break;
> +		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
> +	}
> +
> +	if (debug) {
> +		if (ret == -1)
> +			printf("fsinfo(%s,%s,%u,%u) = %m\n",
> +			       file, name, params->Nth, params->Mth);
> +		else
> +			printf("fsinfo(%s,%s,%u,%u) = %zd\n",
> +			       file, name, params->Nth, params->Mth, ret);
> +	}
> +
> +	*_r = r;
> +	return ret;
> +}
> +
> +/*
> + * Try one subinstance of an attribute.
> + */
> +static int try_one(const char *file, struct fsinfo_params *params,
> +		   const struct fsinfo_attribute_info *attr_info, bool raw)
> +{
> +	const struct fsinfo_attribute *attr;
> +	const char *name;
> +	size_t size = 4096;
> +	char namebuf[32];
> +	void *r;
> +
> +	for (attr = fsinfo_attributes; attr->name; attr++) {
> +		if (attr->attr_id == params->request) {
> +			name = attr->name;
> +			if (strncmp(name, "fsinfo_generic_", 15) == 0)
> +				name += 15;
> +			goto found;
> +		}
> +	}
> +
> +	sprintf(namebuf, "<unknown-%x>", params->request);
> +	name = namebuf;
> +	attr = NULL;
> +
> +found:
> +	size = get_fsinfo(file, name, params, &r);
> +
> +	if (size == -1) {
> +		if (errno == ENODATA) {
> +			if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
> +			    params->Nth == 0 && params->Mth == 0) {
> +				fprintf(stderr,
> +					"Unexpected ENODATA (0x%x{%u}{%u})\n",
> +					params->request, params->Nth, params->Mth);
> +				exit(1);
> +			}
> +			free(r);
> +			return (params->Mth == 0) ? 2 : 1;
> +		}
> +		if (errno == EOPNOTSUPP) {
> +			if (params->Nth > 0 || params->Mth > 0) {
> +				fprintf(stderr,
> +					"Should return -ENODATA (0x%x{%u}{%u})\n",
> +					params->request, params->Nth, params->Mth);
> +				exit(1);
> +			}
> +			//printf("\e[33m%s\e[m: <not supported>\n",
> +			//       fsinfo_attr_names[attr]);
> +			free(r);
> +			return 2;
> +		}
> +		perror(file);
> +		exit(1);
> +	}
> +
> +	if (raw) {
> +		if (size > 4096)
> +			size = 4096;
> +		dump_hex(r, 0, size);
> +		free(r);
> +		return 0;
> +	}
> +
> +	switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
> +	case 0:
> +		printf("\e[33m%s\e[m: ", name);
> +		break;
> +	case FSINFO_FLAGS_N:
> +		printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
> +		break;
> +	case FSINFO_FLAGS_NM:
> +		printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
> +		break;
> +	}
> +
> +	switch (attr_info->type) {
> +	case FSINFO_TYPE_VSTRUCT:
> +	case FSINFO_TYPE_STRING:
> +		dump_value(params->request, attr, attr_info, r, size);
> +		free(r);
> +		return 0;
> +
> +	case FSINFO_TYPE_LIST:
> +		dump_list(params->request, attr, attr_info, r, size);
> +		free(r);
> +		return 0;
> +
> +	case FSINFO_TYPE_OPAQUE:
> +		free(r);
> +		return 0;
> +
> +	default:
> +		fprintf(stderr, "Fishy about %u 0x%x,%x,%x\n",
> +			params->request, attr_info->type, attr_info->flags, attr_info->size);
> +		exit(1);
> +	}
> +}
> +
> +static int cmp_u32(const void *a, const void *b)
> +{
> +	unsigned int x = *(const unsigned int *)a, y = *(const unsigned int *)b;
> +
> +	return (x > y) - (x < y);
> +}
> +
> +/*
> + * Dump every attribute that fsinfo() reports for the given path.
> + */
> +int main(int argc, char **argv)
> +{
> +	struct fsinfo_attribute_info attr_info;
> +	struct fsinfo_params params = {
> +		.at_flags	= AT_SYMLINK_NOFOLLOW,
> +		.flags		= FSINFO_FLAGS_QUERY_PATH,
> +	};
> +	unsigned int *attrs, ret, nr, i;
> +	bool meta = false;
> +	int raw = 0, opt, Nth, Mth;
> +
> +	while ((opt = getopt(argc, argv, "adlmr")) != -1) {
> +		switch (opt) {
> +		case 'a':
> +			params.at_flags |= AT_NO_AUTOMOUNT;
> +			continue;
> +		case 'd':
> +			debug = true;
> +			continue;
> +		case 'l':
> +			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
> +			continue;
> +		case 'm':
> +			meta = true;
> +			continue;
> +		case 'r':
> +			raw = 1;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	argc -= optind;
> +	argv += optind;
> +
> +	if (argc != 1) {
> +		printf("Format: test-fsinfo [-adlmr] <file>\n");
> +		exit(2);
> +	}
> +
> +	/* Retrieve a list of supported attribute IDs */
> +	params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
> +	params.Nth = 0;
> +	params.Mth = 0;
> +	ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
> +	if (ret == -1) {
> +		fprintf(stderr, "Unable to get attribute list: %m\n");
> +		exit(1);
> +	}
> +
> +	if (ret % sizeof(attrs[0])) {
> +		fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
> +		exit(2);
> +	}
> +
> +	nr = ret / sizeof(attrs[0]);
> +	qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
> +
> +	if (meta) {
> +		printf("ATTR ID  TYPE         FLAGS    SIZE  NAME\n");
> +		printf("======== ============ ======== ===== =========\n");
> +		for (i = 0; i < nr; i++) {
> +			params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
> +			params.Nth = attrs[i];
> +			params.Mth = 0;
> +			ret = fsinfo(AT_FDCWD, argv[0], &params, &attr_info, sizeof(attr_info));
> +			if (ret == -1) {
> +				fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
> +				exit(1);
> +			}
> +
> +			dump_attribute_info(&attr_info, ret);
> +		}
> +		exit(0);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
> +		params.Nth = attrs[i];
> +		params.Mth = 0;
> +		ret = fsinfo(AT_FDCWD, argv[0], &params, &attr_info, sizeof(attr_info));
> +		if (ret == -1) {
> +			fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
> +			exit(1);
> +		}
> +
> +		if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
> +		    attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
> +			continue;
> +
> +		if (attrs[i] != attr_info.attr_id) {
> +			fprintf(stderr, "ID for %03x returned %03x\n",
> +				attrs[i], attr_info.attr_id);
> +			break;
> +		}
> +		Nth = 0;
> +		do {
> +			Mth = 0;
> +			do {
> +				params.request = attrs[i];
> +				params.Nth = Nth;
> +				params.Mth = Mth;
> +
> +				switch (try_one(argv[0], &params, &attr_info, raw)) {
> +				case 0:
> +					continue;
> +				case 1:
> +					goto done_M;
> +				case 2:
> +					goto done_N;
> +				}
> +			} while (++Mth < 100);
> +
> +		done_M:
> +			if (Mth >= 100) {
> +				fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
> +				break;
> +			}
> +
> +		} while (++Nth < 100);
> +
> +	done_N:
> +		if (Nth >= 100) {
> +			fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> 
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-25 15:28           ` James Bottomley
  2020-02-25 15:47             ` Steven Whitehouse
@ 2020-02-26  9:11             ` Miklos Szeredi
  2020-02-26 10:51               ` Steven Whitehouse
  2020-02-27  5:06               ` Ian Kent
  2020-02-28 15:52             ` Christian Brauner
  2020-02-28 16:36             ` David Howells
  3 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-26  9:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Steven Whitehouse, Miklos Szeredi, David Howells, viro, Ian Kent,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.

That's true, but procfs/sysfs has to deal with various namespacing
issues anyway.  If this is just about hiding a number of entries, then
I don't think that's going to be a big deal.

The syscall API is efficient: single syscall per query instead of
several, no parsing necessary.

However, it is difficult to extend, because the ABI must be updated,
possibly libc and util-linux also, so that scripts can also consume
the new parameter.  With the sysfs approach only the kernel needs to
be updated, and possibly only the filesystem code, not even the VFS.

So I think the question comes down to:  do we need a highly efficient
way to query the superblock parameters all at once, or not?

Thanks,
Miklos



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-26  9:11             ` Miklos Szeredi
@ 2020-02-26 10:51               ` Steven Whitehouse
  2020-02-27  5:06               ` Ian Kent
  1 sibling, 0 replies; 117+ messages in thread
From: Steven Whitehouse @ 2020-02-26 10:51 UTC (permalink / raw)
  To: Miklos Szeredi, James Bottomley
  Cc: Miklos Szeredi, David Howells, viro, Ian Kent, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

Hi,

On 26/02/2020 09:11, Miklos Szeredi wrote:
> On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>
>> The other thing a file descriptor does that sysfs doesn't is that it
>> solves the information leak: if I'm in a mount namespace that has no
>> access to certain mounts, I can't fspick them and thus I can't see the
>> information.  By default, with sysfs I can.
> That's true, but procfs/sysfs has to deal with various namespacing
> issues anyway.  If this is just about hiding a number of entries, then
> I don't think that's going to be a big deal.
>
> The syscall API is efficient: single syscall per query instead of
> several, no parsing necessary.
>
> However, it is difficult to extend, because the ABI must be updated,
> possibly libc and util-linux also, so that scripts can also consume
> the new parameter.  With the sysfs approach only the kernel needs to
> be updated, and possibly only the filesystem code, not even the VFS.
>
> So I think the question comes down to:  do we need a highly efficient
> way to query the superblock parameters all at once, or not?
>
> Thanks,
> Miklos
>

That is Ian's use case for autofs I think, and it will also be what is 
needed at start up of most applications using the fs notifications, as 
well as at resync time if there has been an overrun leading to lost fs 
notification messages. We do need a solution that can scale to large 
numbers of mounts efficiently. Being able to extend it is an
important consideration too, so hopefully David has a solution to that.

Steve.




* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-26  9:11             ` Miklos Szeredi
  2020-02-26 10:51               ` Steven Whitehouse
@ 2020-02-27  5:06               ` Ian Kent
  2020-02-27  9:36                 ` Miklos Szeredi
  1 sibling, 1 reply; 117+ messages in thread
From: Ian Kent @ 2020-02-27  5:06 UTC (permalink / raw)
  To: Miklos Szeredi, James Bottomley
  Cc: Steven Whitehouse, Miklos Szeredi, David Howells, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Wed, 2020-02-26 at 10:11 +0100, Miklos Szeredi wrote:
> On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> 
> > The other thing a file descriptor does that sysfs doesn't is that
> > it
> > solves the information leak: if I'm in a mount namespace that has
> > no
> > access to certain mounts, I can't fspick them and thus I can't see
> > the
> > information.  By default, with sysfs I can.
> 
> That's true, but procfs/sysfs has to deal with various namespacing
> issues anyway.  If this is just about hiding a number of entries,
> then
> I don't think that's going to be a big deal.

I didn't see name space considerations in sysfs when I was looking at
it recently. Obeying name space requirements is likely a lot of work
in sysfs.

> 
> The syscall API is efficient: single syscall per query instead of
> several, no parsing necessary.
> 
> However, it is difficult to extend, because the ABI must be updated,
> possibly libc and util-linux also, so that scripts can also consume
> the new parameter.  With the sysfs approach only the kernel needs to
> be updated, and possibly only the filesystem code, not even the VFS.
> 
> So I think the question comes down to:  do we need a highly efficient
> way to query the superblock parameters all at once, or not?

Or a similar question could be, how could a sysfs interface work
to provide mount information.

Getting information about all mounts might not be too bad but the
sysfs directory structure that would be needed to represent all
system mounts (without considering name spaces) would likely
result in somewhat busy user space code.

For example, given a path, and the path is all I know, how do I
get mount information?

Ignoring possible multiple mounts on a mount point, call fsinfo()
with the path and get the id (the path walk is low overhead) to
use with fsinfo() to get all the info I need ... done.

Again, ignoring possible multiple mounts on a mount point, and
assuming there is a sysfs tree enumerating all the system mounts.
I could open <sysfs base> + mount point path followed by opening
and reading the individual attribute files ... a bit more busy
that one ... particularly if I need to do it for several thousand
mounts.

Then there's the code that would need to be added to maintain the
various views in the sysfs tree, which can't be restricted only to
the VFS because there's file system specific info needed too (the
maintain a table idea), and that's before considering name space
handling changes to sysfs.

At the least the question of "do we need a highly efficient way
to query the superblock parameters all at once" needs to be
extended to include mount table enumeration as well as getting
the info.

But this is just me thinking about mount table handling and the
quite significant problem we now have with user space scanning
the proc mount tables to get this information.

Ian


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27  5:06               ` Ian Kent
@ 2020-02-27  9:36                 ` Miklos Szeredi
  2020-02-27 11:34                   ` Ian Kent
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-27  9:36 UTC (permalink / raw)
  To: Ian Kent
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:

> At the least the question of "do we need a highly efficient way
> to query the superblock parameters all at once" needs to be
> extended to include mount table enumeration as well as getting
> the info.
>
> But this is just me thinking about mount table handling and the
> quite significant problem we now have with user space scanning
> the proc mount tables to get this information.

Right.

So the problem is that currently autofs needs to rescan the proc mount
table on every change.   The solution to that is to

 - add a notification mechanism
 - and a way to selectively query mount/superblock information

right?

For the notification we have uevents in sysfs, which also supplies the
changed parameters.  Taking aside namespace issues and addressing
mounts would this work for autofs?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27  9:36                 ` Miklos Szeredi
@ 2020-02-27 11:34                   ` Ian Kent
  2020-02-27 13:45                     ` Miklos Szeredi
  0 siblings, 1 reply; 117+ messages in thread
From: Ian Kent @ 2020-02-27 11:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> 
> > At the least the question of "do we need a highly efficient way
> > to query the superblock parameters all at once" needs to be
> > extended to include mount table enumeration as well as getting
> > the info.
> > 
> > But this is just me thinking about mount table handling and the
> > quite significant problem we now have with user space scanning
> > the proc mount tables to get this information.
> 
> Right.
> 
> So the problem is that currently autofs needs to rescan the proc
> mount
> table on every change.   The solution to that is to

Actually no, that's not quite the problem I see.

autofs handles large mount tables fairly well (necessarily) and
in time I plan to remove the need to read the proc tables at all
(that's proven very difficult but I'll get back to that).

This has to be done to resolve the age old problem of autofs not
being able to handle large direct mount maps. But, because of
the large number of mounts associated with large direct mount
maps, other system processes are badly affected too.

So the problem I want to see fixed is the effect of very large
mount tables on other user space applications, particularly the
effect when a large number of mounts or umounts are performed.

Clearly large mount tables not only result from autofs and the
problems caused by them are slightly different to the mount and
umount problem I describe. But they are a problem nevertheless
in the sense that frequent notifications that lead to reading
a large proc mount table has significant overhead that can't be
avoided because the table may have changed since the last time
it was read.

It's easy to cause several system processes to peg a fair number
of CPU's when a large number of mounts/umounts are being performed,
namely systemd, udisks2 and some others. Also I've seen a couple
of application processes badly affected purely by the presence of
a large number of mounts in the proc tables, that's not quite so
bad though.

> 
>  - add a notification mechanism   - lookup a mount based on path
>  - and a way to selectively query mount/superblock information
based on path ...
> 
> right?
> 
> For the notification we have uevents in sysfs, which also supplies
> the
> changed parameters.  Taking aside namespace issues and addressing
> mounts would this work for autofs?

The parameters supplied by the notification mechanism are important.

The place this is needed will be libmount since it catches a broad
number of user space applications, including those I mentioned above
(well at least systemd, I think also udisks2, very probably others).

So that means mount table info. needs to be maintained, whether that
can be achieved using sysfs I don't know. Creating and maintaining
the sysfs tree would be a big challenge I think.

But before trying to work out how to use a notification mechanism
just having a way to get the info provided by the proc tables using
a path alone should give initial immediate improvement in libmount.

Ian


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 11:34                   ` Ian Kent
@ 2020-02-27 13:45                     ` Miklos Szeredi
  2020-02-27 15:14                       ` Karel Zak
  2020-02-28  0:12                       ` Ian Kent
  0 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-27 13:45 UTC (permalink / raw)
  To: Ian Kent
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml, Karel Zak,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> >
> > > At the least the question of "do we need a highly efficient way
> > > to query the superblock parameters all at once" needs to be
> > > extended to include mount table enumeration as well as getting
> > > the info.
> > >
> > > But this is just me thinking about mount table handling and the
> > > quite significant problem we now have with user space scanning
> > > the proc mount tables to get this information.
> >
> > Right.
> >
> > So the problem is that currently autofs needs to rescan the proc
> > mount
> > table on every change.   The solution to that is to
>
> Actually no, that's not quite the problem I see.
>
> autofs handles large mount tables fairly well (necessarily) and
> in time I plan to remove the need to read the proc tables at all
> (that's proven very difficult but I'll get back to that).
>
> This has to be done to resolve the age old problem of autofs not
> being able to handle large direct mount maps. But, because of
> the large number of mounts associated with large direct mount
> maps, other system processes are badly affected too.
>
> So the problem I want to see fixed is the effect of very large
> mount tables on other user space applications, particularly the
> effect when a large number of mounts or umounts are performed.
>
> Clearly large mount tables not only result from autofs and the
> problems caused by them are slightly different to the mount and
> umount problem I describe. But they are a problem nevertheless
> in the sense that frequent notifications that lead to reading
> a large proc mount table has significant overhead that can't be
> avoided because the table may have changed since the last time
> it was read.
>
> It's easy to cause several system processes to peg a fair number
> of CPU's when a large number of mounts/umounts are being performed,
> namely systemd, udisks2 and some others. Also I've seen a couple
> of application processes badly affected purely by the presence of
> a large number of mounts in the proc tables, that's not quite so
> bad though.
>
> >
> >  - add a notification mechanism   - lookup a mount based on path
> >  - and a way to selectively query mount/superblock information
> based on path ...
> >
> > right?
> >
> > For the notification we have uevents in sysfs, which also supplies
> > the
> > changed parameters.  Taking aside namespace issues and addressing
> > mounts would this work for autofs?
>
> The parameters supplied by the notification mechanism are important.
>
> The place this is needed will be libmount since it catches a broad
> number of user space applications, including those I mentioned above
> (well at least systemd, I think also udisks2, very probably others).
>
> So that means mount table info. needs to be maintained, whether that
> can be achieved using sysfs I don't know. Creating and maintaining
> the sysfs tree would be a big challenge I think.
>
> But before trying to work out how to use a notification mechanism
> just having a way to get the info provided by the proc tables using
> a path alone should give initial immediate improvement in libmount.

Adding Karel, Lennart, Zbigniew and util-linux@vger...

At a quick glance at libmount and systemd code, it appears that just
switching out the implementation in libmount will not be enough:
systemd is calling functions like mnt_table_parse_*() when it receives
a notification that the mount table changed.

What is the end purpose of parsing the mount tables?  Can systemd guys
comment on that?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 13:45                     ` Miklos Szeredi
@ 2020-02-27 15:14                       ` Karel Zak
  2020-02-28  0:43                         ` Ian Kent
  2020-02-28  0:12                       ` Ian Kent
  1 sibling, 1 reply; 117+ messages in thread
From: Karel Zak @ 2020-02-27 15:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.

Yes, now you have to generate (in kernel) and parse (in
userspace) the whole mount table to get information about just
one mount table entry. This is typical for umount or systemd.
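
To illustrate the cost described here, the following is a minimal sketch of
a lookup over mountinfo-formatted text (the layout of /proc/self/mountinfo).
find_mount_id() is a hypothetical helper written for this example, not a
libmount function, and it assumes well-formed lines.  Every line is visited
even though only one entry is wanted:

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan mountinfo-formatted text for entries whose mount point equals
 * `target`.  Returns the mount ID from the last matching line, or -1 if
 * no line matches.  Assumes each line carries at least the six leading
 * fields: ID, parent ID, major:minor, root, mount point.
 */
static int find_mount_id(const char *table, const char *target)
{
	int found = -1;
	const char *line = table;

	while (*line) {
		int id, parent;
		unsigned int major, minor;
		char root[256], mnt[256];

		/* Parse the leading fields of the current line. */
		if (sscanf(line, "%d %d %u:%u %255s %255s",
			   &id, &parent, &major, &minor, root, mnt) == 6 &&
		    strcmp(mnt, target) == 0)
			found = id;	/* keep going: a later line may also match */

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return found;
}
```

The work is proportional to the size of the whole table for each lookup,
which is the overhead a per-mount query would avoid.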

> > >  - add a notification mechanism   - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > based on path ...

For umount-like use-cases we need mountpoint/ to mount entry
conversion; I guess something like open(mountpoint/) + fsinfo() 
should be good enough.

For systemd we need the same, but triggered by notification. The ideal
solution is to get mount entry ID or FD from notification and later use this
ID or FD to ask for details about the mount entry (probably again fsinfo()).
The notification has to be usable within an epoll() set.

This solves 99% of our performance issues I guess.

> > So that means mount table info. needs to be maintained, whether that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.

It will still be necessary to get the complete mount table sometimes, but
not in performance-sensitive scenarios.

I'm not sure about sysfs/, you need to somehow resolve namespaces, order
of the mount entries (which one is the last one), etc. IMHO translating a
mountpoint path to a sysfs/ path will be complicated.

> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it receives
> a notification that the mount table changed.

We're ready to change this stuff in systemd if there is something
better (something per-mount-entry).

My plan is to add a new API to libmount to query information about one
mount entry (but I had no time to play with fsinfo yet).

> What is the end purpose of parsing the mount tables?  Can systemd guys
> comment on that?

If mount/umount is triggered by systemd then it needs verification
of success and the final version of the mount options. It also reads
information from libmount to get userspace mount options (e.g.
_netdev -- libmount uses mount source, target and fsroot to join
kernel and userspace stuff).

And don't forget that mount units are part of systemd dependencies, so
umount/mount is an important event for systemd and it needs details about
the changes (what, where, ... etc.)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 13:45                     ` Miklos Szeredi
  2020-02-27 15:14                       ` Karel Zak
@ 2020-02-28  0:12                       ` Ian Kent
  1 sibling, 0 replies; 117+ messages in thread
From: Ian Kent @ 2020-02-28  0:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml, Karel Zak,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, 2020-02-27 at 14:45 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
> > On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net>
> > > wrote:
> > > 
> > > > At the least the question of "do we need a highly efficient way
> > > > to query the superblock parameters all at once" needs to be
> > > > extended to include mount table enumeration as well as getting
> > > > the info.
> > > > 
> > > > But this is just me thinking about mount table handling and the
> > > > quite significant problem we now have with user space scanning
> > > > the proc mount tables to get this information.
> > > 
> > > Right.
> > > 
> > > So the problem is that currently autofs needs to rescan the proc
> > > mount
> > > table on every change.   The solution to that is to
> > 
> > Actually no, that's not quite the problem I see.
> > 
> > autofs handles large mount tables fairly well (necessarily) and
> > in time I plan to remove the need to read the proc tables at all
> > (that's proven very difficult but I'll get back to that).
> > 
> > This has to be done to resolve the age old problem of autofs not
> > being able to handle large direct mount maps. But, because of
> > the large number of mounts associated with large direct mount
> > maps, other system processes are badly affected too.
> > 
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.
> > 
> > Clearly large mount tables not only result from autofs and the
> > problems caused by them are slightly different to the mount and
> > umount problem I describe. But they are a problem nevertheless
> > in the sense that frequent notifications that lead to reading
> > a large proc mount table has significant overhead that can't be
> > avoided because the table may have changed since the last time
> > it was read.
> > 
> > It's easy to cause several system processes to peg a fair number
> > of CPU's when a large number of mounts/umounts are being performed,
> > namely systemd, udisks2 and some others. Also I've seen a couple
> > of application processes badly affected purely by the presence of
> > a large number of mounts in the proc tables, that's not quite so
> > bad though.
> > 
> > >  - add a notification mechanism   - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > based on path ...
> > > right?
> > > 
> > > For the notification we have uevents in sysfs, which also
> > > supplies
> > > the
> > > changed parameters.  Taking aside namespace issues and addressing
> > > mounts would this work for autofs?
> > 
> > The parameters supplied by the notification mechanism are
> > important.
> > 
> > The place this is needed will be libmount since it catches a broad
> > number of user space applications, including those I mentioned
> > above
> > (well at least systemd, I think also udisks2, very probably
> > others).
> > 
> > So that means mount table info. needs to be maintained, whether
> > that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.
> > 
> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it
> receives
> a notification that the mount table changed.

Maybe I wasn't clear, my bad, sorry about that.

There's no question that change notification handling is needed too.

I'm claiming that an initial change to use something that can get
the mount information without using the proc tables alone will give
an "initial immediate improvement".

The work needed to implement mount table change notification
handling will take much more time and exactly what changes that
will bring is not clear yet and I do plan to work on that too,
together with Karel.

Ian


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 15:14                       ` Karel Zak
@ 2020-02-28  0:43                         ` Ian Kent
  2020-02-28  8:35                           ` Miklos Szeredi
  0 siblings, 1 reply; 117+ messages in thread
From: Ian Kent @ 2020-02-28  0:43 UTC (permalink / raw)
  To: Karel Zak, Miklos Szeredi
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, 2020-02-27 at 16:14 +0100, Karel Zak wrote:
> On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > > So the problem I want to see fixed is the effect of very large
> > > mount tables on other user space applications, particularly the
> > > effect when a large number of mounts or umounts are performed.
> 
> Yes, now you have to generate (in kernel) and parse (in
> userspace) the whole mount table to get information about just
> one mount table entry. This is typical for umount or systemd.
> 
> > > >  - add a notification mechanism   - lookup a mount based on
> > > > path
> > > >  - and a way to selectively query mount/superblock information
> > > based on path ...
> 
> For umount-like use-cases we need mountpoint/ to mount entry
> conversion; I guess something like open(mountpoint/) + fsinfo() 
> should be good enough.
> 
> For systemd we need the same, but triggered by notification. The
> ideal
> solution is to get mount entry ID or FD from notification and later
> use this
> ID or FD to ask for details about the mount entry (probably again
> fsinfo()).
> The notification has to be usable within an epoll() set.
> 
> This solves 99% of our performance issues I guess.
> 
> > > So that means mount table info. needs to be maintained, whether
> > > that
> > > can be achieved using sysfs I don't know. Creating and
> > > maintaining
> > > the sysfs tree would be a big challenge I think.
> 
> It will still be necessary to get the complete mount table sometimes,
> but
> not in performance-sensitive scenarios.

That was my understanding too.

Mount table enumeration is possible with fsinfo() but you still
have to handle each and every mount so improvement there is not
going to be as much as cases where the proc mount table needs to
be scanned independently for an individual mount. It will be
somewhat more straightforward without the need to dissect text
records though.

> 
> I'm not sure about sysfs/, you need to somehow resolve namespaces, order
> of the mount entries (which one is the last one), etc. IMHO translating a
> mountpoint path to a sysfs/ path will be complicated.

I wonder about that too, after all sysfs contains a tree of nodes
from which the view is created unlike proc which translates kernel
information directly based on what the process should see.

We'll need to wait a bit and see what Miklos has in mind for mount
table enumeration and nothing has been said about name spaces yet.

While fsinfo() is not similar to proc it does handle name spaces
in a sensible way via file handles, a bit similar to the proc fs,
and ordering is catered for in the fsinfo() enumeration in a natural
way. Not sure how that would be handled using sysfs ...

Ian


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  0:43                         ` Ian Kent
@ 2020-02-28  8:35                           ` Miklos Szeredi
  2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 15:08                             ` James Bottomley
  0 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-28  8:35 UTC (permalink / raw)
  To: Ian Kent
  Cc: Karel Zak, Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Greg Kroah-Hartman, util-linux

On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:

> > I'm not sure about sysfs/, you need to somehow resolve namespaces, order
> > of the mount entries (which one is the last one), etc. IMHO translating a
> > mountpoint path to a sysfs/ path will be complicated.
>
> I wonder about that too, after all sysfs contains a tree of nodes
> from which the view is created unlike proc which translates kernel
> information directly based on what the process should see.
>
> We'll need to wait a bit and see what Miklos has in mind for mount
> table enumeration and nothing has been said about name spaces yet.

Adding Greg for sysfs knowledge.

As far as I understand the sysfs model is, basically:

  - list of devices sorted by class and address
  - with each class having a given set of attributes

Superblocks and mounts could get enumerated by a unique identifier.
mnt_id seems to be good for mounts, s_dev may or may not be good for
superblock, but  s_id (as introduced in this patchset) could be used
instead.

As for namespaces, that's "just" an access control issue, AFAICS.
For example a task with a non-initial mount namespace should not have
access to attributes of mounts outside of its namespace.  Checking
access to superblock attributes would be similar: scan the list of
mounts and only allow access if at least one mount would get access.
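
The superblock check described above can be sketched roughly as follows.
This is only an illustration: struct mount_rec and both helpers are
hypothetical stand-ins made up for this example, not kernel types or
functions, and namespace membership is reduced to comparing an ID:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a mount record; not a kernel type. */
struct mount_rec {
	unsigned int mnt_id;	/* mount identifier */
	unsigned int sb_id;	/* superblock this mount exposes */
	unsigned int ns_id;	/* mount namespace the mount belongs to */
};

/* A mount is visible only from within its own mount namespace. */
static bool mount_visible(const struct mount_rec *m, unsigned int caller_ns)
{
	return m->ns_id == caller_ns;
}

/* A superblock is visible if at least one mount of it is visible. */
static bool sb_visible(const struct mount_rec *mounts, size_t n,
		       unsigned int sb_id, unsigned int caller_ns)
{
	for (size_t i = 0; i < n; i++)
		if (mounts[i].sb_id == sb_id &&
		    mount_visible(&mounts[i], caller_ns))
			return true;
	return false;
}
```

Note that the superblock check is a scan over the mount list, so its cost
grows with the number of mounts.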

> While fsinfo() is not similar to proc it does handle name spaces
> in a sensible way via file handles, a bit similar to the proc fs,
> and ordering is catered for in the fsinfo() enumeration in a natural
> way. Not sure how that would be handled using sysfs ...

I agree that the access control is much more straightforward with
fsinfo(2) and this may be the single biggest reason to introduce a new
syscall.

Let's see what others think.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  8:35                           ` Miklos Szeredi
@ 2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 16:42                               ` David Howells
  2020-02-28 15:08                             ` James Bottomley
  1 sibling, 2 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-02-28 12:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 09:35:17AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/, you need to somehow resolve namespaces, order
> > > of the mount entries (which one is the last one), etc. IMHO translating a
> > > mountpoint path to a sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes

Close enough :)

> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.

So what would the sysfs tree look like with this?

> As for namespaces, that's "just" an access control issue, AFAICS.
> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

sysfs does handle namespaces, look at how networking does this.  But,
it's not exactly the simplest thing to do so, so be careful with that as
this is going to be essential for this type of work.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information [ver #17]
  2020-02-21 18:02 ` [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
  2020-02-26  2:29   ` Aleksa Sarai
@ 2020-02-28 14:44   ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-28 14:44 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: dhowells, viro, raven, mszeredi, christian, jannh, darrick.wong,
	linux-api, linux-fsdevel, linux-kernel

Aleksa Sarai <cyphar@cyphar.com> wrote:

> > If params is given, all of params->__reserved[] must be 0.
> 
> I would suggest that rather than having a reserved field for future
> extensions, you make use of copy_struct_from_user() and have extensible
> structs:

Yeah.  I seem to recall that special support was required for 6-arg syscalls
on some arches, though I could move the dfd argument into the parameter block
and make AT_FDCWD the default.
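
For readers unfamiliar with the extensible-struct convention being
suggested, here is a userspace model of the size rule it relies on.
copy_struct_model() is a made-up name for this sketch; it imitates the
documented behaviour of copy_struct_from_user() but does no user-memory
access checking:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `usize` bytes of a caller's struct into a `ksize`-byte struct known
 * to the callee.  Returns 0 on success, or -E2BIG if the caller set bytes
 * beyond what the callee knows about.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	/* Newer caller: unknown trailing bytes must all be zero. */
	for (size_t i = 0; i < usize - size; i++)
		if (tail[i] != 0)
			return -E2BIG;

	/* Older caller: fields it did not supply default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this rule the struct can grow in later versions without a reserved
field, as long as zero always means "not requested" for new fields.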

> I dropped the "const" on fsinfo_params because the planned CHECK_FIELDS
> feature for extensible-struct syscalls requires writing to the struct.

Ummm...  Why?  You shouldn't be trying to alter the parameters structure.  It
could feasibly be stored static const in userspace (though I'm not sure how
likely it would be that someone would do that).

> I also switched the flags field to u64 because CHECK_FIELDS is intended to
> use (1<<63) for all syscalls (this has the nice benefit of removing the need
> of a padding field entirely).

 	struct fsinfo_params {
 		__u32	flags;
 		__u32	at_flags;
 		__u32	request;
 		__u32	Nth;
 		__u32	Mth;
 	};

What padding? ;-)

Though possibly the struct does need forcing to 64-bit alignment for future
expansion.
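
The size and alignment point can be checked with a small sketch.  It
assumes uint32_t in place of __u32 and the GCC/Clang aligned attribute;
the two struct names are invented for this example and are not part of the
patch:

```c
#include <stddef.h>
#include <stdint.h>

/* The five 32-bit fields from the message, with uint32_t standing in for __u32. */
struct fsinfo_params_plain {
	uint32_t flags;
	uint32_t at_flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/* Same fields, with the whole struct forced to 8-byte alignment. */
struct fsinfo_params_aligned {
	uint32_t flags;
	uint32_t at_flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
} __attribute__((aligned(8)));

/* Five u32 fields pack with no internal or trailing padding. */
_Static_assert(sizeof(struct fsinfo_params_plain) == 20,
	       "plain struct is exactly 5 * 4 bytes");

/* Forcing 8-byte alignment rounds the size up to a multiple of 8. */
_Static_assert(sizeof(struct fsinfo_params_aligned) == 24,
	       "aligned struct gains 4 bytes of tail padding");
```

So the struct as posted has no padding, but forcing 64-bit alignment would
introduce four trailing bytes that then need defined contents.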

> > dfd, filename and params->at_flags indicate the file to query.  There is no
> > equivalent of lstat() as that can be emulated with fsinfo() by setting
> > AT_SYMLINK_NOFOLLOW in params->at_flags.
> 
> Minor gripe -- can we make the default be AT_SYMLINK_NOFOLLOW and you
> need to explicitly pass AT_SYMLINK_FOLLOW? Accidentally following
> symlinks is a constant source of security bugs.

Someone else has said that all new syscalls should be using RESOLVE_* flags in
preference to AT_* flags (even though RESOLVE_* flags are not a superset of
AT_* flags and appear to be in a header named specifically for the openat2()
syscall, not generic).

I'm not sure who authored openat2.h, but they went with a RESOLVE_NO_SYMLINKS
rather than a RESOLVE_SYMLINKS ;-)

> > There is also no equivalent of fstat() as that can be emulated by
> > passing a NULL filename to fsinfo() with the fd of interest in dfd.
> 
> Presumably you also need to pass AT_EMPTY_PATH?

Actually, you need to set FSINFO_FLAGS_QUERY_FD in fsinfo_params::flags.  I
need to update the description for this.

> Sounds good, though I think we should zero-fill the tail end of the
> buffer (if the buffer is larger than the in-kernel one).

I do that.  I should make it clearer in the patch description.

David


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  8:35                           ` Miklos Szeredi
  2020-02-28 12:27                             ` Greg Kroah-Hartman
@ 2020-02-28 15:08                             ` James Bottomley
  2020-02-28 15:40                               ` Miklos Szeredi
  1 sibling, 1 reply; 117+ messages in thread
From: James Bottomley @ 2020-02-28 15:08 UTC (permalink / raw)
  To: Miklos Szeredi, Ian Kent
  Cc: Karel Zak, Miklos Szeredi, Steven Whitehouse, David Howells,
	viro, Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Greg Kroah-Hartman, util-linux

On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/, you need somehow resolve namespaces,
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translate mountpoint path to sysfs/ path will be complicated.
> > 
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> > 
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes
> 
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.
> 
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check:  it can be
made so for label based namespaces like the network, but the mount
namespace is shared/cloned tree based.  Assessing whether a given
superblock is within your current namespace root can become a large
search exercise.  You can see how much of one in fs/proc_namespace.c,
which controls how /proc/self/mounts appears in your current namespace.

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive, as I explained above.  That's really why I
think this is a bad idea.  Sysfs itself is currently nicely restricted
to system information that most containers don't need to know, so a lot
of the sysfs issues with containers can be solved by not mounting it.
If you suddenly make it required for filesystem information and
notifications, that security measure gets blown out of the water.

> > While fsinfo() is not similar to proc it does handle name spaces
> > in a sensible way via file handles, a bit similar to the proc fs,
> > and ordering is catered for in the fsinfo() enumeration in a
> > natural way. Not sure how that would be handled using sysfs ...
> 
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
> 
> Let's see what others think.

Containers are file based entities, so file descriptors are their most
natural thing and they have full ACL protection within the container
(can't open the file, can't then get the fd).  The other reason
container people like file descriptors (all the Xat system calls that
have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in a fd the interior of the container
wouldn't have access to.

James



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 15:08                             ` James Bottomley
@ 2020-02-28 15:40                               ` Miklos Szeredi
  0 siblings, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-28 15:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Greg Kroah-Hartman, util-linux

On Fri, Feb 28, 2020 at 4:09 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> Containers are file based entities, so file descriptors are their most
> natural thing and they have full ACL protection within the container
> (can't open the file, can't then get the fd).  The other reason
> container people like file descriptors (all the Xat system calls that
> have been introduced) is that if we do actually need to break the
> boundaries or privileges of the container, we can do so by getting the
> orchestration system to pass in a fd the interior of the container
> wouldn't have access to.

Yeah, agreed about the simplicity of fd based access.  Then again, a
filesystem interface would be immediately accessible from all scripts,
languages, etc.  That, I think, is a huge bonus compared to the
ioctl-like mess that the current proposal is, which would require
library, utility and language binding updates on every change.  Ugh.

One way to resolve that is to have the mount information
magic-symlinked from /proc/PID/fdmount/FD directly to the mountinfo
dir, which would then have a link into the sbinfo dir, with other
access denied to all except the sysadmin.

Would that work?

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-25 15:28           ` James Bottomley
  2020-02-25 15:47             ` Steven Whitehouse
  2020-02-26  9:11             ` Miklos Szeredi
@ 2020-02-28 15:52             ` Christian Brauner
  2020-02-28 16:36             ` David Howells
  3 siblings, 0 replies; 117+ messages in thread
From: Christian Brauner @ 2020-02-28 15:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Steven Whitehouse, Miklos Szeredi, Miklos Szeredi, David Howells,
	viro, Ian Kent, Christian Brauner, Jann Horn, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Feb 25, 2020 at 07:28:55AM -0800, James Bottomley wrote:
> On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
> > Hi,
> > 
> > On 24/02/2020 15:28, Miklos Szeredi wrote:
> > > On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > 
> > > > Once it's table driven, certainly a sysfs directory becomes
> > > > possible. The problem with ST_DEV is filesystems like btrfs and
> > > > xfs that may have multiple devices.
> > > 
> > > For XFS there's always  a single sb->s_dev though, that's what
> > > st_dev will be set to on all files.
> > > 
> > > Btrfs subvolume is sort of a lightweight superblock, so basically
> > > all such st_dev's are aliases of the same master superblock.  So
> > > lookup of all subvolume st_dev's could result in referencing the
> > > same underlying struct super_block (just like /proc/$PID will
> > > reference the same underlying task group regardless of which of the
> > > task group member's PID is used).
> > > 
> > > Having this info in sysfs would spare us a number of issues that a
> > > set of new syscalls would bring.  The question is, would that be
> > > enough, or is there a reason that sysfs can't be used to present
> > > the various filesystem related information that fsinfo is supposed
> > > to present?
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > 
> > We need a unique id for superblocks anyway. I had wondered about
> > using s_dev some time back, but for the reasons mentioned earlier in
> > this thread I think it might just land up being confusing and
> > difficult to manage. While fake s_devs are created for sbs that don't
> > have a device, I can't help thinking that something closer to
> > ifindex, but for superblocks, is needed here. That would avoid the
> > issue of which device number to use.
> > 
> > In fact we need that anyway for the notifications, since without
> > that  there is a race that can lead to missing remounts of the same
> > device, in  case a umount/mount pair is missed due to an overrun, and
> > then fsinfo returns the same device as before, with potentially the
> > same mount options too. So I think a unique id for a superblock is a
> > generically useful feature, which would also allow for sensible sysfs
> > directory naming, if required,
> 
> But would this be informative and useful for the user?  I'm sure we can
> find a persistent id for a persistent superblock, but what about tmpfs
> ... that's going to have to change with every reboot.  It's going to be
> remarkably inconvenient if I want to get fsinfo on /run to have to keep
> finding what the id is.
> 
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information.  By default, with sysfs I can.

Difficult to figure out which part of the thread to reply to. :)

sysfs strikes me as fundamentally misguided for this task.

Init systems or any large-scale daemon will hate parsing things.  There's
that, and part of the reason why mountinfo sucks is that it means parsing
a potentially enormous file.  Exposing information in sysfs will
require parsing again one way or the other.  I've been discussing these
bottlenecks with Lennart quite a bit, and reliable and performant mount
notifications without needing to parse stuff are very high on the issue
list.  But even if that isn't an issue for some reason, the namespace
aspect is definitely something I'd consider a no-go.
James has been poking at this a little already and I agree. More
specifically, sysfs and proc already are a security nightmare for
namespace-aware workloads and require special care. Not leaking
information in any way is a difficult task. I mean, over the last two
years I sent quite a lot of patches to the networking-namespace aware
part of sysfs alone either fixing information leaks, or making other
parts namespace aware that weren't and were causing issues (There's
another large-ish series sitting in Dave's tree right now.). And tbh,
network namespacing in sysfs is imho trivial compared to what we would
need to do to handle mount namespacing and especially mount propagation.
fsinfo() is a way cleaner and ultimately simpler approach. We very much
want it file-descriptor based. The mount api opens up the road to secure
and _delegatable_ querying of filesystem information.
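
To make the parsing cost concrete, here is a rough sketch of what a daemon
has to do for every line of /proc/self/mountinfo today.  The struct and
function names are made up for the example, and only the first few fields
are extracted:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the fields this sketch extracts. */
struct mount_entry {
	int mnt_id;
	int parent_id;
	char mountpoint[256];
};

/*
 * Parse the leading fields of one mountinfo line:
 *   mount-ID parent-ID major:minor root mount-point ...
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_mountinfo_line(const char *line, struct mount_entry *out)
{
	int n;

	assert(line != NULL && out != NULL);
	/* major:minor and root are matched but not stored (%*). */
	n = sscanf(line, "%d %d %*u:%*u %*s %255s",
		   &out->mnt_id, &out->parent_id, out->mountpoint);
	return n == 3 ? 0 : -1;
}
```

Every change to the mount table means re-reading the file and running
something like this over each line again, which is the bottleneck described
above.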

Christian


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 12:27                             ` Greg Kroah-Hartman
@ 2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 17:15                                 ` Al Viro
  2020-03-02 10:34                                 ` Karel Zak
  2020-02-28 16:42                               ` David Howells
  1 sibling, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-02-28 16:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> > Superblocks and mounts could get enumerated by a unique identifier.
> > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > superblock, but  s_id (as introduced in this patchset) could be used
> > instead.
>
> So what would the sysfs tree look like with this?

For a start something like this:

mounts/$MOUNT_ID/
  parent -> ../$PARENT_ID
  super -> ../../supers/$SUPER_ID
  root: path from mount root to fs root (could be optional as usually
they are the same)
  mountpoint -> $MOUNTPOINT
  flags: mount flags
  propagation: mount propagation
  children/$CHILD_ID -> ../../$CHILD_ID

 supers/$SUPER_ID/
   type: fstype
   source: mount source (devname)
   options: csv of mount options

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-25 15:28           ` James Bottomley
                               ` (2 preceding siblings ...)
  2020-02-28 15:52             ` Christian Brauner
@ 2020-02-28 16:36             ` David Howells
  2020-03-02  9:09               ` Miklos Szeredi
  3 siblings, 1 reply; 117+ messages in thread
From: David Howells @ 2020-02-28 16:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, James Bottomley, Steven Whitehouse, Miklos Szeredi,
	Miklos Szeredi, viro, Ian Kent, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml

sysfs also has some other disadvantages for this:

 (1) There's a potential chicken-and-egg problem in that you have to create a
     bunch of files and dirs in sysfs for every created mount and superblock
     (possibly excluding special ones like the socket mount) - but this
     includes sysfs itself.  This might work - provided you create sysfs
     first.

 (2) sysfs is memory intensive.  The directory structure has to be backed by
     dentries and inodes that linger as long as the referenced object does
     (procfs is more efficient in this regard for files that aren't being
     accessed).

 (3) It gives people extra, indirect ways to pin mount objects and
     superblocks.

For the moment, fsinfo() gives you three ways of referring to a filesystem
object:

 (a) Directly by path.

 (b) By path associated with an fd.

 (c) By mount ID (perm checked by working back up the tree).

but will need to add:

 (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the superblock
     may not even exist yet.

David



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 16:24                               ` Miklos Szeredi
@ 2020-02-28 16:42                               ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: David Howells @ 2020-02-28 16:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Greg Kroah-Hartman, Ian Kent, Karel Zak,
	Miklos Szeredi, James Bottomley, Steven Whitehouse, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

Miklos Szeredi <miklos@szeredi.hu> wrote:

>   children/$CHILD_ID -> ../../$CHILD_ID

This would really suck.  This bit would particularly affect rescanning time.

You also really want to read the entire child set atomically and, ideally,
include notification counters.
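
As a sketch of why counters delivered alongside the child list help with
rescanning: given two snapshots of (mount ID, change counter) pairs, userspace
can pick out just the children that are new or have changed.  The struct
layout and function name below are assumptions for illustration, not the
fsinfo() ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-child record: mount ID plus its change counter. */
struct child_entry {
	uint32_t mnt_id;
	uint32_t change_counter;
};

/*
 * Compare an old and a current snapshot of a mount's children.  IDs of
 * children that are new, or whose counter moved, are written to out_ids
 * (which must have room for n_cur entries).  Returns how many were written.
 */
size_t children_to_rescan(const struct child_entry *old, size_t n_old,
			  const struct child_entry *cur, size_t n_cur,
			  uint32_t *out_ids)
{
	size_t n_out = 0;

	assert(out_ids != NULL);
	for (size_t i = 0; i < n_cur; i++) {
		int unchanged = 0;

		for (size_t j = 0; j < n_old; j++) {
			if (old[j].mnt_id == cur[i].mnt_id &&
			    old[j].change_counter == cur[i].change_counter) {
				unchanged = 1;
				break;
			}
		}
		if (!unchanged)
			out_ids[n_out++] = cur[i].mnt_id;
	}
	return n_out;
}
```

This only works if each snapshot is internally consistent, which is the
reason for wanting the whole child set read atomically.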

>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

There's a lot more to fsinfo() than just this lot - and there's the
possibility that some of the values may change depending on exactly which file
you're looking at.

David



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 16:24                               ` Miklos Szeredi
@ 2020-02-28 17:15                                 ` Al Viro
  2020-03-02  8:43                                   ` Miklos Szeredi
  2020-03-02 10:34                                 ` Karel Zak
  1 sibling, 1 reply; 117+ messages in thread
From: Al Viro @ 2020-02-28 17:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Greg Kroah-Hartman, Ian Kent, Karel Zak, Miklos Szeredi,
	James Bottomley, Steven Whitehouse, David Howells,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as usually
> they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

Oh, wonderful.  So let me see if I got it right - any namespace operation
can create/destroy/move around an arbitrary amount of sysfs objects.
Better yet, we suddenly have to express the lifetime rules for struct mount
and struct superblock in terms of struct device garbage.

I'm less than thrilled by the entire fsinfo circus, but this really takes
the cake.

In case it needs to be spelled out: NAK.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 17:15                                 ` Al Viro
@ 2020-03-02  8:43                                   ` Miklos Szeredi
  0 siblings, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-02  8:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Greg Kroah-Hartman, Ian Kent, Karel Zak, Miklos Szeredi,
	James Bottomley, Steven Whitehouse, David Howells,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 6:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> > On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> >
> > > > Superblocks and mounts could get enumerated by a unique identifier.
> > > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > > superblock, but  s_id (as introduced in this patchset) could be used
> > > > instead.
> > >
> > > So what would the sysfs tree look like with this?
> >
> > For a start something like this:
> >
> > mounts/$MOUNT_ID/
> >   parent -> ../$PARENT_ID
> >   super -> ../../supers/$SUPER_ID
> >   root: path from mount root to fs root (could be optional as usually
> > they are the same)
> >   mountpoint -> $MOUNTPOINT
> >   flags: mount flags
> >   propagation: mount propagation
> >   children/$CHILD_ID -> ../../$CHILD_ID
> >
> >  supers/$SUPER_ID/
> >    type: fstype
> >    source: mount source (devname)
> >    options: csv of mount options
>
> Oh, wonderful.  So let me see if I got it right - any namespace operation
> can create/destroy/move around an arbitrary amount of sysfs objects.

Parent/children symlinks may be excessive...

> Better yet, we suddenly have to express the lifetime rules for struct mount
> and struct superblock in terms of struct device garbage.

How so?   struct mount and struct superblock would hold a ref on
struct device, not the other way round.

In any case, I'm not insistent on the use of sysfs device classes for
this; struct device (488B) does seem too heavy for struct mount
(328B).

What I'm pretty sure about is that a read(2) based interface would be
way more useful than the syscall multiplexer that the current proposal
is.

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 16:36             ` David Howells
@ 2020-03-02  9:09               ` Miklos Szeredi
  2020-03-02  9:38                 ` Greg Kroah-Hartman
  2020-03-03  5:27                 ` Ian Kent
  0 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-02  9:09 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Ian Kent, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
>
> sysfs also has some other disadvantages for this:
>
>  (1) There's a potential chicken-and-egg problem in that you have to create a
>      bunch of files and dirs in sysfs for every created mount and superblock
>      (possibly excluding special ones like the socket mount) - but this
>      includes sysfs itself.  This might work - provided you create sysfs
>      first.

Sysfs architecture looks something like this (I hope Greg will correct
me if I'm wrong):

device driver -> kobj tree <- sysfs tree

The kobj tree is created by the device driver, and the dentry tree is
created on demand from the kobj tree.   Lifetime of kobjs is bound to
both the sysfs objects and the device but not the other way round.
I.e. device can go away while the sysfs object is still being
referenced, and sysfs can be freely mounted and unmounted
independently of device initialization.

So there's no ordering requirement between sysfs mounts and other
mounts.   I might be wrong on the details, since mounts are created
very early in the boot process...

>
>  (2) sysfs is memory intensive.  The directory structure has to be backed by
>      dentries and inodes that linger as long as the referenced object does
>      (procfs is more efficient in this regard for files that aren't being
>      accessed)

See above: I don't think dentries and inodes are pinned, only kobjs
and their associated cruft.  Which may be too heavy, depending on the
details of the kobj tree.

>  (3) It gives people extra, indirect ways to pin mount objects and
>      superblocks.

See above.

> For the moment, fsinfo() gives you three ways of referring to a filesystem
> object:
>
>  (a) Directly by path.

A path is always representable by an O_PATH descriptor.

>
>  (b) By path associated with an fd.

See my proposal about linking from /proc/$PID/fdmount/$FD ->
/sys/devices/virtual/mounts/$MOUNT_ID.

>
>  (c) By mount ID (perm checked by working back up the tree).

Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
The proc symlink would bypass the lookup check by directly jumping to
the mountinfo dir.

> but will need to add:
>
>  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the superblock
>      may not even exist yet.

Proc symlink would work for that too.

If sysfs is too heavy, this could be proc or a completely new
filesystem.  The implementation is much less relevant at this stage of
the discussion than the interface.

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-02  9:09               ` Miklos Szeredi
@ 2020-03-02  9:38                 ` Greg Kroah-Hartman
  2020-03-03  5:27                 ` Ian Kent
  1 sibling, 0 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-02  9:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Ian Kent,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Mon, Mar 02, 2020 at 10:09:51AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
> >
> > sysfs also has some other disadvantages for this:
> >
> >  (1) There's a potential chicken-and-egg problem in that you have to create a
> >      bunch of files and dirs in sysfs for every created mount and superblock
> >      (possibly excluding special ones like the socket mount) - but this
> >      includes sysfs itself.  This might work - provided you create sysfs
> >      first.
> 
> Sysfs architecture looks something like this (I hope Greg will correct
> me if I'm wrong):
> 
> device driver -> kobj tree <- sysfs tree
> 
> The kobj tree is created by the device driver, and the dentry tree is
> created on demand from the kobj tree.   Lifetime of kobjs is bound to
> both the sysfs objects and the device but not the other way round.
> I.e. device can go away while the sysfs object is still being
> referenced, and sysfs can be freely mounted and unmounted
> independently of device initialization.
> 
> So there's no ordering requirement between sysfs mounts and other
> mounts.   I might be wrong on the details, since mounts are created
> very early in the boot process...
> 
> >
> >  (2) sysfs is memory intensive.  The directory structure has to be backed by
> >      dentries and inodes that linger as long as the referenced object does
> >      (procfs is more efficient in this regard for files that aren't being
> >      accessed)
> 
> See above: I don't think dentries and inodes are pinned, only kobjs
> and their associated cruft.  Which may be too heavy, depending on the
> details of the kobj tree.

That is correct, they should not be pinned, that is what kernfs handles
and why we can handle 30k virtual block devices on a 31bit s390 instance
:)

So you shouldn't have to worry about memory for sysfs.

There are probably loads of other reasons not to use sysfs for this,
though :)

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 17:15                                 ` Al Viro
@ 2020-03-02 10:34                                 ` Karel Zak
  1 sibling, 0 replies; 117+ messages in thread
From: Karel Zak @ 2020-03-02 10:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Greg Kroah-Hartman, Ian Kent, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as usually
> they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options:

What about use-cases where I have no ID, but I have a mountpoint path
(e.g. "umount /foo")?  In this case I have to go to open() + fsinfo()
and then sysfs does not make sense for me, right?

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-02  9:09               ` Miklos Szeredi
  2020-03-02  9:38                 ` Greg Kroah-Hartman
@ 2020-03-03  5:27                 ` Ian Kent
  2020-03-03  7:46                   ` Miklos Szeredi
  2020-03-03  9:12                   ` David Howells
  1 sibling, 2 replies; 117+ messages in thread
From: Ian Kent @ 2020-03-03  5:27 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
> > sysfs also has some other disadvantages for this:
> > 
> >  (1) There's a potential chicken-and-egg problem in that you have to create a
> >      bunch of files and dirs in sysfs for every created mount and superblock
> >      (possibly excluding special ones like the socket mount) - but this
> >      includes sysfs itself.  This might work - provided you create sysfs
> >      first.
> 
> Sysfs architecture looks something like this (I hope Greg will correct
> me if I'm wrong):
> 
> device driver -> kobj tree <- sysfs tree
> 
> The kobj tree is created by the device driver, and the dentry tree is
> created on demand from the kobj tree.   Lifetime of kobjs is bound to
> both the sysfs objects and the device but not the other way round.
> I.e. device can go away while the sysfs object is still being
> referenced, and sysfs can be freely mounted and unmounted
> independently of device initialization.
> 
> So there's no ordering requirement between sysfs mounts and other
> mounts.   I might be wrong on the details, since mounts are created
> very early in the boot process...
> 
> >  (2) sysfs is memory intensive.  The directory structure has to be backed by
> >      dentries and inodes that linger as long as the referenced object does
> >      (procfs is more efficient in this regard for files that aren't being
> >      accessed)
> 
> See above: I don't think dentries and inodes are pinned, only kobjs
> and their associated cruft.  Which may be too heavy, depending on the
> details of the kobj tree.
> 
> >  (3) It gives people extra, indirect ways to pin mount objects and
> >      superblocks.
> 
> See above.
> 
> > For the moment, fsinfo() gives you three ways of referring to a filesystem
> > object:
> > 
> >  (a) Directly by path.
> 
> A path is always representable by an O_PATH descriptor.
> 
> >  (b) By path associated with an fd.
> 
> See my proposal about linking from /proc/$PID/fdmount/$FD ->
> /sys/devices/virtual/mounts/$MOUNT_ID.
> 
> >  (c) By mount ID (perm checked by working back up the tree).
> 
> Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
> The proc symlink would bypass the lookup check by directly jumping to
> the mountinfo dir.
> 
> > but will need to add:
> > 
> >  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the superblock
> >      may not even exist yet.
> 
> Proc symlink would work for that too.

There's mount enumeration too; ordering is required to identify the
top (or bottom, depending on terminology) mount when there is more than
one mount on a mount point.

> 
> If sysfs is too heavy, this could be proc or a completely new
> filesystem.  The implementation is much less relevant at this stage of
> the discussion than the interface.

Ha, proc with the seq file interface; that has already proved not to
work properly and looks difficult to fix.

Ian



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  5:27                 ` Ian Kent
@ 2020-03-03  7:46                   ` Miklos Szeredi
  2020-03-06 16:25                     ` Miklos Szeredi
  2020-03-03  9:12                   ` David Howells
  1 sibling, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03  7:46 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 6:28 AM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
> > On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com>
> > wrote:
> > > sysfs also has some other disadvantages for this:
> > >
> > >  (1) There's a potential chicken-and-egg problem in that you have
> > > to create a
> > >      bunch of files and dirs in sysfs for every created mount and
> > > superblock
> > >      (possibly excluding special ones like the socket mount) - but
> > > this
> > >      includes sysfs itself.  This might work - provided you create
> > > sysfs
> > >      first.
> >
> > Sysfs architecture looks something like this (I hope Greg will
> > correct
> > me if I'm wrong):
> >
> > device driver -> kobj tree <- sysfs tree
> >
> > The kobj tree is created by the device driver, and the dentry tree is
> > created on demand from the kobj tree.   Lifetime of kobjs is bound to
> > both the sysfs objects and the device but not the other way round.
> > I.e. device can go away while the sysfs object is still being
> > referenced, and sysfs can be freely mounted and unmounted
> > independently of device initialization.
> >
> > So there's no ordering requirement between sysfs mounts and other
> > mounts.   I might be wrong on the details, since mounts are created
> > very early in the boot process...
> >
> > >  (2) sysfs is memory intensive.  The directory structure has to be
> > > backed by
> > >      dentries and inodes that linger as long as the referenced
> > > object does
> > >      (procfs is more efficient in this regard for files that aren't
> > > being
> > >      accessed)
> >
> > See above: I don't think dentries and inodes are pinned, only kobjs
> > and their associated cruft.  Which may be too heavy, depending on the
> > details of the kobj tree.
> >
> > >  (3) It gives people extra, indirect ways to pin mount objects and
> > >      superblocks.
> >
> > See above.
> >
> > > For the moment, fsinfo() gives you three ways of referring to a
> > > filesystem
> > > object:
> > >
> > >  (a) Directly by path.
> >
> > A path is always representable by an O_PATH descriptor.
> >
> > >  (b) By path associated with an fd.
> >
> > See my proposal about linking from /proc/$PID/fdmount/$FD ->
> > /sys/devices/virtual/mounts/$MOUNT_ID.
> >
> > >  (c) By mount ID (perm checked by working back up the tree).
> >
> > Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
> > The proc symlink would bypass the lookup check by directly jumping to
> > the mountinfo dir.
> >
> > > but will need to add:
> > >
> > >  (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the
> > > superblock
> > >      may not even exist yet.
> >
> > Proc symlink would work for that too.
>
> There's mounts enumeration too, ordering is required to identify the
> top (or bottom depending on terminology) with more than one mount on
> a mount point.
>
> >
> > If sysfs is too heavy, this could be proc or a completely new
> > filesystem.  The implementation is much less relevant at this stage
> > of
> > the discussion than the interface.
>
> Ha, proc with the seq file interface, that's already proved to not
> work properly and looks difficult to fix.

I'm doing a patch.   Let's see how it fares in the face of all these
preconceptions.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  5:27                 ` Ian Kent
  2020-03-03  7:46                   ` Miklos Szeredi
@ 2020-03-03  9:12                   ` David Howells
  2020-03-03  9:26                     ` Miklos Szeredi
  1 sibling, 1 reply; 117+ messages in thread
From: David Howells @ 2020-03-03  9:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Ian Kent, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

Miklos Szeredi <miklos@szeredi.hu> wrote:

> I'm doing a patch.   Let's see how it fares in the face of all these
> preconceptions.

Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
that scanning /proc/mounts when there are a lot of mounts in the system is
slow (not to mention the global lock that is held during the read).

Now, going with sysfs files on top of procfs links might avoid the global
lock, and you can avoid rereading the options string if you export a change
notification, but you're going to end up injecting a whole lot of pathwalk
latency into the system.

On top of that, it isn't going to help with the case that I'm working towards
implementing where a container manager can monitor for mounts taking place
inside the container and supervise them.  What I'm proposing is that during
the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
referring to the context under construction to the manager, which would then
be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
permit it.  Something like:

	fd = receive_context_to_supervise();
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_SB_OPTIONS,
	};
	fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
	supervise_parameters(buffer);
	fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
	params = (struct fsinfo_params) {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_SB_NOTIFICATIONS,
	};
	struct fsinfo_sb_notifications sbnotify;
	fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
	watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
	close(fd);

However, the supervised mount may be happening in a completely different set
of namespaces, in which case the supervisor presumably wouldn't be able to see
the links in procfs and the relevant portions of sysfs.

David


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  9:12                   ` David Howells
@ 2020-03-03  9:26                     ` Miklos Szeredi
  2020-03-03  9:48                       ` Miklos Szeredi
                                         ` (2 more replies)
  0 siblings, 3 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03  9:26 UTC (permalink / raw)
  To: David Howells
  Cc: Ian Kent, Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
>
> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> that scanning /proc/mounts when there are a lot of mounts in the system is
> slow (not to mention the global lock that is held during the read).
>
> Now, going with sysfs files on top of procfs links might avoid the global
> lock, and you can avoid rereading the options string if you export a change
> notification, but you're going to end up injecting a whole lot of pathwalk
> latency into the system.

Completely irrelevant.  Cached lookup is so heavily optimized that you
won't be able to see any of it.

No, I don't think this is going to be a performance issue at all, but
if anything we could introduce a syscall

  ssize_t readfile(int dfd, const char *path, char *buf, size_t
bufsize, int flags);

that is basically the equivalent of open + read + close, or even a
vectored variant that reads multiple files.  But that's off topic
again, since I don't think there's going to be any performance issue
even with plain I/O syscalls.

>
> On top of that, it isn't going to help with the case that I'm working towards
> implementing where a container manager can monitor for mounts taking place
> inside the container and supervise them.  What I'm proposing is that during
> the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> referring to the context under construction to the manager, which would then
> be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> permit it.  Something like:
>
>         fd = receive_context_to_supervise();
>         struct fsinfo_params params = {
>                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
>                 .request        = FSINFO_ATTR_SB_OPTIONS,
>         };
>         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
>         supervise_parameters(buffer);
>         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
>         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
>         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
>         struct fsinfo_params params = {
>                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
>                 .request        = FSINFO_ATTR_SB_NOTIFICATIONS,
>         };
>         struct fsinfo_sb_notifications sbnotify;
>         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
>         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
>         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
>         close(fd);
>
> However, the supervised mount may be happening in a completely different set
> of namespaces, in which case the supervisor presumably wouldn't be able to see
> the links in procfs and the relevant portions of sysfs.

It would be a "jump" link to the otherwise invisible directory.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  9:26                     ` Miklos Szeredi
@ 2020-03-03  9:48                       ` Miklos Szeredi
  2020-03-03 10:21                         ` Steven Whitehouse
  2020-03-03 10:00                       ` Christian Brauner
  2020-03-03 11:38                       ` Karel Zak
  2 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03  9:48 UTC (permalink / raw)
  To: David Howells
  Cc: Ian Kent, Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > that scanning /proc/mounts when there are a lot of mounts in the system is
> > slow (not to mention the global lock that is held during the read).

BTW, I do feel that there's room for improvement in userspace code as
well.  Even a quite big mount table could be scanned for *changes* very
efficiently.  I.e. cache previous contents of /proc/self/mountinfo and
compare with new contents, line-by-line.  Only need to parse the
changed/added/removed lines.
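
A minimal sketch of that comparison, assuming both snapshots are held in
memory as NUL-terminated text.  The helper names (snapshot_has_line,
lines_only_in) are made up for illustration, and the quadratic search is
only there to keep the example short; real code would more likely hash
the lines or key them on the mount ID field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the text "snap" contains a line equal to line[0..len). */
static int snapshot_has_line(const char *snap, const char *line, size_t len)
{
	while (*snap) {
		const char *eol = strchr(snap, '\n');
		size_t n = eol ? (size_t)(eol - snap) : strlen(snap);

		if (n == len && memcmp(snap, line, len) == 0)
			return 1;
		snap += n;
		if (*snap == '\n')
			snap++;
	}
	return 0;
}

/* Callback type: receives one unmatched line (not NUL-terminated). */
typedef void (*line_cb)(const char *line, size_t len, void *arg);

/*
 * Invoke cb (if non-NULL) for every non-empty line of "to" that has no
 * identical line in "from", and return how many such lines there were.
 * Only these lines would need to be parsed by the caller.
 */
static size_t lines_only_in(const char *to, const char *from,
			    line_cb cb, void *arg)
{
	size_t count = 0;

	assert(to && from);

	while (*to) {
		const char *eol = strchr(to, '\n');
		size_t n = eol ? (size_t)(eol - to) : strlen(to);

		if (n > 0 && !snapshot_has_line(from, to, n)) {
			if (cb)
				cb(to, n, arg);
			count++;
		}
		to += n;
		if (*to == '\n')
			to++;
	}
	return count;
}
```

Calling lines_only_in(new, old, ...) yields the added or changed lines,
and lines_only_in(old, new, ...) yields the removed or changed ones; a
changed mount shows up in both passes.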

Also it would be pretty easy to throttle the number of updates so
systemd et al. wouldn't hog the system with unnecessary processing.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  9:26                     ` Miklos Szeredi
  2020-03-03  9:48                       ` Miklos Szeredi
@ 2020-03-03 10:00                       ` Christian Brauner
  2020-03-03 10:13                         ` Miklos Szeredi
  2020-03-03 11:38                       ` Karel Zak
  2 siblings, 1 reply; 117+ messages in thread
From: Christian Brauner @ 2020-03-03 10:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Ian Kent, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > that scanning /proc/mounts when there are a lot of mounts in the system is
> > slow (not to mention the global lock that is held during the read).
> >
> > Now, going with sysfs files on top of procfs links might avoid the global
> > lock, and you can avoid rereading the options string if you export a change
> > notification, but you're going to end up injecting a whole lot of pathwalk
> > latency into the system.
> 
> Completely irrelevant.  Cached lookup is so much optimized, that you
> won't be able to see any of it.
> 
> No, I don't think this is going to be a performance issue at all, but
> if anything we could introduce a syscall
> 
>   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> bufsize, int flags);
> 
> that is basically the equivalent of open + read + close, or even a
> vectored variant that reads multiple files.  But that's off topic
> again, since I don't think there's going to be any performance issue
> even with plain I/O syscalls.
> 
> >
> > On top of that, it isn't going to help with the case that I'm working towards
> > implementing where a container manager can monitor for mounts taking place
> > inside the container and supervise them.  What I'm proposing is that during
> > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > referring to the context under construction to the manager, which would then
> > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > permit it.  Something like:
> >
> >         fd = receive_context_to_supervise();
> >         struct fsinfo_params params = {
> >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> >         };
> >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> >         supervise_parameters(buffer);
> >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> >         struct fsinfo_params params = {
> >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> >                 .request        = FSINFO_ATTR_SB_NOTIFICATIONS,
> >         };
> >         struct fsinfo_sb_notifications sbnotify;
> >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> >         close(fd);
> >
> > However, the supervised mount may be happening in a completely different set
> > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > the links in procfs and the relevant portions of sysfs.
> 
> It would be a "jump" link to the otherwise invisible directory.

More magic links to beam you around sounds like a bad idea. We had a
bunch of CVEs around them in containers and they were one of the major
reasons behind us pushing for openat2(). That's why it has a
RESOLVE_NO_MAGICLINKS flag.

Christian

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 10:00                       ` Christian Brauner
@ 2020-03-03 10:13                         ` Miklos Szeredi
  2020-03-03 10:25                           ` Christian Brauner
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03 10:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Ian Kent, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> > >
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > preconceptions.
> > >
> > > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > > that scanning /proc/mounts when there are a lot of mounts in the system is
> > > slow (not to mention the global lock that is held during the read).
> > >
> > > Now, going with sysfs files on top of procfs links might avoid the global
> > > lock, and you can avoid rereading the options string if you export a change
> > > notification, but you're going to end up injecting a whole lot of pathwalk
> > > latency into the system.
> >
> > Completely irrelevant.  Cached lookup is so much optimized, that you
> > won't be able to see any of it.
> >
> > No, I don't think this is going to be a performance issue at all, but
> > if anything we could introduce a syscall
> >
> >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > bufsize, int flags);
> >
> > that is basically the equivalent of open + read + close, or even a
> > vectored variant that reads multiple files.  But that's off topic
> > again, since I don't think there's going to be any performance issue
> > even with plain I/O syscalls.
> >
> > >
> > > On top of that, it isn't going to help with the case that I'm working towards
> > > implementing where a container manager can monitor for mounts taking place
> > > inside the container and supervise them.  What I'm proposing is that during
> > > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > > referring to the context under construction to the manager, which would then
> > > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > > permit it.  Something like:
> > >
> > >         fd = receive_context_to_supervise();
> > >         struct fsinfo_params params = {
> > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> > >         };
> > >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> > >         supervise_parameters(buffer);
> > >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> > >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> > >         struct fsinfo_params params = {
> > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > >                 .request        = FSINFO_ATTR_SB_NOTIFICATIONS,
> > >         };
> > >         struct fsinfo_sb_notifications sbnotify;
> > >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> > >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> > >         close(fd);
> > >
> > > However, the supervised mount may be happening in a completely different set
> > > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > > the links in procfs and the relevant portions of sysfs.
> >
> > It would be a "jump" link to the otherwise invisible directory.
>
> More magic links to beam you around sounds like a bad idea. We had a
> bunch of CVEs around them in containers and they were one of the major
> reasons behind us pushing for openat2(). That's why it has a
> RESOLVE_NO_MAGICLINKS flag.

No, that link wouldn't beam you around at all, it would end up in an
internally mounted instance of a mountfs, a safe place where no
dangerous CVEs roam.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  9:48                       ` Miklos Szeredi
@ 2020-03-03 10:21                         ` Steven Whitehouse
  2020-03-03 10:32                           ` Miklos Szeredi
  0 siblings, 1 reply; 117+ messages in thread
From: Steven Whitehouse @ 2020-03-03 10:21 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Ian Kent, Christian Brauner, James Bottomley, Miklos Szeredi,
	viro, Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Greg Kroah-Hartman

Hi,

On 03/03/2020 09:48, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
>>> Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>>> I'm doing a patch.   Let's see how it fares in the face of all these
>>>> preconceptions.
>>> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
>>> that scanning /proc/mounts when there are a lot of mounts in the system is
>>> slow (not to mention the global lock that is held during the read).
> BTW, I do feel that there's room for improvement in userspace code as
> well.  Even quite big mount table could be scanned for *changes* very
> efficiently.  l.e. cache previous contents of /proc/self/mountinfo and
> compare with new contents, line-by-line.  Only need to parse the
> changed/added/removed lines.
>
> Also it would be pretty easy to throttle the number of updates so
> systemd et al. wouldn't hog the system with unnecessary processing.
>
> Thanks,
> Miklos
>

At least having patches to compare would allow us to look at the 
performance here and gain some numbers, which would be helpful to frame 
the discussions. However I'm not seeing how it would be easy to throttle 
updates... they occur at whatever rate they are generated and this can 
be fairly high. Also I'm not sure that I follow how the notifications 
and the dumping of the whole table are synchronized in this case, either.

Al has pointed out before that a single mount operation on a subtree can 
generate a large number of changes on that subtree. That kind of 
scenario will need to be dealt with efficiently so that we don't miss 
things, and we also minimize the possibility of overruns, and additional 
overhead on the mount changes themselves, by keeping the notification 
messages small.

We should also look at what the likely worst case might be. I seem to 
remember from what Ian has said in the past that there can be tens of 
thousands of autofs mounts on some large systems. I assume that worst 
case might be something like that, but multiplied by however many 
containers might be on a system. Can anybody think of a situation which 
might require even more mounts?

The network subsystem had a similar problem... they use rtnetlink for 
the routing information, and just like the proposal here it contains a 
dump mechanism, and a way to listen to events (add/remove routes) which 
is synchronized with that dump. Ian did start looking at netlink some 
time ago, but it also has some issues (it is in the network namespace 
not the fs namespace, it also has various things accumulated over the 
years that we don't need for filesystems) but that was part of the 
original inspiration for the fs notifications.

There is also, of course, /proc/net/route which can be useful in many
circumstances, but for efficiency and synchronization reasons it is not
the interface of choice for routing protocols. David's proposal has a
number of the important attributes of an rtnetlink-like (in a conceptual
sense) solution, and I remain skeptical that a sysfs or similar
interface would be an efficient solution to the original problem, even
if it might perhaps make a useful addition.

There is also the chicken-and-egg issue, in the sense that if the 
interface is via a filesystem (sysfs, proc or whatever), how does one 
receive a notification for that filesystem itself being mounted until 
after it has been mounted? Maybe that is not a particular problem, but I 
think a cleaner solution would not require a mount in order to watch for 
other mounts.

Steve.




^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 10:13                         ` Miklos Szeredi
@ 2020-03-03 10:25                           ` Christian Brauner
  2020-03-03 11:33                             ` Miklos Szeredi
  0 siblings, 1 reply; 117+ messages in thread
From: Christian Brauner @ 2020-03-03 10:25 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Ian Kent, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> > > >
> > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> > > > that scanning /proc/mounts when there are a lot of mounts in the system is
> > > > slow (not to mention the global lock that is held during the read).
> > > >
> > > > Now, going with sysfs files on top of procfs links might avoid the global
> > > > lock, and you can avoid rereading the options string if you export a change
> > > > notification, but you're going to end up injecting a whole lot of pathwalk
> > > > latency into the system.
> > >
> > > Completely irrelevant.  Cached lookup is so much optimized, that you
> > > won't be able to see any of it.
> > >
> > > No, I don't think this is going to be a performance issue at all, but
> > > if anything we could introduce a syscall
> > >
> > >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > > bufsize, int flags);
> > >
> > > that is basically the equivalent of open + read + close, or even a
> > > vectored variant that reads multiple files.  But that's off topic
> > > again, since I don't think there's going to be any performance issue
> > > even with plain I/O syscalls.
> > >
> > > >
> > > > On top of that, it isn't going to help with the case that I'm working towards
> > > > implementing where a container manager can monitor for mounts taking place
> > > > inside the container and supervise them.  What I'm proposing is that during
> > > > the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig() would hand an fd
> > > > referring to the context under construction to the manager, which would then
> > > > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or
> > > > permit it.  Something like:
> > > >
> > > >         fd = receive_context_to_supervise();
> > > >         struct fsinfo_params params = {
> > > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > > >                 .request        = FSINFO_ATTR_SB_OPTIONS,
> > > >         };
> > > >         fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
> > > >         supervise_parameters(buffer);
> > > >         fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
> > > >         fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
> > > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);
> > > >         struct fsinfo_params params = {
> > > >                 .flags          = FSINFO_FLAGS_QUERY_FSCONTEXT,
> > > >                 .request        = FSINFO_ATTR_SB_NOTIFICATIONS,
> > > >         };
> > > >         struct fsinfo_sb_notifications sbnotify;
> > > >         fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
> > > >         watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);
> > > >         fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
> > > >         close(fd);
> > > >
> > > > However, the supervised mount may be happening in a completely different set
> > > > of namespaces, in which case the supervisor presumably wouldn't be able to see
> > > > the links in procfs and the relevant portions of sysfs.
> > >
> > > It would be a "jump" link to the otherwise invisible directory.
> >
> > More magic links to beam you around sounds like a bad idea. We had a
> > bunch of CVEs around them in containers and they were one of the major
> > reasons behind us pushing for openat2(). That's why it has a
> > RESOLVE_NO_MAGICLINKS flag.
> 
> No, that link wouldn't beam you around at all, it would end up in an
> internally mounted instance of a mountfs, a safe place where no

Even if it is a magic link to a safe place, it's still a magic link. They
aren't a great solution to this problem. fsinfo() is cleaner and
simpler as it creates a context for a supervised mount, which gives a
managing application fine-grained control and makes it easily
extendable.
Also, we're apparently at the point where it seems we're suggesting
another (pseudo)filesystem to get information about filesystems.

Christian

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 10:21                         ` Steven Whitehouse
@ 2020-03-03 10:32                           ` Miklos Szeredi
  2020-03-03 11:09                             ` Ian Kent
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03 10:32 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: David Howells, Ian Kent, Christian Brauner, James Bottomley,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> Hi,
>
> On 03/03/2020 09:48, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >>> Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>
> >>>> I'm doing a patch.   Let's see how it fares in the face of all these
> >>>> preconceptions.
> >>> Don't forget the efficiency criterion.  One reason for going with fsinfo(2) is
> >>> that scanning /proc/mounts when there are a lot of mounts in the system is
> >>> slow (not to mention the global lock that is held during the read).
> > BTW, I do feel that there's room for improvement in userspace code as
> > well.  Even quite big mount table could be scanned for *changes* very
> > efficiently.  l.e. cache previous contents of /proc/self/mountinfo and
> > compare with new contents, line-by-line.  Only need to parse the
> > changed/added/removed lines.
> >
> > Also it would be pretty easy to throttle the number of updates so
> > systemd et al. wouldn't hog the system with unnecessary processing.
> >
> > Thanks,
> > Miklos
> >
>
> At least having patches to compare would allow us to look at the
> performance here and gain some numbers, which would be helpful to frame
> the discussions. However I'm not seeing how it would be easy to throttle
> updates... they occur at whatever rate they are generated and this can
> be fairly high. Also I'm not sure that I follow how the notifications
> and the dumping of the whole table are synchronized in this case, either.

What I meant is optimizing current userspace without additional kernel
infrastructure.   Since currently there's only the monolithic
/proc/self/mountinfo, it's reasonable that if the rate of change is
very high, then we don't re-read this table on every change, only
within a reasonable time limit (e.g. 1s) to provide timely updates.
Re-reading the table on every change would (does?) slow down the
system so that the actual updates would even be slower, so throttling
in this case very much  makes sense.
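
A minimal sketch of that kind of throttling in C follows. The struct
and function names are made up for illustration, and the 1000 ms
interval is just the 1s example above, not a recommendation:

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper state: when the mount table was last re-read. */
struct reread_throttle {
	struct timespec last;	/* time of the last re-read */
	long min_interval_ms;	/* e.g. 1000 for a 1s limit */
	bool pending;		/* a change arrived but was not processed yet */
};

static long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000L +
	       (b->tv_nsec - a->tv_nsec) / 1000000L;
}

/*
 * Call on every mount table change event.  Returns true if the table
 * should be re-read now, false if the re-read should be deferred.
 */
static bool throttle_should_reread(struct reread_throttle *t,
				   const struct timespec *now)
{
	if (elapsed_ms(&t->last, now) >= t->min_interval_ms) {
		t->last = *now;
		t->pending = false;
		return true;
	}
	t->pending = true;	/* re-read once the interval has expired */
	return false;
}
```

When this returns false the caller would arm a timer for the rest of
the interval and re-read once, however many events arrived meanwhile.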

Once we have per-mount information from the kernel, throttling updates
probably does not make sense.
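
For reference, the line-by-line comparison of cached and fresh
/proc/self/mountinfo contents suggested earlier could look roughly like
the sketch below.  The function names are hypothetical, and the
quadratic search is only there to keep the example short; a real
implementation would presumably hash or sort the lines:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the 'len' bytes at 'line' occur as a whole line in 'text'. */
static bool has_line(const char *text, const char *line, size_t len)
{
	while (*text) {
		const char *eol = strchr(text, '\n');
		size_t n = eol ? (size_t)(eol - text) : strlen(text);

		if (n == len && memcmp(text, line, len) == 0)
			return true;
		if (!eol)
			break;
		text = eol + 1;
	}
	return false;
}

/*
 * Counts the lines of 'cur' that do not appear in 'prev' and, if 'cb'
 * is non-NULL, passes each such line to it.  Only these lines need to
 * be parsed.  Swap the arguments to find removed lines.
 */
static size_t mountinfo_new_lines(const char *prev, const char *cur,
				  void (*cb)(const char *line, size_t len))
{
	size_t count = 0;

	while (*cur) {
		const char *eol = strchr(cur, '\n');
		size_t n = eol ? (size_t)(eol - cur) : strlen(cur);

		if (!has_line(prev, cur, n)) {
			count++;
			if (cb)
				cb(cur, n);
		}
		if (!eol)
			break;
		cur = eol + 1;
	}
	return count;
}
```

A changed mount shows up as one removed line plus one added line.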

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 10:32                           ` Miklos Szeredi
@ 2020-03-03 11:09                             ` Ian Kent
  0 siblings, 0 replies; 117+ messages in thread
From: Ian Kent @ 2020-03-03 11:09 UTC (permalink / raw)
  To: Miklos Szeredi, Steven Whitehouse
  Cc: David Howells, Christian Brauner, James Bottomley,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, 2020-03-03 at 11:32 +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > Hi,
> > 
> > On 03/03/2020 09:48, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu
> > > > wrote:
> > > > On Tue, Mar 3, 2020 at 10:13 AM David Howells <
> > > > dhowells@redhat.com> wrote:
> > > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > 
> > > > > > I'm doing a patch.   Let's see how it fares in the face of
> > > > > > all these
> > > > > > preconceptions.
> > > > > Don't forget the efficiency criterion.  One reason for going
> > > > > with fsinfo(2) is
> > > > > that scanning /proc/mounts when there are a lot of mounts in
> > > > > the system is
> > > > > slow (not to mention the global lock that is held during the
> > > > > read).
> > > BTW, I do feel that there's room for improvement in userspace
> > > code as
> > > well.  Even quite big mount table could be scanned for *changes*
> > > very
> > > efficiently.  I.e. cache previous contents of
> > > /proc/self/mountinfo and
> > > compare with new contents, line-by-line.  Only need to parse the
> > > changed/added/removed lines.
> > > 
> > > Also it would be pretty easy to throttle the number of updates so
> > > systemd et al. wouldn't hog the system with unnecessary
> > > processing.
> > > 
> > > Thanks,
> > > Miklos
> > > 
> > 
> > At least having patches to compare would allow us to look at the
> > performance here and gain some numbers, which would be helpful to
> > frame
> > the discussions. However I'm not seeing how it would be easy to
> > throttle
> > updates... they occur at whatever rate they are generated and this
> > can
> > be fairly high. Also I'm not sure that I follow how the
> > notifications
> > and the dumping of the whole table are synchronized in this case,
> > either.
> 
> What I meant is optimizing current userspace without additional
> kernel
> infrastructure.   Since currently there's only the monolithic
> /proc/self/mountinfo, it's reasonable that if the rate of change is
> very high, then we don't re-read this table on every change, only
> within a reasonable time limit (e.g. 1s) to provide timely updates.
> Re-reading the table on every change would (does?) slow down the
> system so that the actual updates would even be slower, so throttling
> in this case very much  makes sense.

Optimizing user space is a huge task.

For example, consider this (which is related to a recent upstream
discussion I had):
https://blog.janestreet.com/troubleshooting-systemd-with-systemtap/

Working on improving libmount is really useful, but that can't help
with inherently inefficient approaches to keeping information current,
which is actually needed at times.

> 
> Once we have per-mount information from the kernel, throttling
> updates
> probably does not make sense.

And can easily lead to application problems. Throttling will
lead to an inability to have up to date information upon which
application decisions are made.

I don't think it's a viable solution to the separate problem
of a large number of notifications either.

Ian



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 10:25                           ` Christian Brauner
@ 2020-03-03 11:33                             ` Miklos Szeredi
  2020-03-03 11:56                               ` Christian Brauner
  0 siblings, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03 11:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Ian Kent, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:

> > > More magic links to beam you around sounds like a bad idea. We had a
> > > bunch of CVEs around them in containers and they were one of the major
> > > reasons behind us pushing for openat2(). That's why it has a
> > > RESOLVE_NO_MAGICLINKS flag.
> >
> > No, that link wouldn't beam you around at all, it would end up in an
> > internally mounted instance of a mountfs, a safe place where no
>
> Even if it is a magic link to a safe place it's a magic link. They
> aren't a great solution to this problem. fsinfo() is cleaner and
> simpler as it creates a context for a supervised mount which gives the
> managing application fine-grained control and makes it easily
> extendable.

Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure,
fsinfo() is way better than ioctl(), but at the core it's still the
same syscall multiplexer, do-everything hack.

> Also, we're apparently at the point where it seems we're suggesting
> another (pseudo)filesystem to get information about filesystems.

Implementation detail.  Why would you care?

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  9:26                     ` Miklos Szeredi
  2020-03-03  9:48                       ` Miklos Szeredi
  2020-03-03 10:00                       ` Christian Brauner
@ 2020-03-03 11:38                       ` Karel Zak
  2020-03-03 13:03                         ` Greg Kroah-Hartman
  2020-03-03 14:09                         ` David Howells
  2 siblings, 2 replies; 117+ messages in thread
From: Karel Zak @ 2020-03-03 11:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Ian Kent, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> No, I don't think this is going to be a performance issue at all, but
> if anything we could introduce a syscall
> 
>   ssize_t readfile(int dfd, const char *path, char *buf,
>                    size_t bufsize, int flags);

off-topic, but I'll buy you many many beers if you implement it ;-),
because open + read + close is pretty common for /sys and /proc in
many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
etc. is all about it.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 11:33                             ` Miklos Szeredi
@ 2020-03-03 11:56                               ` Christian Brauner
  0 siblings, 0 replies; 117+ messages in thread
From: Christian Brauner @ 2020-03-03 11:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Ian Kent, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 03, 2020 at 12:33:48PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> 
> > > > More magic links to beam you around sounds like a bad idea. We had a
> > > > bunch of CVEs around them in containers and they were one of the major
> > > > reasons behind us pushing for openat2(). That's why it has a
> > > > RESOLVE_NO_MAGICLINKS flag.
> > >
> > > No, that link wouldn't beam you around at all, it would end up in an
> > > internally mounted instance of a mountfs, a safe place where no
> >
> > Even if it is a magic link to a safe place it's a magic link. They
> > aren't a great solution to this problem. fsinfo() is cleaner and
> > simpler as it creates a context for a supervised mount which gives the
> > managing application fine-grained control and makes it easily
> > extendable.
> 
> Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure,
> fsinfo() is way better than ioctl(), but at the core it's still the
> same syscall multiplexer, do-everything hack.

In contrast to a generic ioctl() it's a domain-specific separate
syscall. You can't suddenly set kvm options through fsinfo() I would
hope. I find it at least debatable that a new filesystem is preferable.
And - feel free to simply dismiss the concerns I expressed - so far
there has not been a lot of excitement about this idea.

> 
> > Also, we're apparently at the point where it seems we're suggesting
> > another (pseudo)filesystem to get information about filesystems.
> 
> Implementation detail.  Why would you care?

I wouldn't call this an implementation detail. That's quite a big
design choice; it's a separate filesystem. In addition, implementation
details need to be maintained.

Christian


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 11:38                       ` Karel Zak
@ 2020-03-03 13:03                         ` Greg Kroah-Hartman
  2020-03-03 13:14                           ` Greg Kroah-Hartman
  2020-03-04  2:01                           ` Ian Kent
  2020-03-03 14:09                         ` David Howells
  1 sibling, 2 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 13:03 UTC (permalink / raw)
  To: Karel Zak
  Cc: Miklos Szeredi, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > No, I don't think this is going to be a performance issue at all, but
> > if anything we could introduce a syscall
> > 
> >   ssize_t readfile(int dfd, const char *path, char *buf,
> >                    size_t bufsize, int flags);
> 
> off-topic, but I'll buy you many many beers if you implement it ;-),
> because open + read + close is pretty common for /sys and /proc in
> many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
> etc. is all about it.

Unlimited beers for a 21-line kernel patch?  Sign me up!

Totally untested, barely compiled patch below.

Actually, I like this idea (the syscall, not just the unlimited beers).
Maybe this could make a lot of sense, I'll write some actual tests for
it now that syscalls are getting "heavy" again due to CPU vendors
finally paying the price for their madness...

thanks,

greg k-h
-------------------


diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..178cd45340e2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+439	common	readfile		__x86_sys_readfile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..1a830fada750 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+		char __user *, buffer, size_t, bufsize, int, flags)
+{
+	int retval;
+	int fd;
+
+	if (force_o_largefile())
+		flags |= O_LARGEFILE;
+
+	fd = do_sys_open(dfd, filename, flags, O_RDONLY);
+	if (fd <= 0)
+		return fd;
+
+	retval = ksys_read(fd, buffer, bufsize);
+
+	__close_fd(current->files, fd);
+
+	return retval;
+}


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:03                         ` Greg Kroah-Hartman
@ 2020-03-03 13:14                           ` Greg Kroah-Hartman
  2020-03-03 13:34                             ` Miklos Szeredi
  2020-03-04  2:01                           ` Ian Kent
  1 sibling, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 13:14 UTC (permalink / raw)
  To: Karel Zak
  Cc: Miklos Szeredi, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 02:03:47PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > No, I don't think this is going to be a performance issue at all, but
> > > if anything we could introduce a syscall
> > > 
> > >   ssize_t readfile(int dfd, const char *path, char *buf,
> > >                    size_t bufsize, int flags);
> > 
> > off-topic, but I'll buy you many many beers if you implement it ;-),
> > because open + read + close is pretty common for /sys and /proc in
> > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd
> > etc. is all about it.
> 
> Unlimited beers for a 21-line kernel patch?  Sign me up!
> 
> Totally untested, barely compiled patch below.

Ok, that didn't even build, let me try this for real now...


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:14                           ` Greg Kroah-Hartman
@ 2020-03-03 13:34                             ` Miklos Szeredi
  2020-03-03 13:43                               ` Greg Kroah-Hartman
  2020-03-03 14:23                               ` Christian Brauner
  0 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03 13:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Karel Zak, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> > Unlimited beers for a 21-line kernel patch?  Sign me up!
> >
> > Totally untested, barely compiled patch below.
>
> Ok, that didn't even build, let me try this for real now...

Some comments on the interface:

O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.

Use the openat2 style arguments; limit the accepted flags to sane ones
(e.g. don't let this syscall create a file).

If buffer is too small to fit the whole file, return error.

Verify that the number of bytes read matches the file size, otherwise
return error (may need to loop?).

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:34                             ` Miklos Szeredi
@ 2020-03-03 13:43                               ` Greg Kroah-Hartman
  2020-03-03 14:10                                 ` Greg Kroah-Hartman
                                                   ` (2 more replies)
  2020-03-03 14:23                               ` Christian Brauner
  1 sibling, 3 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 13:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Karel Zak, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > >
> > > Totally untested, barely compiled patch below.
> >
> > Ok, that didn't even build, let me try this for real now...
> 
> Some comments on the interface:

Ok, hey, let's do this proper :)

> O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.

Good point.

> Use the openat2 style arguments; limit the accepted flags to sane ones
> (e.g. don't let this syscall create a file).

Yeah, I just added that check to my local version:
	/* Mask off all O_ flags as we only want to read from the file */
	flags &= ~(VALID_OPEN_FLAGS);
	flags |= O_RDONLY | O_LARGEFILE;

> If buffer is too small to fit the whole file, return error.

Why?  What's wrong with just returning the bytes asked for?  If someone
only wants 5 bytes from the front of a file, it should be fine to give
that to them, right?

> Verify that the number of bytes read matches the file size, otherwise
> return error (may need to loop?).

No, we can't "match file size" as sysfs files do not really have a sane
"size".  So I don't want to loop at all here, one-shot, that's all you
get :)

Let me actually do this and try it out for real.

/me has no idea what he is getting himself into...

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 11:38                       ` Karel Zak
  2020-03-03 13:03                         ` Greg Kroah-Hartman
@ 2020-03-03 14:09                         ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: David Howells @ 2020-03-03 14:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: dhowells, Karel Zak, Miklos Szeredi, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> Actually, I like this idea (the syscall,

It might mesh well with atomic_open in some way.

David



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:43                               ` Greg Kroah-Hartman
@ 2020-03-03 14:10                                 ` Greg Kroah-Hartman
  2020-03-03 14:13                                   ` Jann Horn
  2020-03-03 14:10                                 ` Miklos Szeredi
  2020-03-03 14:19                                 ` David Howells
  2 siblings, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 14:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Karel Zak, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > 
> > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > >
> > > > Totally untested, barely compiled patch below.
> > >
> > > Ok, that didn't even build, let me try this for real now...
> > 
> > Some comments on the interface:
> 
> Ok, hey, let's do this proper :)

Alright, how about this patch.

Actually tested with some simple sysfs files.

If people don't strongly object, I'll add "real" tests to it, hook it up
to all arches, write a manpage, and all the fun fluff a new syscall
deserves and submit it "for real".

It feels like I'm doing something wrong in that the actual syscall
logic is just so small.  Maybe I'll benchmark this thing to see if it
makes any real difference...

thanks,

greg k-h

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: [PATCH] readfile: implement readfile syscall

It's a tiny syscall, meant to allow a user to do a single "open this
file, read into this buffer, and close the file" all in a single shot.

Should be good for reading "tiny" files like sysfs, procfs, and other
"small" files.

There is no restarting the syscall, am trying to keep it simple.  At
least for now.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/open.c                              | 21 +++++++++++++++++++++
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  4 +++-
 5 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c17cb77eb150..a79cd025e72b 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
+439	i386	readfile		sys_readfile			__ia32_sys_readfile
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..4f518f4e0e30 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+439	common	readfile		__x64_sys_readfile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..109bad47d542 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1340,3 +1340,24 @@ int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+		char __user *, buffer, size_t, bufsize, int, flags)
+{
+	int retval;
+	int fd;
+
+	/* Mask off all O_ flags as we only want to read from the file */
+	flags &= ~(VALID_OPEN_FLAGS);
+	flags |= O_RDONLY | O_LARGEFILE;
+
+	fd = do_sys_open(dfd, filename, flags, 0000);
+	if (fd <= 0)
+		return fd;
+
+	retval = ksys_read(fd, buffer, bufsize);
+
+	__close_fd(current->files, fd);
+
+	return retval;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..3a636a913437 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1003,6 +1003,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_readfile(int dfd, const char __user *filename,
+			     char __user *buffer, size_t bufsize, int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..31f84500915d 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -855,9 +855,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_openat2, sys_openat2)
 #define __NR_pidfd_getfd 438
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+#define __NR_readfile 439
+__SYSCALL(__NR_readfile, sys_readfile)
 
 #undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440
 
 /*
  * 32 bit systems traditionally used different
-- 
2.25.1



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:43                               ` Greg Kroah-Hartman
  2020-03-03 14:10                                 ` Greg Kroah-Hartman
@ 2020-03-03 14:10                                 ` Miklos Szeredi
  2020-03-03 14:29                                   ` Greg Kroah-Hartman
                                                     ` (2 more replies)
  2020-03-03 14:19                                 ` David Howells
  2 siblings, 3 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-03 14:10 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Karel Zak, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:

> > If buffer is too small to fit the whole file, return error.
>
> Why?  What's wrong with just returning the bytes asked for?  If someone
> only wants 5 bytes from the front of a file, it should be fine to give
> that to them, right?

I think we need to signal in some way to the caller that the result
was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
caller might be surprised.
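
To illustrate the difference: getcwd(2) and getxattr(2) fail with
ERANGE when the buffer is too small, whereas readlink(2) truncates
silently and leaves it to the caller to notice that the buffer was
filled completely.  A wrapper that turns the latter into an explicit
error might look like this (the wrapper name is made up for the
example):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Like readlink(2), but NUL-terminates the result and fails with ERANGE
 * instead of returning a possibly truncated target.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int readlink_checked(const char *path, char *buf, size_t bufsize)
{
	ssize_t n = readlink(path, buf, bufsize);

	if (n < 0)
		return -1;

	/* A completely filled buffer means the target may have been cut off. */
	if ((size_t)n >= bufsize) {
		errno = ERANGE;
		return -1;
	}

	buf[n] = '\0';
	return 0;
}
```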

>
> > Verify that the number of bytes read matches the file size, otherwise
> > return error (may need to loop?).
>
> No, we can't "match file size" as sysfs files do not really have a sane
> "size".  So I don't want to loop at all here, one-shot, that's all you
> get :)

Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
until read return zero) might be a good idea regardless, because short
reads are allowed.
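
In userspace terms, the loop I have in mind is the usual one below.
This is only a sketch of the semantics, not of how the syscall would
implement it, and the function name is made up:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reads from fd until EOF or until the buffer is full, coping with
 * short reads.  Returns the total number of bytes read, or -1 with
 * errno set.
 */
static ssize_t read_until_eof(int fd, char *buf, size_t bufsize)
{
	size_t total = 0;

	while (total < bufsize) {
		ssize_t n = read(fd, buf + total, bufsize - total);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			return -1;
		}
		if (n == 0)
			break;			/* EOF */
		total += (size_t)n;
	}

	return (ssize_t)total;
}
```

Note that a return value equal to bufsize does not tell the caller
whether EOF was reached, which is the truncation problem again.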

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:10                                 ` Greg Kroah-Hartman
@ 2020-03-03 14:13                                   ` Jann Horn
  2020-03-03 14:24                                     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 117+ messages in thread
From: Jann Horn @ 2020-03-03 14:13 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml, Jens Axboe

On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > >
> > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > >
> > > > > Totally untested, barely compiled patch below.
> > > >
> > > > Ok, that didn't even build, let me try this for real now...
> > >
> > > Some comments on the interface:
> >
> > Ok, hey, let's do this proper :)
>
> Alright, how about this patch.
>
> Actually tested with some simple sysfs files.
>
> If people don't strongly object, I'll add "real" tests to it, hook it up
> to all arches, write a manpage, and all the fun fluff a new syscall
> deserves and submit it "for real".

Just FYI, io_uring is moving towards the same kind of thing... IIRC
you can already use it to batch a bunch of open() calls, then batch a
bunch of read() calls on all the new fds and close them at the same
time. And I think they're planning to add support for doing
open()+read()+close() all in one go, too, except that passing the
file descriptor forward in a generic way is a bit complicated.

> It feels like I'm doing something wrong in that the actual syscall
> logic is just so small.  Maybe I'll benchmark this thing to see if it
> makes any real difference...
>
> thanks,
>
> greg k-h
>
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Subject: [PATCH] readfile: implement readfile syscall
>
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files like sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall, am trying to keep it simple.  At
> least for now.
>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[...]
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +               char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +       int retval;
> +       int fd;
> +
> +       /* Mask off all O_ flags as we only want to read from the file */
> +       flags &= ~(VALID_OPEN_FLAGS);
> +       flags |= O_RDONLY | O_LARGEFILE;
> +
> +       fd = do_sys_open(dfd, filename, flags, 0000);
> +       if (fd <= 0)
> +               return fd;
> +
> +       retval = ksys_read(fd, buffer, bufsize);
> +
> +       __close_fd(current->files, fd);
> +
> +       return retval;
> +}

If you're gonna do something like that, wouldn't you want to also
elide the use of the file descriptor table completely? do_sys_open()
will have to do atomic operations in the fd table and stuff, which is
probably moderately bad in terms of cacheline bouncing if this is used
in a multithreaded context; and as a side effect, the fd would be
inherited by anyone who calls fork() concurrently. You'll probably
want to use APIs like do_filp_open() and filp_close(), or something
like that, instead.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:43                               ` Greg Kroah-Hartman
  2020-03-03 14:10                                 ` Greg Kroah-Hartman
  2020-03-03 14:10                                 ` Miklos Szeredi
@ 2020-03-03 14:19                                 ` David Howells
  2020-03-03 16:59                                   ` Greg Kroah-Hartman
  2 siblings, 1 reply; 117+ messages in thread
From: David Howells @ 2020-03-03 14:19 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: dhowells, Miklos Szeredi, Karel Zak, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> +	fd = do_sys_open(dfd, filename, flags, 0000);
> +	if (fd <= 0)
> +		return fd;
> +
> +	retval = ksys_read(fd, buffer, bufsize);
> +
> +	__close_fd(current->files, fd);

If you can use dentry_open() and vfs_read(), you might be able to avoid
dealing with file descriptors entirely.  That might make it worth a syscall.

You're going to be asked for writefile() you know ;-)

David



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:34                             ` Miklos Szeredi
  2020-03-03 13:43                               ` Greg Kroah-Hartman
@ 2020-03-03 14:23                               ` Christian Brauner
  2020-03-03 15:23                                 ` Greg Kroah-Hartman
  2020-03-03 15:53                                 ` David Howells
  1 sibling, 2 replies; 117+ messages in thread
From: Christian Brauner @ 2020-03-03 14:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Greg Kroah-Hartman, Karel Zak, David Howells, Ian Kent,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > >
> > > Totally untested, barely compiled patch below.
> >
> > Ok, that didn't even build, let me try this for real now...
> 
> Some comments on the interface:
> 
> O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.
> 
> Use the openat2 style arguments; limit the accepted flags to sane ones
> (e.g. don't let this syscall create a file).

If we think this is worth it, it might even be good to either have it support
struct open_how or have it accept two flag arguments. We sure want
openat2()'s RESOLVE_* flags in there.

Christian

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:13                                   ` Jann Horn
@ 2020-03-03 14:24                                     ` Greg Kroah-Hartman
  2020-03-03 15:44                                       ` Jens Axboe
  0 siblings, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 14:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml, Jens Axboe

On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > >
> > > > > > Totally untested, barely compiled patch below.
> > > > >
> > > > > Ok, that didn't even build, let me try this for real now...
> > > >
> > > > Some comments on the interface:
> > >
> > > Ok, hey, let's do this proper :)
> >
> > Alright, how about this patch.
> >
> > Actually tested with some simple sysfs files.
> >
> > If people don't strongly object, I'll add "real" tests to it, hook it up
> > to all arches, write a manpage, and all the fun fluff a new syscall
> > deserves and submit it "for real".
> 
> Just FYI, io_uring is moving towards the same kind of thing... IIRC
> you can already use it to batch a bunch of open() calls, then batch a
> bunch of read() calls on all the new fds and close them at the same
> time. And I think they're planning to add support for doing
> open()+read()+close() all in one go, too, except that it's a bit
> complicated because passing forward the file descriptor in a generic
> way is a bit complicated.

It is complicated.  I wouldn't recommend using io_uring for reading a
bunch of procfs or sysfs files; that feels like a ton of overkill with
too much setup/teardown to make it worthwhile.

But maybe not, will have to watch and see how it goes.

> > It feels like I'm doing something wrong in that the actuall syscall
> > logic is just so small.  Maybe I'll benchmark this thing to see if it
> > makes any real difference...
> >
> > thanks,
> >
> > greg k-h
> >
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Subject: [PATCH] readfile: implement readfile syscall
> >
> > It's a tiny syscall, meant to allow a user to do a single "open this
> > file, read into this buffer, and close the file" all in a single shot.
> >
> > Should be good for reading "tiny" files like sysfs, procfs, and other
> > "small" files.
> >
> > There is no restarting the syscall, am trying to keep it simple.  At
> > least for now.
> >
> > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> [...]
> > +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> > +               char __user *, buffer, size_t, bufsize, int, flags)
> > +{
> > +       int retval;
> > +       int fd;
> > +
> > +       /* Mask off all O_ flags as we only want to read from the file */
> > +       flags &= ~(VALID_OPEN_FLAGS);
> > +       flags |= O_RDONLY | O_LARGEFILE;
> > +
> > +       fd = do_sys_open(dfd, filename, flags, 0000);
> > +       if (fd <= 0)
> > +               return fd;
> > +
> > +       retval = ksys_read(fd, buffer, bufsize);
> > +
> > +       __close_fd(current->files, fd);
> > +
> > +       return retval;
> > +}
> 
> If you're gonna do something like that, wouldn't you want to also
> elide the use of the file descriptor table completely? do_sys_open()
> will have to do atomic operations in the fd table and stuff, which is
> probably moderately bad in terms of cacheline bouncing if this is used
> in a multithreaded context; and as a side effect, the fd would be
> inherited by anyone who calls fork() concurrently. You'll probably
> want to use APIs like do_filp_open() and filp_close(), or something
> like that, instead.

Ah, nice, that does make more sense.  I'll play around with that, and
benchmarking this thing later tonight.  Have to go get some stable
kernels out first...

thanks for the quick review, much appreciated.

greg k-h

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:10                                 ` Miklos Szeredi
@ 2020-03-03 14:29                                   ` Greg Kroah-Hartman
  2020-03-03 14:40                                     ` Jann Horn
  2020-03-03 14:40                                   ` David Howells
  2020-03-04  4:20                                   ` Ian Kent
  2 siblings, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 14:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Karel Zak, David Howells, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> 
> > > If buffer is too small to fit the whole file, return error.
> >
> > Why?  What's wrong with just returning the bytes asked for?  If someone
> > only wants 5 bytes from the front of a file, it should be fine to give
> > that to them, right?
> 
> I think we need to signal in some way to the caller that the result
> was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> caller might be surprised.

But that's not the way a "normal" read works.  Short reads are fine, if
the file isn't big enough.  That's how char device nodes work all the
time as well, and this kind of is like that, or some kind of "stream" to
read from.

If you think the file is bigger, then you, as the caller, can just pass
in a bigger buffer if you want to (i.e. you can stat the thing and
determine the size beforehand.)

Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
buffer.  That way userspace "knows" it will always read all of the data
it can from the file, we don't have to do any seeking or determining
real file size, or anything else like that.

We return the number of bytes read as well, so we "know" if we did a
short read.  You could also infer that, if the number of bytes read is
exactly the size of the buffer, the file is either that exact size or
bigger.

This should be "simple", let's not make it complex if we can help it :)

> > > Verify that the number of bytes read matches the file size, otherwise
> > > return error (may need to loop?).
> >
> > No, we can't "match file size" as sysfs files do not really have a sane
> > "size".  So I don't want to loop at all here, one-shot, that's all you
> > get :)
> 
> Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> until read return zero) might be a good idea regardless, because short
> reads are allowed.

If you want to loop, then do a userspace open/read-loop/close cycle.
That's not what this syscall should be for.

Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:10                                 ` Miklos Szeredi
  2020-03-03 14:29                                   ` Greg Kroah-Hartman
@ 2020-03-03 14:40                                   ` David Howells
  2020-03-04  4:20                                   ` Ian Kent
  2 siblings, 0 replies; 117+ messages in thread
From: David Howells @ 2020-03-03 14:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: dhowells, Miklos Szeredi, Karel Zak, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

Would you permit readfile() to access FIFOs, chardevs and blockdevs?
Certainly allowing use with blockdevs seems reasonable.

David


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:29                                   ` Greg Kroah-Hartman
@ 2020-03-03 14:40                                     ` Jann Horn
  2020-03-03 16:51                                       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 117+ messages in thread
From: Jann Horn @ 2020-03-03 14:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> >
> > > > If buffer is too small to fit the whole file, return error.
> > >
> > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > only wants 5 bytes from the front of a file, it should be fine to give
> > > that to them, right?
> >
> > I think we need to signal in some way to the caller that the result
> > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > caller might be surprised.
>
> But that's not the way a "normal" read works.  Short reads are fine, if
> the file isn't big enough.  That's how char device nodes work all the
> time as well, and this kind of is like that, or some kind of "stream" to
> read from.
>
> If you think the file is bigger, then you, as the caller, can just pass
> in a bigger buffer if you want to (i.e. you can stat the thing and
> determine the size beforehand.)
>
> Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> buffer.  That way userspace "knows" it will always read all of the data
> it can from the file, we don't have to do any seeking or determining
> real file size, or anything else like that.
>
> We return the number of bytes read as well, so we "know" if we did a
> short read, and also, you could imply, if the number of bytes read are
> the exact same as the number of bytes of the buffer, maybe the file is
> either that exact size, or bigger.
>
> This should be "simple", let's not make it complex if we can help it :)
>
> > > > Verify that the number of bytes read matches the file size, otherwise
> > > > return error (may need to loop?).
> > >
> > > No, we can't "match file size" as sysfs files do not really have a sane
> > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > get :)
> >
> > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > until read return zero) might be a good idea regardless, because short
> > reads are allowed.
>
> If you want to loop, then do a userspace open/read-loop/close cycle.
> That's not what this syscall should be for.
>
> Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)

So how is this supposed to work in e.g. the following case?

========================================
$ cat map_lots_and_read_maps.c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
  for (int i=0; i<1000; i++) {
    mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  }
  int maps = open("/proc/self/maps", O_RDONLY);
  static char buf[0x100000];
  int res;
  do {
    res = read(maps, buf, sizeof(buf));
  } while (res > 0);
}
$ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
$ strace -e trace='!mmap' ./map_lots_and_read_maps
execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"], 0x7ffebd297ac0 /* 51 vars */) = 0
brk(NULL)                               = 0x563a1184f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
mprotect(0x563a103e4000, 4096, PROT_READ) = 0
mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
munmap(0x7fb5c2eb7000, 208479)          = 0
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
read(3, "", 1048576)                    = 0
exit_group(0)                           = ?
+++ exited with 0 +++
$
========================================

The kernel is randomly returning short reads *with different lengths*
that are vaguely around PAGE_SIZE, no matter how big the buffer
supplied by userspace is. And while repeated read() calls will return
consistent state thanks to the seqfile magic, repeated readfile()
calls will probably return garbage with half-complete lines.
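
To make the concern concrete: a caller that only ever gets one chunk
cannot tell whether its last line is complete.  One defensive measure,
sketched below under that assumption, is to keep only the bytes up to
the last newline.  complete_lines_len() is a hypothetical helper; it
does not recover the missing data, it only avoids parsing a half-written
line.

```c
#include <stddef.h>

/*
 * Return how many leading bytes of buf[0..len) form complete,
 * newline-terminated lines.  A trailing partial line is excluded.
 */
static size_t complete_lines_len(const char *buf, size_t len)
{
	while (len > 0 && buf[len - 1] != '\n')
		len--;
	return len;
}
```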

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:23                               ` Christian Brauner
@ 2020-03-03 15:23                                 ` Greg Kroah-Hartman
  2020-03-03 15:53                                 ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 15:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 03:23:51PM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > 
> > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > >
> > > > Totally untested, barely compiled patch below.
> > >
> > > Ok, that didn't even build, let me try this for real now...
> > 
> > Some comments on the interface:
> > 
> > O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.
> > 
> > Use the openat2 style arguments; limit the accepted flags to sane ones
> > (e.g. don't let this syscall create a file).
> 
> If we think this is worth it, might even good to either have it support
> struct open_how or have it accept two flag arguments. We sure want
> openat2()s RESOLVE_* flags in there.

If you look at the patch I posted in this thread, I think it properly
supports open_how and RESOLVE_* flags.  But remember it's opening a file
that is already present, in RO mode, no creation allowed, so most of the
open_how interactions are limited.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:24                                     ` Greg Kroah-Hartman
@ 2020-03-03 15:44                                       ` Jens Axboe
  2020-03-03 16:37                                         ` Greg Kroah-Hartman
  2020-03-03 16:51                                         ` Jeff Layton
  0 siblings, 2 replies; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 15:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>>>
>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>
>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>
>>>>>>> Totally untested, barely compiled patch below.
>>>>>>
>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>
>>>>> Some comments on the interface:
>>>>
>>>> Ok, hey, let's do this proper :)
>>>
>>> Alright, how about this patch.
>>>
>>> Actually tested with some simple sysfs files.
>>>
>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>> deserves and submit it "for real".
>>
>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>> you can already use it to batch a bunch of open() calls, then batch a
>> bunch of read() calls on all the new fds and close them at the same
>> time. And I think they're planning to add support for doing
>> open()+read()+close() all in one go, too, except that it's a bit
>> complicated because passing forward the file descriptor in a generic
>> way is a bit complicated.
> 
> It is complicated, I wouldn't recommend using io_ring for reading a
> bunch of procfs or sysfs files, that feels like a ton of overkill with
> too much setup/teardown to make it worth while.
> 
> But maybe not, will have to watch and see how it goes.

It really isn't, and I too think it makes more sense than having a
system call just for the explicit purpose of open/read/close. As Jann
said, you can't currently do a linked sequence of open/read/close,
because the fd passing between them isn't done. But that will come in
the future. If the use case is "a bunch of files", then you could
trivially do "open bunch", "read bunch", "close bunch" in three separate
steps.

Curious what the use case is for this that warrants a special system
call?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:23                               ` Christian Brauner
  2020-03-03 15:23                                 ` Greg Kroah-Hartman
@ 2020-03-03 15:53                                 ` David Howells
  1 sibling, 0 replies; 117+ messages in thread
From: David Howells @ 2020-03-03 15:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: dhowells, Christian Brauner, Miklos Szeredi, Karel Zak, Ian Kent,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> If you look at the patch I posted in this thread, I think it properly
> supports open_how and RESOLVE_* flags.  But remember it's opening a file
> that is already present, in RO mode, no creation allowed, so most of the
> open_how interactions are limited.

Something we should consider adding to openat2() at some point is the ability
to lock on open/create.  Various network filesystems support it.

David


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 15:44                                       ` Jens Axboe
@ 2020-03-03 16:37                                         ` Greg Kroah-Hartman
  2020-03-03 16:51                                         ` Jeff Layton
  1 sibling, 0 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 16:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 08:44:43AM -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> >> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> >> <gregkh@linuxfoundation.org> wrote:
> >>>
> >>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> >>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> >>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> >>>>> <gregkh@linuxfoundation.org> wrote:
> >>>>>
> >>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
> >>>>>>>
> >>>>>>> Totally untested, barely compiled patch below.
> >>>>>>
> >>>>>> Ok, that didn't even build, let me try this for real now...
> >>>>>
> >>>>> Some comments on the interface:
> >>>>
> >>>> Ok, hey, let's do this proper :)
> >>>
> >>> Alright, how about this patch.
> >>>
> >>> Actually tested with some simple sysfs files.
> >>>
> >>> If people don't strongly object, I'll add "real" tests to it, hook it up
> >>> to all arches, write a manpage, and all the fun fluff a new syscall
> >>> deserves and submit it "for real".
> >>
> >> Just FYI, io_uring is moving towards the same kind of thing... IIRC
> >> you can already use it to batch a bunch of open() calls, then batch a
> >> bunch of read() calls on all the new fds and close them at the same
> >> time. And I think they're planning to add support for doing
> >> open()+read()+close() all in one go, too, except that it's a bit
> >> complicated because passing forward the file descriptor in a generic
> >> way is a bit complicated.
> > 
> > It is complicated, I wouldn't recommend using io_ring for reading a
> > bunch of procfs or sysfs files, that feels like a ton of overkill with
> > too much setup/teardown to make it worth while.
> > 
> > But maybe not, will have to watch and see how it goes.
> 
> It really isn't, and I too thinks it makes more sense than having a
> system call just for the explicit purpose of open/read/close. As Jann
> said, you can't currently do a linked sequence of open/read/close,
> because the fd passing between them isn't done. But that will come in
> the future. If the use case is "a bunch of files", then you could
> trivially do "open bunch", "read bunch", "close bunch" in three separate
> steps.
> 
> Curious what the use case is for this that warrants a special system
> call?

The one reported use case is all of the open/read/close cycles on sysfs
and procfs files: lots of utilities seem to do that all the time (top
and other monitoring tools).

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 15:44                                       ` Jens Axboe
  2020-03-03 16:37                                         ` Greg Kroah-Hartman
@ 2020-03-03 16:51                                         ` Jeff Layton
  2020-03-03 16:55                                           ` Jens Axboe
  1 sibling, 1 reply; 117+ messages in thread
From: Jeff Layton @ 2020-03-03 16:51 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> > > On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > 
> > > > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > > > > 
> > > > > > > > Totally untested, barely compiled patch below.
> > > > > > > 
> > > > > > > Ok, that didn't even build, let me try this for real now...
> > > > > > 
> > > > > > Some comments on the interface:
> > > > > 
> > > > > Ok, hey, let's do this proper :)
> > > > 
> > > > Alright, how about this patch.
> > > > 
> > > > Actually tested with some simple sysfs files.
> > > > 
> > > > If people don't strongly object, I'll add "real" tests to it, hook it up
> > > > to all arches, write a manpage, and all the fun fluff a new syscall
> > > > deserves and submit it "for real".
> > > 
> > > Just FYI, io_uring is moving towards the same kind of thing... IIRC
> > > you can already use it to batch a bunch of open() calls, then batch a
> > > bunch of read() calls on all the new fds and close them at the same
> > > time. And I think they're planning to add support for doing
> > > open()+read()+close() all in one go, too, except that it's a bit
> > > complicated because passing forward the file descriptor in a generic
> > > way is a bit complicated.
> > 
> > It is complicated, I wouldn't recommend using io_ring for reading a
> > bunch of procfs or sysfs files, that feels like a ton of overkill with
> > too much setup/teardown to make it worth while.
> > 
> > But maybe not, will have to watch and see how it goes.
> 
> It really isn't, and I too thinks it makes more sense than having a
> system call just for the explicit purpose of open/read/close. As Jann
> said, you can't currently do a linked sequence of open/read/close,
> because the fd passing between them isn't done. But that will come in
> the future. If the use case is "a bunch of files", then you could
> trivially do "open bunch", "read bunch", "close bunch" in three separate
> steps.
> 
> Curious what the use case is for this that warrants a special system
> call?
> 

Agreed. I'd really rather see something more general-purpose than the
proposed readfile(). At least with NFS and SMB, you can compound
together fairly arbitrary sorts of operations, and it'd be nice to be
able to pattern calls into the kernel for those sorts of uses.

So, NFSv4 has the concept of a current_stateid that is maintained by the
server. So basically you can do all this (e.g.) in a single compound:

open <some filehandle, get a stateid>
write <using that stateid>
close <same stateid>

It'd be nice to be able to do something similar with io_uring. Make it
so that when you do an open, you set the "current fd" inside the
kernel's context, and then be able to issue io_uring requests that
specify a magic "fd" value that uses it.

That would be a really useful pattern.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:40                                     ` Jann Horn
@ 2020-03-03 16:51                                       ` Greg Kroah-Hartman
  2020-03-03 16:57                                         ` Jann Horn
  2020-03-03 20:15                                         ` Greg Kroah-Hartman
  0 siblings, 2 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 16:51 UTC (permalink / raw)
  To: Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > >
> > > > > If buffer is too small to fit the whole file, return error.
> > > >
> > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > that to them, right?
> > >
> > > I think we need to signal in some way to the caller that the result
> > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > caller might be surprised.
> >
> > But that's not the way a "normal" read works.  Short reads are fine, if
> > the file isn't big enough.  That's how char device nodes work all the
> > time as well, and this kind of is like that, or some kind of "stream" to
> > read from.
> >
> > If you think the file is bigger, then you, as the caller, can just pass
> > in a bigger buffer if you want to (i.e. you can stat the thing and
> > determine the size beforehand.)
> >
> > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > buffer.  That way userspace "knows" it will always read all of the data
> > it can from the file, we don't have to do any seeking or determining
> > real file size, or anything else like that.
> >
> > We return the number of bytes read as well, so we "know" if we did a
> > short read, and also, you could imply, if the number of bytes read are
> > the exact same as the number of bytes of the buffer, maybe the file is
> > either that exact size, or bigger.
> >
> > This should be "simple", let's not make it complex if we can help it :)
> >
> > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > return error (may need to loop?).
> > > >
> > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > get :)
> > >
> > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > until read return zero) might be a good idea regardless, because short
> > > reads are allowed.
> >
> > If you want to loop, then do a userspace open/read-loop/close cycle.
> > That's not what this syscall should be for.
> >
> > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> 
> So how is this supposed to work in e.g. the following case?
> 
> ========================================
> $ cat map_lots_and_read_maps.c
> #include <sys/mman.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> int main(void) {
>   for (int i=0; i<1000; i++) {
>     mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   }
>   int maps = open("/proc/self/maps", O_RDONLY);
>   static char buf[0x100000];
>   int res;
>   do {
>     res = read(maps, buf, sizeof(buf));
>   } while (res > 0);
> }
> $ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
> $ strace -e trace='!mmap' ./map_lots_and_read_maps
> execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"],
> 0x7ffebd297ac0 /* 51 vars */) = 0
> brk(NULL)                               = 0x563a1184f000
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
> close(3)                                = 0
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
> mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
> close(3)                                = 0
> arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
> mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
> mprotect(0x563a103e4000, 4096, PROT_READ) = 0
> mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
> munmap(0x7fb5c2eb7000, 208479)          = 0
> openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
> read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
> read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
> read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
> read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
> read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
> read(3, "", 1048576)                    = 0
> exit_group(0)                           = ?
> +++ exited with 0 +++
> $
> ========================================
> 
> The kernel is randomly returning short reads *with different lengths*
> that are vaguely around PAGE_SIZE, no matter how big the buffer
> supplied by userspace is. And while repeated read() calls will return
> consistent state thanks to the seqfile magic, repeated readfile()
> calls will probably return garbage with half-complete lines.

Ah crap, I forgot about seqfile, I was only considering the "simple"
cases that sysfs provides.

Ok, Miklos, you were totally right, I'll loop and read until the end of
file or buffer, whichever comes first.
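
A minimal userspace sketch of that loop, for illustration only.  This is
not the patch itself; the names read_until_eof() and readfile_loop() are
made up here, and the real thing would live in the kernel:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling read() until EOF or until the
 * buffer is full, whichever comes first.  Returns the number of bytes
 * stored, or -1 with errno set if any read() fails.
 */
static ssize_t read_until_eof(int fd, char *buf, size_t bufsize)
{
	size_t total = 0;

	assert(buf != NULL);
	while (total < bufsize) {
		ssize_t n = read(fd, buf + total, bufsize - total);

		if (n < 0)
			return -1;	/* errno already set by read() */
		if (n == 0)		/* EOF */
			break;
		total += (size_t)n;
	}
	return (ssize_t)total;
}

/* Hypothetical open + read-loop + close wrapper around the helper above. */
static ssize_t readfile_loop(const char *path, char *buf, size_t bufsize)
{
	ssize_t ret;
	int saved_errno;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	ret = read_until_eof(fd, buf, bufsize);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;	/* do not let close() clobber the read error */
	return ret;
}
```

With a seqfile like /proc/self/maps this collects the ~PAGE_SIZE chunks
into one buffer instead of handing back only the first short read.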

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 16:51                                         ` Jeff Layton
@ 2020-03-03 16:55                                           ` Jens Axboe
  2020-03-03 19:02                                             ` Jeff Layton
  0 siblings, 1 reply; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 16:55 UTC (permalink / raw)
  To: Jeff Layton, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On 3/3/20 9:51 AM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
>> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
>>> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>>>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>>>> <gregkh@linuxfoundation.org> wrote:
>>>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>
>>>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>>>
>>>>>>>>> Totally untested, barely compiled patch below.
>>>>>>>>
>>>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>>>
>>>>>>> Some comments on the interface:
>>>>>>
>>>>>> Ok, hey, let's do this proper :)
>>>>>
>>>>> Alright, how about this patch.
>>>>>
>>>>> Actually tested with some simple sysfs files.
>>>>>
>>>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>>>> deserves and submit it "for real".
>>>>
>>>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>>>> you can already use it to batch a bunch of open() calls, then batch a
>>>> bunch of read() calls on all the new fds and close them at the same
>>>> time. And I think they're planning to add support for doing
>>>> open()+read()+close() all in one go, too, except that it's a bit
>>>> complicated because passing forward the file descriptor in a generic
>>>> way is a bit complicated.
>>>
>>> It is complicated, I wouldn't recommend using io_uring for reading a
>>> bunch of procfs or sysfs files, that feels like a ton of overkill with
>>> too much setup/teardown to make it worthwhile.
>>>
>>> But maybe not, will have to watch and see how it goes.
>>
>> It really isn't, and I too think it makes more sense than having a
>> system call just for the explicit purpose of open/read/close. As Jann
>> said, you can't currently do a linked sequence of open/read/close,
>> because the fd passing between them isn't done. But that will come in
>> the future. If the use case is "a bunch of files", then you could
>> trivially do "open bunch", "read bunch", "close bunch" in three separate
>> steps.
>>
>> Curious what the use case is for this that warrants a special system
>> call?
>>
> 
> Agreed. I'd really rather see something more general-purpose than the
> proposed readfile(). At least with NFS and SMB, you can compound
> together fairly arbitrary sorts of operations, and it'd be nice to be
> able to pattern calls into the kernel for those sorts of uses.
> 
> So, NFSv4 has the concept of a current_stateid that is maintained by the
> server. So basically you can do all this (e.g.) in a single compound:
> 
> open <some filehandle get a stateid>
> write <using that stateid>
> close <same stateid>
> 
> It'd be nice to be able to do something similar with io_uring. Make it
> so that when you do an open, you set the "current fd" inside the
> kernel's context, and then be able to issue io_uring requests that
> specify a magic "fd" value that use it.
> 
> That would be a really useful pattern.

For io_uring, you can link requests that you submit into a chain. Each
link in the chain is done in sequence. Which means that you could do:

<open some file><read from that file><close that file>

in a single sequence. The only thing that is missing right now is a way
to have the return of that open propagated to the 'fd' of the read and
close, and it's actually one of the topics to discuss at LSFMM next
month.

One approach would be to use BPF to handle this passing, another
suggestion has been to have the read/close specify some magic 'fd' value
that just means "inherit fd from result of previous". The latter sounds
very close to the stateid you mention above, and the upside here is that
it wouldn't explode the necessary toolchain to need to include BPF.

In other words, this is really close to being reality and practically
feasible.

-- 
Jens Axboe



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 16:51                                       ` Greg Kroah-Hartman
@ 2020-03-03 16:57                                         ` Jann Horn
  2020-03-03 20:15                                         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 117+ messages in thread
From: Jann Horn @ 2020-03-03 16:57 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Mar 3, 2020 at 5:51 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > >
> > > > > > If buffer is too small to fit the whole file, return error.
> > > > >
> > > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > > that to them, right?
> > > >
> > > > I think we need to signal in some way to the caller that the result
> > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > > caller might be surprised.
> > >
> > > But that's not the way a "normal" read works.  Short reads are fine, if
> > > the file isn't big enough.  That's how char device nodes work all the
> > > time as well, and this kind of is like that, or some kind of "stream" to
> > > read from.
> > >
> > > If you think the file is bigger, then you, as the caller, can just pass
> > > in a bigger buffer if you want to (i.e. you can stat the thing and
> > > determine the size beforehand.)
> > >
> > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > > buffer.  That way userspace "knows" it will always read all of the data
> > > it can from the file, we don't have to do any seeking or determining
> > > real file size, or anything else like that.
> > >
> > > We return the number of bytes read as well, so we "know" if we did a
> > > short read, and also, you could imply, if the number of bytes read are
> > > the exact same as the number of bytes of the buffer, maybe the file is
> > > either that exact size, or bigger.
> > >
> > > This should be "simple", let's not make it complex if we can help it :)
> > >
> > > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > > return error (may need to loop?).
> > > > >
> > > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > > get :)
> > > >
> > > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > > until read return zero) might be a good idea regardless, because short
> > > > reads are allowed.
> > >
> > > If you want to loop, then do a userspace open/read-loop/close cycle.
> > > That's not what this syscall should be for.
> > >
> > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> >
> > So how is this supposed to work in e.g. the following case?
[...]
> >   int maps = open("/proc/self/maps", O_RDONLY);
> >   static char buf[0x100000];
> >   int res;
> >   do {
> >     res = read(maps, buf, sizeof(buf));
> >   } while (res > 0);
> > }
[...]
> >
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is. And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
>
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
>
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, whichever comes first.

I wonder what we should do when one of the later reads returns an
error code. As in, we start the first read, get a short read (maybe
because a signal arrived), try a second read, get -EINTR. Do we just
return the error code? That'd probably work fine for most usecases -
e.g. if "top" is reading stuff from procfs, and that gets interrupted
by SIGWINCH or so, it doesn't matter that we've already started the
first read; the only thing "top" really needs to know is that the read
was a short read and it has to retry.
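
To make the "just return the error code" option concrete, here is a small
sketch.  Everything in it is hypothetical: the read_fn callback stands in
for the in-kernel read, and short_then_eintr() is a test double that
models "short read, then -EINTR on the second call":

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical reader: returns bytes read, 0 at EOF, or a negative errno. */
typedef ssize_t (*read_fn)(void *ctx, char *buf, size_t len);

/*
 * Loop until EOF or a full buffer.  If any read in the loop fails, the
 * partial data is discarded and the error code is returned as-is.
 */
static ssize_t read_all_or_error(read_fn rd, void *ctx, char *buf, size_t len)
{
	size_t total = 0;

	assert(rd != NULL);
	while (total < len) {
		ssize_t n = rd(ctx, buf + total, len - total);

		if (n < 0)
			return n;	/* e.g. -EINTR from a later read */
		if (n == 0)
			break;		/* EOF */
		total += (size_t)n;
	}
	return (ssize_t)total;
}

/* Test double: first call is a 4-byte short read, later calls fail. */
static ssize_t short_then_eintr(void *ctx, char *buf, size_t len)
{
	int *calls = ctx;

	if ((*calls)++ == 0 && len >= 4) {
		buf[0] = 'a'; buf[1] = 'b'; buf[2] = 'c'; buf[3] = 'd';
		return 4;
	}
	return -EINTR;
}
```

The caller only ever sees -EINTR here and never the 4 bytes from the first
read, which is the "top just retries" behaviour described above.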


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:19                                 ` David Howells
@ 2020-03-03 16:59                                   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 16:59 UTC (permalink / raw)
  To: David Howells
  Cc: Miklos Szeredi, Karel Zak, Ian Kent, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 02:19:58PM +0000, David Howells wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> 
> > +	fd = do_sys_open(dfd, filename, flags, 0000);
> > +	if (fd <= 0)
> > +		return fd;
> > +
> > +	retval = ksys_read(fd, buffer, bufsize);
> > +
> > +	__close_fd(current->files, fd);
> 
> If you can use dentry_open() and vfs_read(), you might be able to avoid
> dealing with file descriptors entirely.  That might make it worth a syscall.

Will poke at that...

> You're going to be asked for writefile() you know ;-)

Yup, that just got asked on this thread already :)

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 16:55                                           ` Jens Axboe
@ 2020-03-03 19:02                                             ` Jeff Layton
  2020-03-03 19:07                                               ` Jens Axboe
  2020-03-03 19:23                                               ` Jens Axboe
  0 siblings, 2 replies; 117+ messages in thread
From: Jeff Layton @ 2020-03-03 19:02 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
> On 3/3/20 9:51 AM, Jeff Layton wrote:
> > On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
> > > On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> > > > On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
> > > > > On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
> > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
> > > > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > > > > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > > > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > > > 
> > > > > > > > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > > > > > > > > 
> > > > > > > > > > Totally untested, barely compiled patch below.
> > > > > > > > > 
> > > > > > > > > Ok, that didn't even build, let me try this for real now...
> > > > > > > > 
> > > > > > > > Some comments on the interface:
> > > > > > > 
> > > > > > > Ok, hey, let's do this proper :)
> > > > > > 
> > > > > > Alright, how about this patch.
> > > > > > 
> > > > > > Actually tested with some simple sysfs files.
> > > > > > 
> > > > > > If people don't strongly object, I'll add "real" tests to it, hook it up
> > > > > > to all arches, write a manpage, and all the fun fluff a new syscall
> > > > > > deserves and submit it "for real".
> > > > > 
> > > > > Just FYI, io_uring is moving towards the same kind of thing... IIRC
> > > > > you can already use it to batch a bunch of open() calls, then batch a
> > > > > bunch of read() calls on all the new fds and close them at the same
> > > > > time. And I think they're planning to add support for doing
> > > > > open()+read()+close() all in one go, too, except that it's a bit
> > > > > complicated because passing forward the file descriptor in a generic
> > > > > way is a bit complicated.
> > > > 
> > > > It is complicated, I wouldn't recommend using io_uring for reading a
> > > > bunch of procfs or sysfs files, that feels like a ton of overkill with
> > > > too much setup/teardown to make it worthwhile.
> > > > 
> > > > But maybe not, will have to watch and see how it goes.
> > > 
> > > It really isn't, and I too think it makes more sense than having a
> > > system call just for the explicit purpose of open/read/close. As Jann
> > > said, you can't currently do a linked sequence of open/read/close,
> > > because the fd passing between them isn't done. But that will come in
> > > the future. If the use case is "a bunch of files", then you could
> > > trivially do "open bunch", "read bunch", "close bunch" in three separate
> > > steps.
> > > 
> > > Curious what the use case is for this that warrants a special system
> > > call?
> > > 
> > 
> > Agreed. I'd really rather see something more general-purpose than the
> > proposed readfile(). At least with NFS and SMB, you can compound
> > together fairly arbitrary sorts of operations, and it'd be nice to be
> > able to pattern calls into the kernel for those sorts of uses.
> > 
> > So, NFSv4 has the concept of a current_stateid that is maintained by the
> > server. So basically you can do all this (e.g.) in a single compound:
> > 
> > open <some filehandle get a stateid>
> > write <using that stateid>
> > close <same stateid>
> > 
> > It'd be nice to be able to do something similar with io_uring. Make it
> > so that when you do an open, you set the "current fd" inside the
> > kernel's context, and then be able to issue io_uring requests that
> > specify a magic "fd" value that use it.
> > 
> > That would be a really useful pattern.
> 
> For io_uring, you can link requests that you submit into a chain. Each
> link in the chain is done in sequence. Which means that you could do:
> 
> <open some file><read from that file><close that file>
> 
> in a single sequence. The only thing that is missing right now is a way
> to have the return of that open propagated to the 'fd' of the read and
> close, and it's actually one of the topics to discuss at LSFMM next
> month.
> 
> One approach would be to use BPF to handle this passing, another
> suggestion has been to have the read/close specify some magic 'fd' value
> that just means "inherit fd from result of previous". The latter sounds
> very close to the stateid you mention above, and the upside here is that
> it wouldn't explode the necessary toolchain to need to include BPF.
> 
> In other words, this is really close to being reality and practically
> feasible.
> 

Excellent.

Yes, the latter is exactly what I had in mind for this. I suspect that
that would cover a large fraction of the potential use-cases for this.

Basically, all you'd need to do is keep a pointer to struct file in the
internal state for the chain. Then, allow userland to specify some magic
fd value for subsequent chained operations that says to use that instead
of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
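
A rough userspace model of that idea, purely as a sketch.  MAX_ERRNO is
4095 as in the kernel's include/linux/err.h; CHAIN_FD, struct chain_op and
resolve_chain() are invented for illustration, and plain ints stand in for
the struct file pointer the chain would really hold:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_ERRNO 4095
#define CHAIN_FD  (-MAX_ERRNO - 1)	/* hypothetical sentinel, i.e. -4096 */

enum op_type { OP_OPEN, OP_READ, OP_CLOSE };

struct chain_op {
	enum op_type type;
	int fd;	/* OP_OPEN: simulated result of the open; else the fd argument */
};

/*
 * Walk one chain in order and replace every CHAIN_FD placeholder with
 * the result of the most recent open in that chain.  Returns 0, the
 * open's error if an open failed, or -EBADF if the placeholder is used
 * while the chain holds no open file.
 */
static int resolve_chain(struct chain_op *ops, size_t n)
{
	int current = -1;	/* per-chain state: nothing opened yet */

	assert(ops != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ops[i].type == OP_OPEN) {
			if (ops[i].fd < 0)
				return ops[i].fd;	/* failed open ends the chain */
			current = ops[i].fd;
			continue;
		}
		if (ops[i].fd == CHAIN_FD) {
			if (current < 0)
				return -EBADF;
			ops[i].fd = current;
		}
		if (ops[i].type == OP_CLOSE && ops[i].fd == current)
			current = -1;	/* chain no longer holds a file */
	}
	return 0;
}
```

Since -4096 sits just below the -1..-MAX_ERRNO range, the sentinel can't
be mistaken for an errno-style value or for a real descriptor.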

That would cover the smb or nfs server sort of use cases, I think. For
the sysfs cases, I guess you'd need to dispatch several chains, but that
doesn't sound _too_ onerous.

In fact, with that you should even be able to emulate the proposed
readfile syscall in a userland library.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 19:02                                             ` Jeff Layton
@ 2020-03-03 19:07                                               ` Jens Axboe
  2020-03-03 19:23                                               ` Jens Axboe
  1 sibling, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 19:07 UTC (permalink / raw)
  To: Jeff Layton, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On 3/3/20 12:02 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
>> On 3/3/20 9:51 AM, Jeff Layton wrote:
>>> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
>>>> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>>>>>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>>>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>
>>>>>>>>>>> Unlimited beers for a 21-line kernel patch?  Sign me up!
>>>>>>>>>>>
>>>>>>>>>>> Totally untested, barely compiled patch below.
>>>>>>>>>>
>>>>>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>>>>>
>>>>>>>>> Some comments on the interface:
>>>>>>>>
>>>>>>>> Ok, hey, let's do this proper :)
>>>>>>>
>>>>>>> Alright, how about this patch.
>>>>>>>
>>>>>>> Actually tested with some simple sysfs files.
>>>>>>>
>>>>>>> If people don't strongly object, I'll add "real" tests to it, hook it up
>>>>>>> to all arches, write a manpage, and all the fun fluff a new syscall
>>>>>>> deserves and submit it "for real".
>>>>>>
>>>>>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>>>>>> you can already use it to batch a bunch of open() calls, then batch a
>>>>>> bunch of read() calls on all the new fds and close them at the same
>>>>>> time. And I think they're planning to add support for doing
>>>>>> open()+read()+close() all in one go, too, except that it's a bit
>>>>>> complicated because passing forward the file descriptor in a generic
>>>>>> way is a bit complicated.
>>>>>
>>>>> It is complicated, I wouldn't recommend using io_uring for reading a
>>>>> bunch of procfs or sysfs files, that feels like a ton of overkill with
>>>>> too much setup/teardown to make it worthwhile.
>>>>>
>>>>> But maybe not, will have to watch and see how it goes.
>>>>
>>>> It really isn't, and I too think it makes more sense than having a
>>>> system call just for the explicit purpose of open/read/close. As Jann
>>>> said, you can't currently do a linked sequence of open/read/close,
>>>> because the fd passing between them isn't done. But that will come in
>>>> the future. If the use case is "a bunch of files", then you could
>>>> trivially do "open bunch", "read bunch", "close bunch" in three separate
>>>> steps.
>>>>
>>>> Curious what the use case is for this that warrants a special system
>>>> call?
>>>>
>>>
>>> Agreed. I'd really rather see something more general-purpose than the
>>> proposed readfile(). At least with NFS and SMB, you can compound
>>> together fairly arbitrary sorts of operations, and it'd be nice to be
>>> able to pattern calls into the kernel for those sorts of uses.
>>>
>>> So, NFSv4 has the concept of a current_stateid that is maintained by the
>>> server. So basically you can do all this (e.g.) in a single compound:
>>>
>>> open <some filehandle get a stateid>
>>> write <using that stateid>
>>> close <same stateid>
>>>
>>> It'd be nice to be able to do something similar with io_uring. Make it
>>> so that when you do an open, you set the "current fd" inside the
>>> kernel's context, and then be able to issue io_uring requests that
>>> specify a magic "fd" value that use it.
>>>
>>> That would be a really useful pattern.
>>
>> For io_uring, you can link requests that you submit into a chain. Each
>> link in the chain is done in sequence. Which means that you could do:
>>
>> <open some file><read from that file><close that file>
>>
>> in a single sequence. The only thing that is missing right now is a way
>> to have the return of that open propagated to the 'fd' of the read and
>> close, and it's actually one of the topics to discuss at LSFMM next
>> month.
>>
>> One approach would be to use BPF to handle this passing, another
>> suggestion has been to have the read/close specify some magic 'fd' value
>> that just means "inherit fd from result of previous". The latter sounds
>> very close to the stateid you mention above, and the upside here is that
>> it wouldn't explode the necessary toolchain to need to include BPF.
>>
>> In other words, this is really close to being reality and practically
>> feasible.
>>
> 
> Excellent.
> 
> Yes, the latter is exactly what I had in mind for this. I suspect that
> that would cover a large fraction of the potential use-cases for this.
> 
> Basically, all you'd need to do is keep a pointer to struct file in the
> internal state for the chain. Then, allow userland to specify some magic
> fd value for subsequent chained operations that says to use that instead
> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

Yeah I think that'd be a suitable way to signal that.

> That would cover the smb or nfs server sort of use cases, I think. For
> the sysfs cases, I guess you'd need to dispatch several chains, but that
> doesn't sound _too_ onerous.

The magic fd would be per-chain, so doing multiple chains wouldn't
really matter at all.

Let me try and hack this up, should be pretty trivial.

> In fact, with that you should even be able to emulate the proposed
> readfile syscall in a userland library.

Exactly

-- 
Jens Axboe



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 19:02                                             ` Jeff Layton
  2020-03-03 19:07                                               ` Jens Axboe
@ 2020-03-03 19:23                                               ` Jens Axboe
  2020-03-03 19:43                                                 ` Jeff Layton
  1 sibling, 1 reply; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 19:23 UTC (permalink / raw)
  To: Jeff Layton, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On 3/3/20 12:02 PM, Jeff Layton wrote:
> Basically, all you'd need to do is keep a pointer to struct file in the
> internal state for the chain. Then, allow userland to specify some magic
> fd value for subsequent chained operations that says to use that instead
> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

BTW, I think we need two magics here. One that says "result from
previous is fd for next", and one that says "fd from previous is fd for
next". The former allows inheritance from open -> read, the latter from
read -> write.
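
Roughly what I mean, as a toy model only.  None of these names or values
exist anywhere; FD_FROM_PREV_RESULT and FD_FROM_PREV_FD are placeholders
picked for the sketch, and "result" is the simulated completion value of
each link:

```c
#include <assert.h>
#include <stddef.h>

#define FD_FROM_PREV_RESULT (-4096)	/* hypothetical: use prev link's result */
#define FD_FROM_PREV_FD     (-4097)	/* hypothetical: use prev link's fd */

struct link_op {
	int fd;		/* fd argument, possibly one of the placeholders */
	int result;	/* simulated completion result of this link */
};

/*
 * Resolve placeholders in submission order.  Each link only looks at
 * the link immediately before it, so the first link cannot use one.
 */
static void resolve_links(struct link_op *ops, size_t n)
{
	assert(ops != NULL || n == 0);
	for (size_t i = 1; i < n; i++) {
		if (ops[i].fd == FD_FROM_PREV_RESULT)
			ops[i].fd = ops[i - 1].result;	/* open -> read */
		else if (ops[i].fd == FD_FROM_PREV_FD)
			ops[i].fd = ops[i - 1].fd;	/* read -> write */
	}
}
```

So for open (result 5), read, write: the read resolves to fd 5 via the
first magic, and the write picks up the same fd 5 via the second one,
rather than the read's byte count.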

-- 
Jens Axboe



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 19:23                                               ` Jens Axboe
@ 2020-03-03 19:43                                                 ` Jeff Layton
  2020-03-03 20:33                                                   ` Jens Axboe
  0 siblings, 1 reply; 117+ messages in thread
From: Jeff Layton @ 2020-03-03 19:43 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
> On 3/3/20 12:02 PM, Jeff Layton wrote:
> > Basically, all you'd need to do is keep a pointer to struct file in the
> > internal state for the chain. Then, allow userland to specify some magic
> > fd value for subsequent chained operations that says to use that instead
> > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
> 
> BTW, I think we need two magics here. One that says "result from
> previous is fd for next", and one that says "fd from previous is fd for
> next". The former allows inheritance from open -> read, the latter from
> read -> write.
> 

Do we? I suspect that in almost all of the cases, all we'd care about is
the last open. Also if you have unrelated operations in there you still
have to chain the fd through somehow to the next op which is a bit hard
to do with that scheme.

I'd just have a single magic carveout that means "use the result of the
last open call done in this chain". If you do a second open (or pipe,
or...), then that would put (release) the old struct file pointer and
drop a new one in there.

If we really do want to enable multiple opens in a single chain though,
then we might want to rethink this and consider some sort of slot table
for storing open fds.

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 16:51                                       ` Greg Kroah-Hartman
  2020-03-03 16:57                                         ` Jann Horn
@ 2020-03-03 20:15                                         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-03 20:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, Mar 03, 2020 at 05:51:03PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > >
> > > > > > If buffer is too small to fit the whole file, return error.
> > > > >
> > > > > Why?  What's wrong with just returning the bytes asked for?  If someone
> > > > > only wants 5 bytes from the front of a file, it should be fine to give
> > > > > that to them, right?
> > > >
> > > > I think we need to signal in some way to the caller that the result
> > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the
> > > > caller might be surprised.
> > >
> > > But that's not the way a "normal" read works.  Short reads are fine, if
> > > the file isn't big enough.  That's how char device nodes work all the
> > > time as well, and this kind of is like that, or some kind of "stream" to
> > > read from.
> > >
> > > If you think the file is bigger, then you, as the caller, can just pass
> > > in a bigger buffer if you want to (i.e. you can stat the thing and
> > > determine the size beforehand.)
> > >
> > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE
> > > buffer.  That way userspace "knows" it will always read all of the data
> > > it can from the file, we don't have to do any seeking or determining
> > > real file size, or anything else like that.
> > >
> > > We return the number of bytes read as well, so we "know" if we did a
> > > short read, and also, you could imply, if the number of bytes read are
> > > the exact same as the number of bytes of the buffer, maybe the file is
> > > either that exact size, or bigger.
> > >
> > > This should be "simple", let's not make it complex if we can help it :)
> > >
> > > > > > Verify that the number of bytes read matches the file size, otherwise
> > > > > > return error (may need to loop?).
> > > > >
> > > > > No, we can't "match file size" as sysfs files do not really have a sane
> > > > > "size".  So I don't want to loop at all here, one-shot, that's all you
> > > > > get :)
> > > >
> > > > Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> > > > > until read returns zero) might be a good idea regardless, because short
> > > > reads are allowed.
> > >
> > > If you want to loop, then do a userspace open/read-loop/close cycle.
> > > That's not what this syscall should be for.
> > >
> > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()?  :)
> > 
> > So how is this supposed to work in e.g. the following case?
> > 
> > ========================================
> > $ cat map_lots_and_read_maps.c
> > #include <sys/mman.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> > 
> > int main(void) {
> >   for (int i=0; i<1000; i++) {
> >     mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
> > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >   }
> >   int maps = open("/proc/self/maps", O_RDONLY);
> >   static char buf[0x100000];
> >   int res;
> >   do {
> >     res = read(maps, buf, sizeof(buf));
> >   } while (res > 0);
> > }
> > $ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
> > $ strace -e trace='!mmap' ./map_lots_and_read_maps
> > execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"],
> > 0x7ffebd297ac0 /* 51 vars */) = 0
> > brk(NULL)                               = 0x563a1184f000
> > access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
> > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> > fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
> > close(3)                                = 0
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> > read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"...,
> > 832) = 832
> > fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
> > mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
> > close(3)                                = 0
> > arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
> > mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
> > mprotect(0x563a103e4000, 4096, PROT_READ) = 0
> > mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
> > munmap(0x7fb5c2eb7000, 208479)          = 0
> > openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
> > read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
> > read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
> > read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
> > read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
> > read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
> > read(3, "", 1048576)                    = 0
> > exit_group(0)                           = ?
> > +++ exited with 0 +++
> > $
> > ========================================
> > 
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is. And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
> 
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
> 
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, whichever comes first.

Hm, nope, this works just fine with the single "read" call.  I can read
/proc/self/maps with a single buffer, also larger files like
/sys/kernel/debug/usb/devices work just fine.

So maybe it is all sane without a loop.

I'll try to get rid of the fd now, and despite the interest in io_uring,
this might be a lot more "simple" overall.

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 19:43                                                 ` Jeff Layton
@ 2020-03-03 20:33                                                   ` Jens Axboe
  2020-03-03 21:03                                                     ` Jeff Layton
  0 siblings, 1 reply; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 20:33 UTC (permalink / raw)
  To: Jeff Layton, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On 3/3/20 12:43 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
>> On 3/3/20 12:02 PM, Jeff Layton wrote:
>>> Basically, all you'd need to do is keep a pointer to struct file in the
>>> internal state for the chain. Then, allow userland to specify some magic
>>> fd value for subsequent chained operations that says to use that instead
>>> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
>>
>> BTW, I think we need two magics here. One that says "result from
>> previous is fd for next", and one that says "fd from previous is fd for
>> next". The former allows inheritance from open -> read, the latter from
>> read -> write.
>>
> 
> Do we? I suspect that in almost all of the cases, all we'd care about is
> the last open. Also if you have unrelated operations in there you still
> have to chain the fd through somehow to the next op which is a bit hard
> to do with that scheme.
> 
> I'd just have a single magic carveout that means "use the result of last
> open call done in this chain". If you do a second open (or pipe, or...),
> then that would put the old struct file pointer and drop a new one in
> there.
> 
> If we really do want to enable multiple opens in a single chain though,
> then we might want to rethink this and consider some sort of slot table
> for storing open fds.

I think the one magic can work, you just have to define your chain
appropriately for the case where you have multiple opens. That's true
for the two magic approach as well, of course, I don't want a stack of
open fds, just "last open" should suffice.

I don't like the implicit close, if your op opens an fd, something
should close it again. You pass it back to the application in any case
for io_uring, so the app can just close it. Which means that your chain
should just include a close for whatever fd you open, unless you plan on
using it in the application afterwards.

-- 
Jens Axboe



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 20:33                                                   ` Jens Axboe
@ 2020-03-03 21:03                                                     ` Jeff Layton
  2020-03-03 21:20                                                       ` Jens Axboe
  0 siblings, 1 reply; 117+ messages in thread
From: Jeff Layton @ 2020-03-03 21:03 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote:
> On 3/3/20 12:43 PM, Jeff Layton wrote:
> > On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
> > > On 3/3/20 12:02 PM, Jeff Layton wrote:
> > > > Basically, all you'd need to do is keep a pointer to struct file in the
> > > > internal state for the chain. Then, allow userland to specify some magic
> > > > fd value for subsequent chained operations that says to use that instead
> > > > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
> > > 
> > > BTW, I think we need two magics here. One that says "result from
> > > previous is fd for next", and one that says "fd from previous is fd for
> > > next". The former allows inheritance from open -> read, the latter from
> > > read -> write.
> > > 
> > 
> > Do we? I suspect that in almost all of the cases, all we'd care about is
> > the last open. Also if you have unrelated operations in there you still
> > have to chain the fd through somehow to the next op which is a bit hard
> > to do with that scheme.
> > 
> > I'd just have a single magic carveout that means "use the result of last
> > open call done in this chain". If you do a second open (or pipe, or...),
> > then that would put the old struct file pointer and drop a new one in
> > there.
> > 
> > If we really do want to enable multiple opens in a single chain though,
> > then we might want to rethink this and consider some sort of slot table
> > for storing open fds.
> 
> I think the one magic can work, you just have to define your chain
> appropriately for the case where you have multiple opens. That's true
> for the two magic approach as well, of course, I don't want a stack of
> open fds, just "last open" should suffice.
> 

Yep.

> I don't like the implicit close, if your op opens an fd, something
> should close it again. You pass it back to the application in any case
> for io_uring, so the app can just close it. Which means that your chain
> should just include a close for whatever fd you open, unless you plan on
> using it in the application afterwards.
> 

Yeah sorry, I didn't word that correctly. Let me try again:

My thinking was that you would still return the result of the open to
userland, but also stash a struct file pointer in the internal chain
representation. Then you just refer to that when you get the "magic" fd.

You'd still need to explicitly close the file though if you didn't want
to use it past the end of the current chain. So, I guess you _do_ need
the actual fd to properly close the file in that case.

On another note, what happens if you do open+write+close and the write
fails? Does the close still happen, or would you have to issue one
separately after getting the result?

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 21:03                                                     ` Jeff Layton
@ 2020-03-03 21:20                                                       ` Jens Axboe
  0 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2020-03-03 21:20 UTC (permalink / raw)
  To: Jeff Layton, Greg Kroah-Hartman, Jann Horn
  Cc: Miklos Szeredi, Karel Zak, David Howells, Ian Kent,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Darrick J. Wong,
	Linux API, linux-fsdevel, lkml

[-- Attachment #1: Type: text/plain, Size: 11580 bytes --]

On 3/3/20 2:03 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote:
>> On 3/3/20 12:43 PM, Jeff Layton wrote:
>>> On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote:
>>>> On 3/3/20 12:02 PM, Jeff Layton wrote:
>>>>> Basically, all you'd need to do is keep a pointer to struct file in the
>>>>> internal state for the chain. Then, allow userland to specify some magic
>>>>> fd value for subsequent chained operations that says to use that instead
>>>>> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?
>>>>
>>>> BTW, I think we need two magics here. One that says "result from
>>>> previous is fd for next", and one that says "fd from previous is fd for
>>>> next". The former allows inheritance from open -> read, the latter from
>>>> read -> write.
>>>>
>>>
>>> Do we? I suspect that in almost all of the cases, all we'd care about is
>>> the last open. Also if you have unrelated operations in there you still
>>> have to chain the fd through somehow to the next op which is a bit hard
>>> to do with that scheme.
>>>
>>> I'd just have a single magic carveout that means "use the result of last
>>> open call done in this chain". If you do a second open (or pipe, or...),
>>> then that would put the old struct file pointer and drop a new one in
>>> there.
>>>
>>> If we really do want to enable multiple opens in a single chain though,
>>> then we might want to rethink this and consider some sort of slot table
>>> for storing open fds.
>>
>> I think the one magic can work, you just have to define your chain
>> appropriately for the case where you have multiple opens. That's true
>> for the two magic approach as well, of course, I don't want a stack of
>> open fds, just "last open" should suffice.
>>
> 
> Yep.
> 
>> I don't like the implicit close, if your op opens an fd, something
>> should close it again. You pass it back to the application in any case
>> for io_uring, so the app can just close it. Which means that your chain
>> should just include a close for whatever fd you open, unless you plan on
>> using it in the application afterwards.
>>
> 
> Yeah sorry, I didn't word that correctly. Let me try again:
> 
> My thinking was that you would still return the result of the open to
> userland, but also stash a struct file pointer in the internal chain
> representation. Then you just refer to that when you get the "magic" fd.
> 
> You'd still need to explicitly close the file though if you didn't want
> to use it past the end of the current chain. So, I guess you _do_ need
> the actual fd to properly close the file in that case.

Right, I'm caching both the fd and the file, we'll need both. See below
for a quick hack. It needs a few prep patches: we always prepare the file
upfront and the prep handlers expect it to be there, so we'd have to do
things a bit differently for that. Also attached is a small test app that
does open+read+close.

> On another note, what happens if you do open+write+close and the write
> fails? Does the close still happen, or would you have to issue one
> separately after getting the result?

For any io_uring chain, if any link in the chain fails, then the rest of
the chain is errored. So if your open fails, you'd get -ECANCELED for
your read+close. If your read fails, just the close is errored. So yes,
you'd have to close the fd again if the chain doesn't fully execute.
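
On the application side, that rule could be handled roughly as sketched
here once the completions of an open+read+close chain have been reaped.
The helper name and its calling convention are made up for illustration;
only the -ECANCELED behaviour comes from the description above.

```c
#include <errno.h>

/*
 * Hypothetical helper, not part of liburing or of the patch in this mail.
 *
 * Given the cqe->res values of the open and the close in a linked
 * open -> read -> close chain, return the fd that the application still
 * has to close() by hand, or -1 if there is nothing left to clean up.
 */
static int orc_fd_to_close(int open_res, int close_res)
{
	if (open_res < 0)
		return -1;		/* nothing was opened */
	if (close_res == -ECANCELED)
		return open_res;	/* chain broke before the close ran */
	return -1;			/* the chained close was executed */
}
```

So if the read in the middle fails, the close completes with -ECANCELED
and the caller closes the fd returned by the open itself.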


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e2065614ace..bbaea6b3e16a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -488,6 +488,10 @@ enum {
 	REQ_F_NEED_CLEANUP_BIT,
 	REQ_F_OVERFLOW_BIT,
 	REQ_F_POLLED_BIT,
+	REQ_F_OPEN_FD_BIT,
+
+	/* not a real bit, just to check we're not overflowing the space */
+	__REQ_F_LAST_BIT,
 };
 
 enum {
@@ -532,6 +536,8 @@ enum {
 	REQ_F_OVERFLOW		= BIT(REQ_F_OVERFLOW_BIT),
 	/* already went through poll handler */
 	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+	/* use chain previous open fd */
+	REQ_F_OPEN_FD		= BIT(REQ_F_OPEN_FD_BIT),
 };
 
 struct async_poll {
@@ -593,6 +599,8 @@ struct io_kiocb {
 			struct callback_head	task_work;
 			struct hlist_node	hash_node;
 			struct async_poll	*apoll;
+			struct file		*last_open_file;
+			int			last_open_fd;
 		};
 		struct io_wq_work	work;
 	};
@@ -1292,7 +1300,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file,
 {
 	if (fixed)
 		percpu_ref_put(&req->ctx->file_data->refs);
-	else
+	else if (!(req->flags & REQ_F_OPEN_FD))
 		fput(file);
 }
 
@@ -1435,6 +1443,12 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
 		list_del_init(&req->link_list);
 		if (!list_empty(&nxt->link_list))
 			nxt->flags |= REQ_F_LINK;
+		if (nxt->flags & REQ_F_OPEN_FD) {
+			WARN_ON_ONCE(nxt->file);
+			nxt->last_open_file = req->last_open_file;
+			nxt->last_open_fd = req->last_open_fd;
+			nxt->file = req->last_open_file;
+		}
 		*nxtptr = nxt;
 		break;
 	}
@@ -1957,37 +1971,20 @@ static bool io_file_supports_async(struct file *file)
 	return false;
 }
 
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      bool force_nonblock)
+static int __io_prep_rw(struct io_kiocb *req, bool force_nonblock)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	unsigned ioprio;
-	int ret;
 
 	if (S_ISREG(file_inode(req->file)->i_mode))
 		req->flags |= REQ_F_ISREG;
 
-	kiocb->ki_pos = READ_ONCE(sqe->off);
 	if (kiocb->ki_pos == -1 && !(req->file->f_mode & FMODE_STREAM)) {
 		req->flags |= REQ_F_CUR_POS;
 		kiocb->ki_pos = req->file->f_pos;
 	}
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
-	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
-	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
-	if (unlikely(ret))
-		return ret;
-
-	ioprio = READ_ONCE(sqe->ioprio);
-	if (ioprio) {
-		ret = ioprio_check_cap(ioprio);
-		if (ret)
-			return ret;
-
-		kiocb->ki_ioprio = ioprio;
-	} else
-		kiocb->ki_ioprio = get_current_ioprio();
+	kiocb->ki_flags |= iocb_flags(kiocb->ki_filp);
 
 	/* don't allow async punt if RWF_NOWAIT was requested */
 	if ((kiocb->ki_flags & IOCB_NOWAIT) ||
@@ -2011,6 +2008,31 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	return 0;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct kiocb *kiocb = &req->rw.kiocb;
+	unsigned ioprio;
+	int ret;
+
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		return ret;
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			return ret;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
 	req->rw.addr = READ_ONCE(sqe->addr);
 	req->rw.len = READ_ONCE(sqe->len);
 	/* we own ->private, reuse it for the buffer index */
@@ -2273,13 +2295,10 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct iov_iter iter;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe);
 	if (ret)
 		return ret;
 
-	if (unlikely(!(req->file->f_mode & FMODE_READ)))
-		return -EBADF;
-
 	/* either don't need iovec imported or already have it */
 	if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
 		return 0;
@@ -2304,6 +2323,13 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	size_t iov_count;
 	ssize_t io_size, ret;
 
+	ret = __io_prep_rw(req, force_nonblock);
+	if (ret)
+		return ret;
+
+	if (unlikely(!(req->file->f_mode & FMODE_READ)))
+		return -EBADF;
+
 	ret = io_import_iovec(READ, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
@@ -2362,13 +2388,10 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct iov_iter iter;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe);
 	if (ret)
 		return ret;
 
-	if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
-		return -EBADF;
-
 	/* either don't need iovec imported or already have it */
 	if (!req->io || req->flags & REQ_F_NEED_CLEANUP)
 		return 0;
@@ -2393,6 +2416,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
 	size_t iov_count;
 	ssize_t ret, io_size;
 
+	ret = __io_prep_rw(req, force_nonblock);
+	if (ret)
+		return ret;
+
+	if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
+		return -EBADF;
+
 	ret = io_import_iovec(WRITE, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
@@ -2737,8 +2767,8 @@ static int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 {
+	struct file *file = NULL;
 	struct open_flags op;
-	struct file *file;
 	int ret;
 
 	if (force_nonblock)
@@ -2763,8 +2793,12 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 err:
 	putname(req->open.filename);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	if (ret < 0)
+	if (ret < 0) {
 		req_set_fail_links(req);
+	} else if (req->flags & REQ_F_LINK) {
+		req->last_open_file = file;
+		req->last_open_fd = ret;
+	}
 	io_cqring_add_event(req, ret);
 	io_put_req(req);
 	return 0;
@@ -2980,10 +3014,6 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;
 
 	req->close.fd = READ_ONCE(sqe->fd);
-	if (req->file->f_op == &io_uring_fops ||
-	    req->close.fd == req->ctx->ring_fd)
-		return -EBADF;
-
 	return 0;
 }
 
@@ -3013,6 +3043,18 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
 {
 	int ret;
 
+	if (req->flags & REQ_F_OPEN_FD) {
+		if (req->close.fd != IOSQE_FD_LAST_OPEN)
+			return -EBADF;
+		req->close.fd = req->last_open_fd;
+		req->last_open_file = NULL;
+		req->last_open_fd = -1;
+	}
+
+	if (req->file->f_op == &io_uring_fops ||
+	    req->close.fd == req->ctx->ring_fd)
+		return -EBADF;
+
 	req->close.put_file = NULL;
 	ret = __close_fd_get_file(req->close.fd, &req->close.put_file);
 	if (ret < 0)
@@ -3437,8 +3479,14 @@ static int __io_accept(struct io_kiocb *req, bool force_nonblock)
 		return -EAGAIN;
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
-	if (ret < 0)
+	if (ret < 0) {
 		req_set_fail_links(req);
+	} else if (req->flags & REQ_F_LINK) {
+		rcu_read_lock();
+		req->last_open_file = fcheck_files(current->files, ret);
+		rcu_read_unlock();
+		req->last_open_fd = ret;
+	}
 	io_cqring_add_event(req, ret);
 	io_put_req(req);
 	return 0;
@@ -4779,6 +4827,14 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
 		return 0;
 
 	fixed = (flags & IOSQE_FIXED_FILE);
+	if (fd == IOSQE_FD_LAST_OPEN) {
+		if (fixed)
+			return -EBADF;
+		req->flags |= REQ_F_OPEN_FD;
+		req->file = NULL;
+		return 0;
+	}
+
 	if (unlikely(!fixed && req->needs_fixed_file))
 		return -EBADF;
 
@@ -7448,6 +7504,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
 
 	BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
+	BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
 	return 0;
 };
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 53b36311cdac..3ccf74efe381 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -77,6 +77,12 @@ enum {
 /* always go async */
 #define IOSQE_ASYNC		(1U << IOSQE_ASYNC_BIT)
 
+/*
+ * 'magic' ->fd values don't point to a real fd, but rather define how fds
+ * can be inherited through links in a chain
+ */
+#define IOSQE_FD_LAST_OPEN	(-4096)	/* previous result is fd */
+
 /*
  * io_uring_setup() flags
  */

-- 
Jens Axboe


[-- Attachment #2: orc.c --]
[-- Type: text/x-csrc, Size: 1589 bytes --]

/* SPDX-License-Identifier: MIT */
/*
 * Description: open+read+close link sequence with fd passing
 *
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>

#include "liburing.h"

static int test_orc(struct io_uring *ring, const char *fname)
{
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	char buf[4096];
	int ret, i;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_openat(sqe, AT_FDCWD, fname, O_RDONLY, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = IORING_OP_OPENAT;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, -4096, buf, sizeof(buf), 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = IORING_OP_READ;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_close(sqe, -4096);
	sqe->user_data = IORING_OP_CLOSE;

	ret = io_uring_submit(ring);
	if (ret != 3) {
		fprintf(stderr, "sqe submit failed: %d\n", ret);
		goto err;
	}

	for (i = 0; i < 3; i++) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0) {
			fprintf(stderr, "wait completion %d\n", ret);
			goto err;
		}

		printf("%d: op=%u, res=%d\n", i, (unsigned) cqe->user_data, cqe->res);
		io_uring_cqe_seen(ring, cqe);
	}

	return 0;
err:
	return 1;
}

int main(int argc, char *argv[])
{
	struct io_uring ring;
	int ret;

	if (argc < 2) {
		fprintf(stderr, "%s: <file>\n", argv[0]);
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring setup failed: %d\n", ret);
		return 1;
	}

	ret = test_orc(&ring, argv[1]);
	if (ret) {
		fprintf(stderr, "test_orc failed\n");
		return ret;
	}

	return 0;
}


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 13:03                         ` Greg Kroah-Hartman
  2020-03-03 13:14                           ` Greg Kroah-Hartman
@ 2020-03-04  2:01                           ` Ian Kent
  2020-03-04 15:22                             ` Karel Zak
  1 sibling, 1 reply; 117+ messages in thread
From: Ian Kent @ 2020-03-04  2:01 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Karel Zak
  Cc: Miklos Szeredi, David Howells, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote:
> > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> > > No, I don't think this is going to be a performance issue at all,
> > > but
> > > if anything we could introduce a syscall
> > > 
> > >   ssize_t readfile(int dfd, const char *path, char *buf, size_t
> > > bufsize, int flags);
> > 
> > off-topic, but I'll buy you many many beers if you implement it ;-
> > ),
> > because open + read + close is pretty common for /sys and /proc in
> > many userspace tools; for example ps, top, lsblk, lsmem, lsns,
> > udevd
> > etc. is all about it.
> 
> Unlimited beers for a 21-line kernel patch?  Sign me up!
> 
> Totally untested, barely compiled patch below.
> 
> Actually, I like this idea (the syscall, not just the unlimited
> beers).
> Maybe this could make a lot of sense, I'll write some actual tests
> for
> it now that syscalls are getting "heavy" again due to CPU vendors
> finally paying the price for their madness...

The problem isn't with open->read->close but with the mount info
changing between reads (i.e. seq file read takes and drops the
needed lock between reads at least once).

The problem is that you don't know the buffer size needed to get this
in one hit, so how is this different to read(2)?

> 
> thanks,
> 
> greg k-h
> -------------------
> 
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 44d510bc9b78..178cd45340e2 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -359,6 +359,7 @@
>  435	common	clone3			__x64_sys_clone3/ptregs
>  437	common	openat2			__x64_sys_openat2
>  438	common	pidfd_getfd		__x64_sys_pidfd_getfd
> +439	common	readfile		__x64_sys_readfile
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache
> impact
> diff --git a/fs/open.c b/fs/open.c
> index 0788b3715731..1a830fada750 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct
> file *filp)
>  }
>  
>  EXPORT_SYMBOL(stream_open);
> +
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +		char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +	int retval;
> +	int fd;
> +
> +	if (force_o_largefile())
> +		flags |= O_LARGEFILE;
> +
> +	fd = do_sys_open(dfd, filename, flags, O_RDONLY);
> +	if (fd < 0)
> +		return fd;
> +
> +	retval = ksys_read(fd, buffer, bufsize);
> +
> +	__close_fd(current->files, fd);
> +
> +	return retval;
> +}



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03 14:10                                 ` Miklos Szeredi
  2020-03-03 14:29                                   ` Greg Kroah-Hartman
  2020-03-03 14:40                                   ` David Howells
@ 2020-03-04  4:20                                   ` Ian Kent
  2 siblings, 0 replies; 117+ messages in thread
From: Ian Kent @ 2020-03-04  4:20 UTC (permalink / raw)
  To: Miklos Szeredi, Greg Kroah-Hartman
  Cc: Karel Zak, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Tue, 2020-03-03 at 15:10 +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > If buffer is too small to fit the whole file, return error.
> > 
> > Why?  What's wrong with just returning the bytes asked for?  If
> > someone
> > only wants 5 bytes from the front of a file, it should be fine to
> > give
> > that to them, right?
> 
> I think we need to signal in some way to the caller that the result
> was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise
> the
> caller might be surprised.
> 
> > > Verify that the number of bytes read matches the file size,
> > > otherwise
> > > return error (may need to loop?).
> > 
> > No, we can't "match file size" as sysfs files do not really have a
> > sane
> > "size".  So I don't want to loop at all here, one-shot, that's all
> > you
> > get :)
> 
> Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> until read returns zero) might be a good idea regardless, because
> short
> reads are allowed.

Surely a short read equates to an error.

That has to be the definition of readfile() because you can do the
looping thing with read(2) and get the entire file anyway.

If you think about it, don't you arrive at the conclusion that this can
be done with read(2) alone anyway? You have to loop to get the entire
file, otherwise there's no point to the syscall!

Ian



* Re: [PATCH 10/17] fsinfo: Allow mount information to be queried [ver #17]
  2020-02-21 18:03 ` [PATCH 10/17] fsinfo: Allow mount information to be queried " David Howells
@ 2020-03-04 14:58   ` Miklos Szeredi
  2020-03-04 16:10   ` Miklos Szeredi
  1 sibling, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-04 14:58 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, linux-kernel

On Fri, Feb 21, 2020 at 7:03 PM David Howells <dhowells@redhat.com> wrote:

> +/*
> + * Return the path of this mount relative to its parent and clipped to
> + * the current chroot.

And clipped to nothing if outside current root.  The code doesn't
appear to care, which to me seems like a hole.

And btw, what is the point of only showing path relative to parent
mount?  This way it's impossible to get a consistent path from root
due to mount/dentry tree changes between calls to fsinfo().

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-04  2:01                           ` Ian Kent
@ 2020-03-04 15:22                             ` Karel Zak
  2020-03-04 16:49                               ` Greg Kroah-Hartman
  0 siblings, 1 reply; 117+ messages in thread
From: Karel Zak @ 2020-03-04 15:22 UTC (permalink / raw)
  To: Ian Kent
  Cc: Greg Kroah-Hartman, Miklos Szeredi, David Howells,
	Christian Brauner, James Bottomley, Steven Whitehouse,
	Miklos Szeredi, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> > Actually, I like this idea (the syscall, not just the unlimited
> > beers).
> > Maybe this could make a lot of sense, I'll write some actual tests
> > for
> > it now that syscalls are getting "heavy" again due to CPU vendors
> > finally paying the price for their madness...
> 
> The problem isn't with open->read->close but with the mount info
> changing between reads (i.e. seq file read takes and drops the
> needed lock between reads at least once).

readfile() is not a reaction to mountinfo.

The motivation is that, because of /sys and /proc, we have many places
that do a trivial open->read->close on very small text files. The way the
kernel currently delivers these small strings to userspace seems pretty
inefficient if we could do the same with one syscall.

    Karel

$ strace -e openat,read,close -c ps aux
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.32    0.004190           4       987           read
 31.42    0.003039           3       844         4 openat
 25.26    0.002443           2       842           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.009672                  2673         4 total

$ strace -e openat,read,close -c lsns
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.95    0.001567           2       593           openat
 30.93    0.001213           2       597           close
 29.12    0.001142           3       365           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.003922                  1555           total


$ strace -e openat,read,close -c lscpu
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.67    0.001480           7       189        52 openat
 34.77    0.001152           6       180           read
 20.56    0.000681           4       140           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.003313                   509        52 total


-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH 10/17] fsinfo: Allow mount information to be queried [ver #17]
  2020-02-21 18:03 ` [PATCH 10/17] fsinfo: Allow mount information to be queried " David Howells
  2020-03-04 14:58   ` Miklos Szeredi
@ 2020-03-04 16:10   ` Miklos Szeredi
  1 sibling, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-04 16:10 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, linux-kernel

On Fri, Feb 21, 2020 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> +
> +/*
> + * Return information about the submounts relative to path.
> + */
> +int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
> +{
> +       struct fsinfo_mount_child record;
> +       struct mount *m, *child;
> +
> +       if (!path->mnt)
> +               return -ENODATA;
> +
> +       m = real_mount(path->mnt);
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {

mnt_mounts is not using _rcu primitives, so why is this rcu safe?

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-04 15:22                             ` Karel Zak
@ 2020-03-04 16:49                               ` Greg Kroah-Hartman
  2020-03-04 17:55                                 ` Karel Zak
  0 siblings, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-04 16:49 UTC (permalink / raw)
  To: Karel Zak
  Cc: Ian Kent, Miklos Szeredi, David Howells, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
> On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> > On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> > > Actually, I like this idea (the syscall, not just the unlimited
> > > beers).
> > > Maybe this could make a lot of sense, I'll write some actual tests
> > > for
> > > it now that syscalls are getting "heavy" again due to CPU vendors
> > > finally paying the price for their madness...
> > 
> > The problem isn't with open->read->close but with the mount info
> > changing between reads (i.e. seq file read takes and drops the
> > needed lock between reads at least once).
> 
> readfile() is not a reaction to mountinfo.
> 
> The motivation is that, because of /sys and /proc, we have many places
> that do a trivial open->read->close on very small text files. The way the
> kernel currently delivers these small strings to userspace seems pretty
> inefficient if we could do the same with one syscall.
> 
>     Karel
> 
> $ strace -e openat,read,close -c ps aux
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  43.32    0.004190           4       987           read
>  31.42    0.003039           3       844         4 openat
>  25.26    0.002443           2       842           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.009672                  2673         4 total
> 
> $ strace -e openat,read,close -c lsns
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  39.95    0.001567           2       593           openat
>  30.93    0.001213           2       597           close
>  29.12    0.001142           3       365           read
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.003922                  1555           total
> 
> 
> $ strace -e openat,read,close -c lscpu
> ...
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  44.67    0.001480           7       189        52 openat
>  34.77    0.001152           6       180           read
>  20.56    0.000681           4       140           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.003313                   509        52 total

As a "real-world" test, would you recommend me converting one of the
above tools to my implementation of readfile to see how/if it actually
makes sense, or do you have some other tool you would rather see me try?

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-04 16:49                               ` Greg Kroah-Hartman
@ 2020-03-04 17:55                                 ` Karel Zak
  0 siblings, 0 replies; 117+ messages in thread
From: Karel Zak @ 2020-03-04 17:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ian Kent, Miklos Szeredi, David Howells, Christian Brauner,
	James Bottomley, Steven Whitehouse, Miklos Szeredi, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml

On Wed, Mar 04, 2020 at 05:49:13PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
> > $ strace -e openat,read,close -c ps aux
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  43.32    0.004190           4       987           read
> >  31.42    0.003039           3       844         4 openat
> >  25.26    0.002443           2       842           close
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.009672                  2673         4 total
> > 
> > $ strace -e openat,read,close -c lsns
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  39.95    0.001567           2       593           openat
> >  30.93    0.001213           2       597           close
> >  29.12    0.001142           3       365           read
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.003922                  1555           total
> > 
> > 
> > $ strace -e openat,read,close -c lscpu
> > ...
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  44.67    0.001480           7       189        52 openat
> >  34.77    0.001152           6       180           read
> >  20.56    0.000681           4       140           close
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.003313                   509        52 total
> 
> As a "real-world" test, would you recommend me converting one of the
> above tools to my implementation of readfile to see how/if it actually
> makes sense, or do you have some other tool you would rather see me try?

See lib/path.c and lib/sysfs.c in util-linux (https://github.com/karelzak/util-linux),
for example ul_path_read() and ul_path_scanf().

We use it for lsblk, lsmem, lscpu, etc.

 $ git grep -c ul_path_read misc-utils/lsblk.c sys-utils/lscpu.c
 misc-utils/lsblk.c:30
 sys-utils/lscpu.c:31
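
For readers who do not know these helpers, the general shape of a
scanf-style reader for small text files can be sketched as follows.
This is not util-linux code and does not reproduce the signature of
ul_path_scanf(); file_scanf() is a hypothetical name chosen for the
illustration.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical scanf-style reader for a small text file.  Returns the
 * number of converted items (as vfscanf() does), or EOF if the file
 * cannot be opened or holds no input.
 */
static int file_scanf(const char *path, const char *fmt, ...)
{
	va_list ap;
	FILE *f = fopen(path, "r");
	int rc;

	if (!f)
		return EOF;

	va_start(ap, fmt);
	rc = vfscanf(f, fmt, ap);	/* parse directly from the stream */
	va_end(ap);
	fclose(f);
	return rc;
}
```

Underneath, stdio still opens, reads and closes the file, so each call
to a helper like this is one instance of the open->read->close pattern
counted in the strace output above.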

We're probably a little bit off-topic here; I'm happy to continue on
util-linux@vger.kernel.org or by private mail. Thanks!

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-03  7:46                   ` Miklos Szeredi
@ 2020-03-06 16:25                     ` Miklos Szeredi
  2020-03-06 19:43                       ` Al Viro
  2020-03-07  9:48                       ` Greg Kroah-Hartman
  0 siblings, 2 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-06 16:25 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> 
> I'm doing a patch.   Let's see how it fares in the face of all these
> preconceptions.

Here's a first cut.  Doesn't yet have superblock info, just mount info.
Probably has rough edges, but appears to work.

I started with sysfs, then kernfs, then went with a custom filesystem, because
neither could do what I wanted.

Anyway, this is more for review of the concept than for code review, but
obviously if you see a fatal flaw in the design, please let me know.

get mountinfo from open file:

  cat /proc/$PID/fdmount/$FD/*

get mountinfo by mount ID:

  mount -t mountfs mountfs /mountfs
  cat /mountfs/$MNT_ID/*
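
From a program, the same lookup by mount ID could look roughly like the
sketch below.  It assumes the layout proposed in this patch
(<mountpoint>/<mount ID>/<attribute>), which does not exist in a
mainline kernel; mountfs_attr_path() and mountfs_read_attr() are
hypothetical userspace helper names, not part of the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build "<base>/<mnt_id>/<attr>" for the proposed mountfs layout.
 * Returns 0 on success, -1 if the result would not fit in buf.
 * base is wherever mountfs was mounted, e.g. "/mountfs".
 */
static int mountfs_attr_path(char *buf, size_t len, const char *base,
			     int mnt_id, const char *attr)
{
	int n = snprintf(buf, len, "%s/%d/%s", base, mnt_id, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of one attribute into val (newline included).
 * Returns 0 on success, -1 on error.  An empty attribute file yields
 * an empty string.
 */
static int mountfs_read_attr(const char *base, int mnt_id, const char *attr,
			     char *val, size_t len)
{
	char path[256];
	FILE *f;
	int err = 0;

	if (len == 0 || mountfs_attr_path(path, sizeof(path), base, mnt_id, attr))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(val, (int)len, f)) {
		if (ferror(f))
			err = -1;	/* real read error */
		else
			val[0] = '\0';	/* empty attribute */
	}
	fclose(f);
	return err;
}
```

For example, mountfs_read_attr("/mountfs", 21, "mountpoint", buf,
sizeof(buf)) would correspond to "cat /mountfs/21/mountpoint", with 21
being an arbitrary mount ID.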


Thanks,
Miklos

---
 fs/Makefile              |    1 
 fs/mount.h               |   11 +
 fs/mountfs/Makefile      |    1 
 fs/mountfs/super.c       |  497 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c           |   60 +++++
 fs/proc/base.c           |    2 
 fs/proc/fd.c             |   82 +++++++
 fs/proc/fd.h             |    3 
 fs/proc_namespace.c      |   22 --
 fs/seq_file.c            |   23 ++
 include/linux/seq_file.h |    1 
 11 files changed, 682 insertions(+), 21 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -135,3 +135,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +154,13 @@ static inline bool is_anon_ns(struct mnt
 {
 	return ns->seq == 0;
 }
+
+extern struct mount *get_mount(struct mount *mnt);
+extern void mntput_no_expire(struct mount *mnt);
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns);
+extern void mountfs_remove(struct mount *mnt);
+void seq_mount_children(struct seq_file *sf, struct mount *mnt);
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y				+= super.o
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,497 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_MUTEX(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static struct mount *mountfs_get_mount(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+
+	rcu_read_lock();
+	mnt = get_mount(rcu_dereference(entry->mnt));
+	rcu_read_unlock();
+
+	return mnt;
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt = mountfs_get_mount(entry);
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int err = 0;
+
+	if (!mnt)
+		return -ENODEV;
+
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		path_put(&root);
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		int parent;
+
+		rcu_read_lock();
+		parent = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		rcu_read_unlock();
+
+		seq_printf(sf, "%i\n", parent);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		seq_mount_children(sf, mnt);
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			int master;
+
+			rcu_read_lock();
+			master = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			rcu_read_unlock();
+			seq_printf(sf, "%i\n", master);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			seq_mount_propagate_from(sf, mnt, &root);
+			path_put(&root);
+		}
+	}
+	mntput_no_expire(mnt);
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	if (mnt->mnt.mnt_flags & MNT_INTERNAL)
+		return;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	mutex_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+
+	mutex_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	mutex_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	if (WARN_ON(snprintf(buf, sizeof(buf), "%lu", mnt_id) >= sizeof(buf)) ||
+	    strcmp(buf, name) != 0)
+		return NULL;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	mutex_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOENT);
+		i++;
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	mutex_lock(&mountfs_lock);
+	mountfs_find_node(ctx->pos - 2, &node);
+	for (; node; node = rb_next(node)) {
+		entry = mountfs_node_to_entry(node);
+		len = snprintf(name, sizeof(name), "%i", entry->id);
+		if (WARN_ON(len >= sizeof(name)))
+			goto out_unlock;
+		if (!mountfs_entry_visible(entry))
+			continue;
+		ctx->pos = (loff_t) entry->id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(entry->id), DT_DIR))
+			goto out_unlock;
+	}
+	ctx->pos = (loff_t) INT_MAX + 3;
+out_unlock:
+	mutex_unlock(&mountfs_lock);
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	if (WARN_ON(this.len >= sizeof(name)))
+		return -EIO;
+
+	dentry = d_hash_and_lookup(root, &this);
+	if (!dentry) {
+		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+
+		dentry = d_alloc_parallel(root, &this, &wq);
+		if (!IS_ERR(dentry) && d_in_lookup(dentry)) {
+			kref_get(&entry->kref);
+			old = mountfs_lookup_entry(dentry, entry, 0);
+			d_lookup_done(dentry);
+			if (unlikely(old)) {
+				dput(dentry);
+				dentry = old;
+			}
+		}
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -172,6 +172,24 @@ unsigned int mnt_get_count(struct mount
 #endif
 }
 
+struct mount *get_mount(struct mount *mnt)
+{
+	if (mnt) {
+		/* see comment in mntput_no_expire() */
+		if (likely(READ_ONCE(mnt->mnt_ns))) {
+			mnt_add_count(mnt, 1);
+		} else {
+			lock_mount_hash();
+			if (mnt->mnt.mnt_flags & MNT_DOOMED)
+				mnt = NULL;
+			else
+				mnt_add_count(mnt, 1);
+			unlock_mount_hash();
+		}
+	}
+	return mnt;
+}
+
 static struct mount *alloc_vfsmnt(const char *name)
 {
 	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -1091,6 +1109,9 @@ static void cleanup_mnt(struct mount *mn
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
+	mountfs_remove(mnt);
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1120,7 +1141,7 @@ static void delayed_mntput(struct work_s
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
-static void mntput_no_expire(struct mount *mnt)
+void mntput_no_expire(struct mount *mnt)
 {
 	LIST_HEAD(list);
 
@@ -1296,6 +1317,37 @@ const struct seq_operations mounts_op =
 };
 #endif  /* CONFIG_PROC_FS */
 
+void seq_mount_children(struct seq_file *sf, struct mount *mnt)
+{
+	struct mount *child;
+	bool first = true;
+
+	down_read(&namespace_sem);
+	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+		if (!first)
+			seq_putc(sf, ',');
+		else
+			first = false;
+		seq_printf(sf, "%i", child->mnt_id);
+	}
+	up_read(&namespace_sem);
+	if (!first)
+		seq_putc(sf, '\n');
+}
+
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root)
+{
+	int dom;
+
+	down_read(&namespace_sem);
+	dom = get_dominating_id(mnt, root);
+	up_read(&namespace_sem);
+
+	if (dom)
+		seq_printf(sf, "%i\n", dom);
+}
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
@@ -2062,6 +2114,9 @@ static int attach_recursive_mnt(struct m
 		err = count_mounts(ns, source_mnt);
 		if (err)
 			goto out;
+
+		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
+			mountfs_create(p, ns);
 	}
 
 	if (IS_MNT_SHARED(dest_mnt)) {
@@ -3224,6 +3279,7 @@ struct mnt_namespace *copy_mnt_ns(unsign
 	p = old;
 	q = new;
 	while (p) {
+		mountfs_create(q, new_ns);
 		q->mnt_ns = new_ns;
 		new_ns->mounts++;
 		if (new_fs) {
@@ -3686,6 +3742,8 @@ static void __init init_mount_tree(void)
 	if (IS_ERR(ns))
 		panic("Can't allocate initial namespace");
 	m = real_mount(mnt);
+
+	mountfs_create(m, ns);
 	m->mnt_ns = ns;
 	ns->root = m;
 	ns->mounts = 1;
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations pro
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations pro
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_fil
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 
 #include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struc
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, str
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
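
The table-driven loop in seq_mnt_opts() above can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel code: demo_mnt_opts() and the DEMO_MNT_* constants are illustrative names, the flag values are stand-ins for the real MNT_* definitions in <linux/mount.h>, and the output goes to a caller-supplied buffer rather than a seq_file.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in flag values for illustration only; the kernel's real MNT_*
 * constants live in <linux/mount.h>. */
#define DEMO_MNT_NOSUID		0x01
#define DEMO_MNT_NODEV		0x02
#define DEMO_MNT_NOEXEC		0x04
#define DEMO_MNT_NOATIME	0x08
#define DEMO_MNT_NODIRATIME	0x10
#define DEMO_MNT_RELATIME	0x20

/*
 * Append ",option" to buf for every flag set in mnt_flags, in table order.
 * Returns 0 on success, -1 if buf (of size len) is too small.
 */
int demo_mnt_opts(char *buf, size_t len, int mnt_flags)
{
	static const struct {
		int flag;
		const char *str;
	} mnt_info[] = {
		{ DEMO_MNT_NOSUID, ",nosuid" },
		{ DEMO_MNT_NODEV, ",nodev" },
		{ DEMO_MNT_NOEXEC, ",noexec" },
		{ DEMO_MNT_NOATIME, ",noatime" },
		{ DEMO_MNT_NODIRATIME, ",nodiratime" },
		{ DEMO_MNT_RELATIME, ",relatime" },
		{ 0, NULL }	/* sentinel: terminates the loop below */
	};
	size_t used = 0;
	unsigned int i;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; mnt_info[i].flag; i++) {
		size_t n;

		if (!(mnt_flags & mnt_info[i].flag))
			continue;
		n = strlen(mnt_info[i].str);
		if (used + n >= len)	/* no room for string plus NUL */
			return -1;
		memcpy(buf + used, mnt_info[i].str, n + 1);
		used += n;
	}
	return 0;
}
```

With a 64-byte buffer, demo_mnt_opts(buf, sizeof(buf), DEMO_MNT_NOSUID | DEMO_MNT_RELATIME) leaves ",nosuid,relatime" in buf. Taking the flags as a plain int, as the patch does, is what lets the helper be shared by callers that do not have a struct vfsmount at hand.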


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 16:25                     ` Miklos Szeredi
@ 2020-03-06 19:43                       ` Al Viro
  2020-03-06 19:54                         ` Miklos Szeredi
  2020-03-06 19:58                         ` Al Viro
  2020-03-07  9:48                       ` Greg Kroah-Hartman
  1 sibling, 2 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 19:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > 
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
> 
> Here's a first cut.  Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.

For starters, you have just made namespace_sem held over copy_to_user().
This is not going to fly.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 19:43                       ` Al Viro
@ 2020-03-06 19:54                         ` Miklos Szeredi
  2020-03-06 19:58                         ` Al Viro
  1 sibling, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-06 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 6, 2020 at 8:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
>
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

Where?

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 19:43                       ` Al Viro
  2020-03-06 19:54                         ` Miklos Szeredi
@ 2020-03-06 19:58                         ` Al Viro
  2020-03-06 20:05                           ` Al Viro
  1 sibling, 1 reply; 117+ messages in thread
From: Al Viro @ 2020-03-06 19:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > 
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> > 
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
> 
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

In case the above is too terse: you grab your mutex while under
namespace_sem (see attach_recursive_mnt()); the same mutex is held
while calling dir_emit().  Which can (and normally does) copy data
to userland-supplied buffer.

NAK for that reason alone, and to be honest I had been too busy
suppressing the gag reflex to read and comment any deeper.

I really hate that approach, in case it's not clear from the above.
To the degree that I don't trust myself to filter out the obscenities
if I try to comment on it right now.

The only blocking thing we can afford under namespace_sem is GFP_KERNEL
allocation.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 19:58                         ` Al Viro
@ 2020-03-06 20:05                           ` Al Viro
  2020-03-06 20:11                             ` Miklos Szeredi
  2020-03-06 20:37                             ` Al Viro
  0 siblings, 2 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > 
> > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > preconceptions.
> > > 
> > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > Probably has rough edges, but appears to work.
> > 
> > For starters, you have just made namespace_sem held over copy_to_user().
> > This is not going to fly.
> 
> In case the above is too terse: you grab your mutex while under
> namespace_sem (see attach_recursive_mnt()); the same mutex is held
> while calling dir_emit().  Which can (and normally does) copy data
> to userland-supplied buffer.
> 
> NAK for that reason alone, and to be honest I had been too busy
> suppressing the gag reflex to read and comment any deeper.
> 
> I really hate that approach, in case it's not clear from the above.
> To the degree that I don't trust myself to filter out the obscenities
> if I try to comment on it right now.
> 
> The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> allocation.

Incidentally, attach_recursive_mnt() only gets you the root(s) of
attached tree(s); try mount --rbind and see how much you've missed.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:05                           ` Al Viro
@ 2020-03-06 20:11                             ` Miklos Szeredi
  2020-03-06 20:37                             ` Al Viro
  1 sibling, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-06 20:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 6, 2020 at 9:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > >
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > >
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> >
> > In case the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit().  Which can (and normally does) copy data
> > to userland-supplied buffer.
> >
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> >
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> >
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
>
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

Okay.  Both trivially fixable:

 - the dir_emit() can be taken out from under the mutex and the rb
tree search repeated for every entry; possibly not as efficient, but I
guess at this point that's irrelevant

 - addition of the mountfs entry moved to the right places

Thanks,
Miklos
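
The first fix listed above (emit outside the mutex, repeating the search for every entry) can be sketched in userspace C. This is an illustration under stated assumptions, not the mountfs code: struct demo_table, demo_next(), demo_iterate() and demo_collect() are hypothetical names, a pthread mutex stands in for the kernel mutex, a sorted array with a linear scan stands in for the rb tree, and the emit callback stands in for dir_emit(), which may block while copying to a user buffer.

```c
#include <limits.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table of entry IDs, sorted ascending, guarded by a lock. */
struct demo_table {
	pthread_mutex_t lock;
	const unsigned int *ids;
	size_t nr;
};

/* Callback that may block; returning false stops the iteration. */
typedef bool (*demo_emit_t)(void *ctx, unsigned int id);

/* Find the smallest ID >= pos. The lock is held only for the search. */
bool demo_next(struct demo_table *t, unsigned int pos, unsigned int *out)
{
	bool found = false;
	size_t i;

	pthread_mutex_lock(&t->lock);
	for (i = 0; i < t->nr; i++) {	/* stands in for an rb-tree search */
		if (t->ids[i] >= pos) {
			*out = t->ids[i];	/* copy out while locked */
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&t->lock);
	return found;
}

/* Emit every ID >= start; the lock is never held across emit(). */
size_t demo_iterate(struct demo_table *t, unsigned int start,
		    demo_emit_t emit, void *ctx)
{
	unsigned int pos = start, id;
	size_t emitted = 0;

	while (demo_next(t, pos, &id)) {
		if (!emit(ctx, id))	/* lock already dropped here */
			break;
		emitted++;
		if (id == UINT_MAX)	/* avoid wrapping pos to 0 */
			break;
		pos = id + 1;		/* resume after the last emitted ID */
	}
	return emitted;
}

/* Example sink: records up to 8 IDs. */
struct demo_sink {
	unsigned int seen[8];
	size_t nr;
};

bool demo_collect(void *ctx, unsigned int id)
{
	struct demo_sink *s = ctx;

	if (s->nr == sizeof(s->seen) / sizeof(s->seen[0]))
		return false;
	s->seen[s->nr++] = id;
	return true;
}
```

Resuming by key rather than by pointer means an entry removed between two searches is simply skipped, at the cost of one search per emitted entry, which is the efficiency trade-off the message mentions.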


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:05                           ` Al Viro
  2020-03-06 20:11                             ` Miklos Szeredi
@ 2020-03-06 20:37                             ` Al Viro
  2020-03-06 20:38                               ` Al Viro
  1 sibling, 1 reply; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:37 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 08:05:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > > 
> > > > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > > 
> > > > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > > 
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> > 
> > In case the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit().  Which can (and normally does) copy data
> > to userland-supplied buffer.
> > 
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> > 
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> > 
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
> 
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

You are misreading mntput_no_expire(), BTW - your get_mount() can
bloody well race with umount(2), hitting the moment when we are done
figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
is not mounted anywhere else, they are not supposed to see the sucker
return 0 until the filesystem is shut down.  You break that.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:37                             ` Al Viro
@ 2020-03-06 20:38                               ` Al Viro
  2020-03-06 20:45                                 ` Al Viro
  0 siblings, 1 reply; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:

> You are misreading mntput_no_expire(), BTW - your get_mount() can
> bloody well race with umount(2), hitting the moment when we are done
> figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> is not mounted anywhere else, they are not supposed to see the sucker
> return 0 until the filesystem is shut down.  You break that.

While we are at it, d_alloc_parallel() requires i_rwsem on parent held
at least shared.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:38                               ` Al Viro
@ 2020-03-06 20:45                                 ` Al Viro
  2020-03-06 20:49                                   ` Al Viro
  2020-03-06 20:51                                   ` Miklos Szeredi
  0 siblings, 2 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:45 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> 
> > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > bloody well race with umount(2), hitting the moment when we are done
> > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > is not mounted anywhere else, they are not supposed to see the sucker
> > return 0 until the filesystem is shut down.  You break that.
> 
> While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> at least shared.

Egads...  Let me see if I got it right - you are providing procfs symlinks
to objects on the internal mount of that thing.  And those objects happen
to be directories, so one can get to their parent that way.  Or am I misreading
that thing?


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:45                                 ` Al Viro
@ 2020-03-06 20:49                                   ` Al Viro
  2020-03-06 20:51                                     ` Miklos Szeredi
  2020-03-06 20:56                                     ` Al Viro
  2020-03-06 20:51                                   ` Miklos Szeredi
  1 sibling, 2 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > 
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down.  You break that.
> > 
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.
> 
> Egads...  Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing.  And those objects happen
> to be directories, so one can get to their parent that way.  Or am I misreading
> that thing?

IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
stat(2) on "00" spew into syslog?  Confused...


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:45                                 ` Al Viro
  2020-03-06 20:49                                   ` Al Viro
@ 2020-03-06 20:51                                   ` Miklos Szeredi
  1 sibling, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-06 20:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 6, 2020 at 9:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> >
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down.  You break that.

Ah, good point.

> >
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.

Okay.

> Egads...  Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing.  And those objects happen
> to be directories, so one can get to their parent that way.  Or am I misreading
> that thing?

Yes.

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:49                                   ` Al Viro
@ 2020-03-06 20:51                                     ` Miklos Szeredi
  2020-03-06 21:28                                       ` Al Viro
  2020-03-06 20:56                                     ` Al Viro
  1 sibling, 1 reply; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-06 20:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > >
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down.  You break that.
> > >
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> >
> > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing.  And those objects happen
> > to be directories, so one can get to their parent that way.  Or am I misreading
> > that thing?
>
> IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> stat(2) on "00" spew into syslog?  Confused...

The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:49                                   ` Al Viro
  2020-03-06 20:51                                     ` Miklos Szeredi
@ 2020-03-06 20:56                                     ` Al Viro
  1 sibling, 0 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 20:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 08:49:26PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > > 
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down.  You break that.
> > > 
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> > 
> > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing.  And those objects happen
> > to be directories, so one can get to their parent that way.  Or am I misreading
> > that thing?
> 
> IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> stat(2) on "00" spew into syslog?  Confused...

AFAICS, refcounting in there cannot be right:
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+                                    unsigned int flags)
+{
+       struct mountfs_entry *entry = dir->i_private;
+       int i = 0;
+               
+       if (entry) {
+               for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+                       if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+                               break;
+               if (i == ARRAY_SIZE(mountfs_attrs))
+                       return ERR_PTR(-ENOMEM);
+               i++;
+       } else {
+               entry = mountfs_get_entry(dentry->d_name.name);
+               if (!entry)
+                       return ERR_PTR(-ENOENT);
+       }
+                          
+       return mountfs_lookup_entry(dentry, entry, i);
+}
ends up consuming a reference in mountfs_lookup_entry() (at the very least,
it drops it in case of inode allocation hitting OOM) and nothing in that
loop in mountfs_lookup() appears to do a matching reference grab.
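
The imbalance described above can be modelled in userspace C. This is a sketch under assumptions, not the patch's code: struct demo_entry, demo_get(), demo_put(), demo_consume() and demo_lookup_attr() are hypothetical names, the attribute names are invented for illustration, and a plain int stands in for a real reference count. It shows the attribute branch taking its own reference before handing the entry to a callee that consumes one.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical refcounted entry; refs starts at 1, held by the directory. */
struct demo_entry {
	int refs;
};

void demo_get(struct demo_entry *e) { e->refs++; }
void demo_put(struct demo_entry *e) { e->refs--; }

/*
 * Consumes one reference: on failure it is dropped, on success it is
 * kept by the newly created object (modelled here as simply retained).
 */
int demo_consume(struct demo_entry *e, int simulate_oom)
{
	if (simulate_oom) {
		demo_put(e);
		return -ENOMEM;
	}
	return 0;
}

/* Invented attribute names, for illustration only. */
static const char *const demo_attrs[] = { "root", "mountpoint", "id" };

/* Look up an attribute of an entry the directory already references. */
int demo_lookup_attr(struct demo_entry *dir_entry, const char *name,
		     int simulate_oom)
{
	size_t i, nr = sizeof(demo_attrs) / sizeof(demo_attrs[0]);

	for (i = 0; i < nr; i++)
		if (strcmp(demo_attrs[i], name) == 0)
			break;
	if (i == nr)
		return -ENOENT;

	/* Matching grab for the reference demo_consume() will consume. */
	demo_get(dir_entry);
	return demo_consume(dir_entry, simulate_oom);
}
```

Without the demo_get() call, a failed demo_consume() would drop the directory's own reference, leaving the count at zero while the directory still points at the entry.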


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 20:51                                     ` Miklos Szeredi
@ 2020-03-06 21:28                                       ` Al Viro
  0 siblings, 0 replies; 117+ messages in thread
From: Al Viro @ 2020-03-06 21:28 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Greg Kroah-Hartman

On Fri, Mar 06, 2020 at 09:51:50PM +0100, Miklos Szeredi wrote:
> On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > > >
> > > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > > bloody well race with umount(2), hitting the moment when we are done
> > > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > > set MNT_DOOMED) yet.  If somebody calls umount(2) on a filesystem that
> > > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > > return 0 until the filesystem is shut down.  You break that.
> > > >
> > > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > > at least shared.
> > >
> > > Egads...  Let me see if I got it right - you are providing procfs symlinks
> > > to objects on the internal mount of that thing.  And those objects happen
> > > to be directories, so one can get to their parent that way.  Or am I misreading
> > > that thing?
> >
> > IDGI.  You have (in your lookup) kstrtoul, followed by snprintf, followed
> > by strcmp and WARN_ON() in case of mismatch?  Is there any point in having
> > stat(2) on "00" spew into syslog?  Confused...
> 
> The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

That makes even less sense - buffer overrun on snprintf of an int into
32-character array?  That's what, future-proofing it for the time we
manage to issue 10^31 syscalls since the (much closer) moment when we
get 128bit int?
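
The parse, re-format and compare sequence discussed above can be sketched in userspace C. This is an illustration, not the lookup code from the patch, which is not quoted in this part of the thread: demo_parse_canonical_id() is a hypothetical name and strtoul() stands in for the kernel's kstrtoul(). The sketch accepts a name only if it is the canonical decimal spelling of its value, so "00" is rejected quietly instead of triggering a warning.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Accept name only if it is the canonical decimal form of an unsigned
 * long: no sign, no whitespace, no leading zeroes, no trailing junk.
 */
bool demo_parse_canonical_id(const char *name, unsigned long *id)
{
	char buf[32];
	char *end;
	unsigned long val;
	int len;

	/* strtoul() would skip whitespace and accept a sign; reject early. */
	if (name[0] < '0' || name[0] > '9')
		return false;

	errno = 0;
	val = strtoul(name, &end, 10);
	if (errno || *end != '\0')
		return false;

	/*
	 * A 64-bit unsigned long prints as at most 20 digits, so a
	 * 32-byte buffer cannot overflow; the check is purely defensive.
	 */
	len = snprintf(buf, sizeof(buf), "%lu", val);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return false;

	/* Round-trip mismatch means a non-canonical spelling such as "00". */
	if (strcmp(buf, name) != 0)
		return false;

	*id = val;
	return true;
}
```

Returning a plain failure for non-canonical names keeps each ID reachable under exactly one name without logging anything when userspace probes odd spellings.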


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-06 16:25                     ` Miklos Szeredi
  2020-03-06 19:43                       ` Al Viro
@ 2020-03-07  9:48                       ` Greg Kroah-Hartman
  2020-03-07 20:48                         ` Miklos Szeredi
  1 sibling, 1 reply; 117+ messages in thread
From: Greg Kroah-Hartman @ 2020-03-07  9:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > 
> > I'm doing a patch.   Let's see how it fares in the face of all these
> > preconceptions.
> 
> Here's a first cut.  Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.
> 
> I started with sysfs, then kernfs, then went with a custom filesystem, because
> neither could do what I wanted.

Hm, what is wrong with kernfs that prevented you from using it here?
Just complexity or something else?

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-03-07  9:48                       ` Greg Kroah-Hartman
@ 2020-03-07 20:48                         ` Miklos Szeredi
  0 siblings, 0 replies; 117+ messages in thread
From: Miklos Szeredi @ 2020-03-07 20:48 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ian Kent, David Howells, Christian Brauner, James Bottomley,
	Steven Whitehouse, Miklos Szeredi, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml

On Sat, Mar 7, 2020 at 10:48 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.   Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
> >
> > I started with sysfs, then kernfs, then went with a custom filesystem, because
> > neither could do what I wanted.
>
> Hm, what is wrong with kernfs that prevented you from using it here?
> Just complexity or something else?

I wanted to have a single instance covering all the namespaces, with
just a filtered view depending on which namespace the task is looking
at it from.

Having a kernfs_node for each attribute is also rather heavy compared
to the size of struct mount.

Thanks,
Miklos


Thread overview: 117+ messages
2020-02-21 18:01 [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] David Howells
2020-02-21 18:01 ` [PATCH 01/17] watch_queue: Add security hooks to rule on setting mount and sb watches " David Howells
2020-02-21 18:02 ` [PATCH 02/17] watch_queue: Implement mount topology and attribute change notifications " David Howells
2020-02-21 18:02 ` [PATCH 03/17] watch_queue: sample: Display mount tree " David Howells
2020-02-21 18:02 ` [PATCH 04/17] watch_queue: Introduce a non-repeating system-unique superblock ID " David Howells
2020-02-21 18:02 ` [PATCH 05/17] watch_queue: Add superblock notifications " David Howells
2020-02-21 18:02 ` [PATCH 06/17] watch_queue: sample: Display " David Howells
2020-02-21 18:02 ` [PATCH 07/17] fsinfo: Add fsinfo() syscall to query filesystem information " David Howells
2020-02-26  2:29   ` Aleksa Sarai
2020-02-28 14:44   ` David Howells
2020-02-21 18:02 ` [PATCH 08/17] fsinfo: Provide a bitmap of supported features " David Howells
2020-02-21 18:03 ` [PATCH 09/17] fsinfo: Allow fsinfo() to look up a mount object by ID " David Howells
2020-02-21 18:03 ` [PATCH 10/17] fsinfo: Allow mount information to be queried " David Howells
2020-03-04 14:58   ` Miklos Szeredi
2020-03-04 16:10   ` Miklos Szeredi
2020-02-21 18:03 ` [PATCH 11/17] fsinfo: sample: Mount listing program " David Howells
2020-02-21 18:03 ` [PATCH 12/17] fsinfo: Allow the mount topology propogation flags to be retrieved " David Howells
2020-02-21 18:03 ` [PATCH 13/17] fsinfo: Query superblock unique ID and notification counter " David Howells
2020-02-21 18:03 ` [PATCH 14/17] fsinfo: Add API documentation " David Howells
2020-02-21 18:03 ` [PATCH 15/17] fsinfo: Add support for AFS " David Howells
2020-02-21 18:03 ` [PATCH 16/17] fsinfo: Add example support for Ext4 " David Howells
2020-02-21 18:04 ` [PATCH 17/17] fsinfo: Add example support for NFS " David Howells
2020-02-21 20:21 ` [PATCH 00/17] VFS: Filesystem information and notifications " James Bottomley
2020-02-24 10:24   ` Miklos Szeredi
2020-02-24 14:55     ` James Bottomley
2020-02-24 15:28       ` Miklos Szeredi
2020-02-25 12:13         ` Steven Whitehouse
2020-02-25 15:28           ` James Bottomley
2020-02-25 15:47             ` Steven Whitehouse
2020-02-26  9:11             ` Miklos Szeredi
2020-02-26 10:51               ` Steven Whitehouse
2020-02-27  5:06               ` Ian Kent
2020-02-27  9:36                 ` Miklos Szeredi
2020-02-27 11:34                   ` Ian Kent
2020-02-27 13:45                     ` Miklos Szeredi
2020-02-27 15:14                       ` Karel Zak
2020-02-28  0:43                         ` Ian Kent
2020-02-28  8:35                           ` Miklos Szeredi
2020-02-28 12:27                             ` Greg Kroah-Hartman
2020-02-28 16:24                               ` Miklos Szeredi
2020-02-28 17:15                                 ` Al Viro
2020-03-02  8:43                                   ` Miklos Szeredi
2020-03-02 10:34                                 ` Karel Zak
2020-02-28 16:42                               ` David Howells
2020-02-28 15:08                             ` James Bottomley
2020-02-28 15:40                               ` Miklos Szeredi
2020-02-28  0:12                       ` Ian Kent
2020-02-28 15:52             ` Christian Brauner
2020-02-28 16:36             ` David Howells
2020-03-02  9:09               ` Miklos Szeredi
2020-03-02  9:38                 ` Greg Kroah-Hartman
2020-03-03  5:27                 ` Ian Kent
2020-03-03  7:46                   ` Miklos Szeredi
2020-03-06 16:25                     ` Miklos Szeredi
2020-03-06 19:43                       ` Al Viro
2020-03-06 19:54                         ` Miklos Szeredi
2020-03-06 19:58                         ` Al Viro
2020-03-06 20:05                           ` Al Viro
2020-03-06 20:11                             ` Miklos Szeredi
2020-03-06 20:37                             ` Al Viro
2020-03-06 20:38                               ` Al Viro
2020-03-06 20:45                                 ` Al Viro
2020-03-06 20:49                                   ` Al Viro
2020-03-06 20:51                                     ` Miklos Szeredi
2020-03-06 21:28                                       ` Al Viro
2020-03-06 20:56                                     ` Al Viro
2020-03-06 20:51                                   ` Miklos Szeredi
2020-03-07  9:48                       ` Greg Kroah-Hartman
2020-03-07 20:48                         ` Miklos Szeredi
2020-03-03  9:12                   ` David Howells
2020-03-03  9:26                     ` Miklos Szeredi
2020-03-03  9:48                       ` Miklos Szeredi
2020-03-03 10:21                         ` Steven Whitehouse
2020-03-03 10:32                           ` Miklos Szeredi
2020-03-03 11:09                             ` Ian Kent
2020-03-03 10:00                       ` Christian Brauner
2020-03-03 10:13                         ` Miklos Szeredi
2020-03-03 10:25                           ` Christian Brauner
2020-03-03 11:33                             ` Miklos Szeredi
2020-03-03 11:56                               ` Christian Brauner
2020-03-03 11:38                       ` Karel Zak
2020-03-03 13:03                         ` Greg Kroah-Hartman
2020-03-03 13:14                           ` Greg Kroah-Hartman
2020-03-03 13:34                             ` Miklos Szeredi
2020-03-03 13:43                               ` Greg Kroah-Hartman
2020-03-03 14:10                                 ` Greg Kroah-Hartman
2020-03-03 14:13                                   ` Jann Horn
2020-03-03 14:24                                     ` Greg Kroah-Hartman
2020-03-03 15:44                                       ` Jens Axboe
2020-03-03 16:37                                         ` Greg Kroah-Hartman
2020-03-03 16:51                                         ` Jeff Layton
2020-03-03 16:55                                           ` Jens Axboe
2020-03-03 19:02                                             ` Jeff Layton
2020-03-03 19:07                                               ` Jens Axboe
2020-03-03 19:23                                               ` Jens Axboe
2020-03-03 19:43                                                 ` Jeff Layton
2020-03-03 20:33                                                   ` Jens Axboe
2020-03-03 21:03                                                     ` Jeff Layton
2020-03-03 21:20                                                       ` Jens Axboe
2020-03-03 14:10                                 ` Miklos Szeredi
2020-03-03 14:29                                   ` Greg Kroah-Hartman
2020-03-03 14:40                                     ` Jann Horn
2020-03-03 16:51                                       ` Greg Kroah-Hartman
2020-03-03 16:57                                         ` Jann Horn
2020-03-03 20:15                                         ` Greg Kroah-Hartman
2020-03-03 14:40                                   ` David Howells
2020-03-04  4:20                                   ` Ian Kent
2020-03-03 14:19                                 ` David Howells
2020-03-03 16:59                                   ` Greg Kroah-Hartman
2020-03-03 14:23                               ` Christian Brauner
2020-03-03 15:23                                 ` Greg Kroah-Hartman
2020-03-03 15:53                                 ` David Howells
2020-03-04  2:01                           ` Ian Kent
2020-03-04 15:22                             ` Karel Zak
2020-03-04 16:49                               ` Greg Kroah-Hartman
2020-03-04 17:55                                 ` Karel Zak
2020-03-03 14:09                         ` David Howells
