[PATCH 00/13] VFS: Filesystem information [ver #19]
From: David Howells <dhowells@redhat.com>
Date: 2020-03-18 15:08 UTC
To: torvalds, viro
Cc: linux-nfs, Andreas Dilger, Anna Schumaker, Theodore Ts'o, linux-api,
    linux-ext4, Trond Myklebust, dhowells, raven, mszeredi, christian,
    jannh, darrick.wong, kzak, jlayton, linux-fsdevel,
    linux-security-module, linux-kernel

Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.

The patchset is based on top of the notifications patchset and allows the
event counters implemented in the latter to be retrieved so that overruns
can be efficiently managed.

Included are a couple of sample programs plus limited example code for NFS
and Ext4.  The example code is not intended to go upstream as-is.

=======
THE WHY
=======

Why do we want this?  Using /proc/mounts (or similar) has problems:

 (1) Reading from it holds a global lock (namespace_sem) that prevents
     mounting and unmounting.  Lots of data is encoded and mangled into
     text whilst the lock is held, including superblock option strings and
     mount point paths.  This causes performance problems when there are a
     lot of mount objects in a system.

 (2) Even though namespace_sem is held during a read, reading the whole
     file isn't necessarily atomic with respect to mount-type operations.
     If a read isn't satisfied in one go, the reader may return to
     userspace briefly and then continue reading some way into the file.
     Changes that occur in the interval may then go unseen.

 (3) Determining what has changed means parsing and comparing consecutive
     outputs of /proc/mounts.
 (4) Querying a specific mount or superblock means searching through
     /proc/mounts by path or mount ID - but we might have an fd we want to
     query.

 (5) Mount topology is not explicit.  One must derive it manually by
     comparing entries.

 (6) Whilst you can poll() it for events, it only tells you that something
     changed in the namespace, not what changed or whether you can even see
     the change.

To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree.  The notification messages include the ID(s) of the
affected mounts.

To support notifications, however, we need to be able to handle overruns in
the notification queue.  I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them.  Exposing them through /proc
would require adding yet another /proc/mounts-type file.  We could add
per-mount directories full of attributes in sysfs, but that has issues too
(see below).

Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:

 (1) Jeff Layton's error handling changes need a way to allow error event
     information to be retrieved.

 (2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
     actually 3-state { Set, Unset, Not supported }.  It could be useful to
     provide a way to expose information like this[*].

 (3) Limits of the numerical metadata values in a filesystem[*].

 (4) Filesystem capability information[*].  Filesystems don't all have the
     same capabilities, and even different instances may have different
     capabilities, particularly with network filesystems where the set of
     capabilities may be server-dependent.  Capabilities might even vary at
     file granularity - though possibly such information should be conveyed
     through statx() instead.

 (5) ID mapping/shifting tables in use for a superblock.
 (6) Filesystem-specific information.  I need something for AFS so that I
     can do pioctl()-emulation, thereby allowing me to implement certain of
     the AFS command line utilities that query the state of a particular
     file.  This could also have application for other filesystems, such as
     NFS, CIFS and ext4.

[*] In a lot of cases these are probably fixed and can be memcpy'd from
    static data.

There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.

============================
WHY NOT USE PROCFS OR SYSFS?
============================

Why is it better to go with a new system call rather than adding more magic
stuff to /proc or /sys for each superblock object and each mount object?

 (1) It can be targeted.  It makes it easy to query directly by path.
     procfs and sysfs cannot do this easily.

 (2) It's more efficient as we can return specific binary data rather than
     making huge text dumps.  Granted, sysfs and procfs could present the
     same data, though as lots of little files which have to be
     individually opened, read, closed and parsed.

 (3) We wouldn't have the overhead of open and close (even adding a
     self-contained readfile() syscall has to do that internally) and the
     RCU destruction of the file struct(s).

 (4) Opening a file in procfs or sysfs has a pathwalk overhead for each
     file accessed.  We can use an integer attribute ID instead (yes, this
     is similar to ioctl) - but could also use a string ID if that is
     preferred.
 (5) Can easily query cross-namespace if, say, a container manager process
     is given an fs_context that hasn't yet been mounted into a namespace -
     or hasn't even been fully created yet.

 (6) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
     mount happens or is removed - and since systemd makes much use of
     mount namespaces and mount propagation, this will create a lot of
     nodes.

The argument for doing this through procfs/sysfs/somemagicfs is that
someone using a shell can just query the magic files using ordinary text
tools, such as cat - and that has merit - but it doesn't solve the
query-by-pathname problem.

The suggested way around the query-by-pathname problem is to open the
target file O_PATH and then look in a magic directory under procfs
corresponding to the fd number to see a set of attribute files[*] laid out.
Bash, however, can't open by O_PATH or O_NOFOLLOW as things stand...

[*] Or possibly symlinks to files under a per-mount or per-sb directory in
    sysfs.

================
DESIGN DECISIONS
================

 (1) Information is partitioned into sets of attributes.

 (2) Attribute IDs are integers as they're fast to compare.

 (3) Attribute values are typed (struct, list of structs, string, opaque
     blob).  The type is fixed for a particular attribute.

 (4) For structure types, the length is also a version number.  New fields
     can be tacked onto the end.

 (5) When copying a versioned struct to userspace, the core handles a
     version mismatch by truncating or zero-padding the data as necessary.
     None of this is seen by the filesystem.

 (6) The core handles all the buffering and buffer resizing.

 (7) The filesystem never gets any access to the userspace parameter buffer
     or result buffer.

 (8) "Meta" attributes can describe other attributes.

========
OVERVIEW
========

fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.
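Design decision (5) above can be illustrated with a minimal userspace
sketch (the helper name and layout here are hypothetical; the real core
code in fs/fsinfo.c is more involved): the core copies the common prefix of
the two struct versions, zero-fills any fields the kernel doesn't know
about, and always returns the size it actually has.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of design decision (5): copy a versioned struct to a caller's
 * buffer, truncating for older callers and zero-padding for newer ones.
 * The return value is always the size the kernel actually has, so the
 * caller can tell which "version" of the struct it was given.
 * (Hypothetical helper; the in-kernel implementation differs.)
 */
static size_t copy_versioned_struct(void *ubuf, size_t usize,
				    const void *kbuf, size_t ksize)
{
	size_t n = usize < ksize ? usize : ksize;

	memcpy(ubuf, kbuf, n);			/* common prefix */
	if (usize > ksize)			/* caller is newer: */
		memset((char *)ubuf + n, 0, usize - ksize); /* pad with 0 */
	return ksize;				/* data actually available */
}
```

An older caller passing a short buffer gets the prefix it understands; a
newer caller on an older kernel sees its extra fields as zero.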
Attribute values are of four basic types:

 (1) Structure with version-dependent length (the length is the version).

 (2) Variable-length string.

 (3) List of structures (all of the same length).

 (4) Opaque blob.

Attributes can have multiple values, either as a sequence of values or a
sequence-of-sequences of values, and all the values of a particular
attribute must be of the same type.  Values can be up to INT_MAX in size,
subject to memory availability.

Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.

I've tried to make the interface as light as possible, so integer attribute
IDs rather than strings, and the core does all the buffer allocation,
expansion and extensibility support work rather than leaving that to the
filesystems.  This means that userspace pointers are not exposed to the
filesystem.

fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:

 (1) General superblock attributes:

     - Filesystem identifiers (UUID, volume label, device numbers, ...)
     - The limits on a filesystem's capabilities
     - Information on supported statx fields and attributes and IOC flags.
     - A variety of single-bit flags indicating supported capabilities.
     - Timestamp resolution and range.
     - The amount of space/free space in a filesystem (as statfs()).
     - Superblock notification counter.

 (2) Filesystem-specific superblock attributes:

     - Superblock-level timestamps.
     - Cell name, workgroup or other netfs grouping concept.
     - Server names and addresses.

 (3) VFS information:

     - Mount topology information.
     - Mount attributes.
     - Mount notification counter.
     - Mount point path.

 (4) Information about what the fsinfo() syscall itself supports, including
     the type and struct size of attributes.

The system is extensible:

 (1) New attributes can be added.
     There is no requirement that a filesystem implement every attribute.
     A helper function is provided to scan a list of attributes, and a
     filesystem can have multiple such lists.

 (2) Version length-dependent structure attributes can be made larger and
     have additional information tacked on the end, provided the layout of
     the existing fields is kept.  If an older process asks for a shorter
     structure, it will only be given the bits it asks for.  If a newer
     process asks for a longer structure on an older kernel, the extra
     space will be set to 0.  In all cases, the size of the data actually
     available is returned.

     In essence, the size of a structure is that structure's version: a
     smaller size is an earlier version, and a later version includes
     everything that the earlier version did.

 (3) New single-bit capability flags can be added.  This is a
     structure-typed attribute and, as such, (2) applies.  Any bits you
     wanted but the kernel doesn't support are automatically set to 0.

fsinfo() may be called like the following, for example:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_AFS_SERVER_ADDRESSES,
		.Nth		= 2,
	};
	struct fsinfo_server_address address;
	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
		     &address, sizeof(address));

The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:

	struct fsinfo_params params = {
		.at_flags	= AT_SYMLINK_NOFOLLOW,
		.flags		= FSINFO_FLAGS_QUERY_PATH,
		.request	= FSINFO_ATTR_NFS_SERVER_NAME,
	};
	char server_name[256];
	len = fsinfo(AT_FDCWD, "/home/dhowells/", &params,
		     &server_name, sizeof(server_name));

would retrieve the name of the NFS server as a string.
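Because the size of the data actually available is always returned, a
caller that gets back more than it supplied can grow its buffer and retry.
A minimal sketch of that pattern follows; since fsinfo() itself is the
proposed (unmerged) syscall, a stand-in query function with the same
contract is used here, and all names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for fsinfo(): fills up to bufsize bytes of the buffer and
 * returns the full size of the attribute value, which may be larger than
 * what was copied.  Here it just serves a fixed string.
 */
static long query_attr(void *buf, size_t bufsize)
{
	static const char value[] = "fileserver.example.com";
	size_t n = sizeof(value) < bufsize ? sizeof(value) : bufsize;

	memcpy(buf, value, n);
	return sizeof(value);		/* size actually available */
}

/*
 * Grow-and-retry loop: the pattern a caller of fsinfo() would use for
 * variable-length attributes.  (Allocation error checking elided.)
 */
static void *query_attr_alloc(size_t *sizep)
{
	size_t bufsize = 8;
	void *buf = malloc(bufsize);
	long len;

	for (;;) {
		len = query_attr(buf, bufsize);
		if ((size_t)len <= bufsize)
			break;		/* got the whole value */
		bufsize = len;		/* told how big it really is */
		buf = realloc(buf, bufsize);
	}
	*sizep = len;
	return buf;
}
```

At most two calls are needed once the reported size stabilises, which is
the efficiency argument the cover letter makes against per-attribute files.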
In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:

	fd = fsopen("ext4", 0);
	struct fsinfo_params params = {
		.flags		= FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request	= FSINFO_ATTR_CONFIGURATION,
	};
	char buffer[65536];
	fsinfo(fd, NULL, &params, &buffer, sizeof(buffer));

even if that context doesn't currently have a superblock attached.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	fsinfo-core

===================
SIGNIFICANT CHANGES
===================

ver #19:

 (*) Split FSINFO_ATTR_MOUNT_TOPOLOGY from FSINFO_ATTR_MOUNT_INFO.  The
     latter requires no locking as it looks no further than the mount
     object it's dealing with.  The topology attribute, however, has to
     take the namespace lock.  That said, the info attribute includes a
     counter that indicates how many times a mount object's position in the
     topology has changed.

 (*) A bit of patch rearrangement to put the mount topology-exposing
     attributes into one patch.

 (*) Pass both AT_* and RESOLVE_* flags to fsinfo() as suggested by Linus,
     rather than adding missing RESOLVE_* flags.

ver #18:

 (*) Moved the mount and superblock notification patches into a different
     branch.

 (*) Made superblock configuration (->show_opts), bindmount path
     (->show_path) and filesystem statistics (->show_stats) available as
     the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.

 (*) Made the mountpoint device name available, filtered through the
     superblock (->show_devname), as the SOURCE attribute.

 (*) Made the mountpoint available as a full path as well as a relative
     one.

 (*) Added more event counters to MOUNT_INFO, including a subtree
     notification counter, to make it easier to clean up after a
     notification overrun.

 (*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
     five event counters.

 (*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
     also, so that mount ID reuse can be detected.
 (*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
     avoid duplicate information.

 (*) Switched to using the RESOLVE_* flags rather than AT_* flags for
     pathwalk control.  Added more RESOLVE_* flags.

 (*) Used a lock instead of RCU to enumerate children for the
     MOUNT_CHILDREN attribute for safety.  This is probably worth
     revisiting at a later date, however.

ver #17:

 (*) Applied comments from Jann Horn, Darrick Wong and Christian Brauner.

 (*) Rearranged the order in which fsinfo() does things so that the
     superblock operations table can have a function pointer rather than a
     table pointer.  The ->fsinfo() op is now called at least twice, once
     to determine the size of buffer needed and then to retrieve the data.
     If the retrieval step indicates yet more space is needed, the buffer
     will be expanded and that step repeated.

 (*) Merged the element size into the size in the fsinfo_attribute def and
     stopped setting the size for strings and opaques; a helper works that
     out instead.  This means that strings can actually get larger than 4K.

 (*) A helper is provided to scan a list of attributes and call the
     appropriate get function.  This can be called from a filesystem's
     ->fsinfo() method multiple times.  It also handles attribute
     enumeration and info querying.

 (*) Rearranged the patches to put all the notification patches first.
     This allowed some of the bits to be squashed together.  At some point,
     I'll move the notification patches into a different branch.

ver #16:

 (*) Split the feature bits out of the fsinfo() core into their own patch
     and got rid of the name encoding attributes.

 (*) Renamed the 'array' type to 'list' and made AFS use it for returning
     server address lists.

 (*) Changed the ->fsinfo() method into an ->fsinfo_attributes[] table,
     where each attribute has a ->get() method to deal with it.  These
     tables can then be returned with an fsinfo meta attribute.

 (*) Dropped the fscontext query and parameter/description retrieval
     attributes for now.
 (*) Picked the mount topology attributes into this branch.

 (*) Picked the mount notifications into this branch and rebased on top of
     notifications-pipe-core.

 (*) Picked the superblock notifications into this branch.

 (*) Added sample code for Ext4 and NFS.

David
---
David Howells (13):
      fsinfo: Add fsinfo() syscall to query filesystem information
      fsinfo: Provide a bitmap of supported features
      fsinfo: Allow retrieval of superblock devname, options and stats
      fsinfo: Allow fsinfo() to look up a mount object by ID
      fsinfo: Add a uniquifier ID to struct mount
      fsinfo: Allow mount information to be queried
      fsinfo: Allow mount topology and propagation info to be retrieved
      fsinfo: Provide notification overrun handling support
      fsinfo: sample: Mount listing program
      fsinfo: Add API documentation
      fsinfo: Add support for AFS
      fsinfo: Example support for Ext4
      fsinfo: Example support for NFS

 Documentation/filesystems/fsinfo.rst        | 574 +++++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1
 arch/arm/tools/syscall.tbl                  |   1
 arch/arm64/include/asm/unistd.h             |   2
 arch/arm64/include/asm/unistd32.h           |   2
 arch/ia64/kernel/syscalls/syscall.tbl       |   1
 arch/m68k/kernel/syscalls/syscall.tbl       |   1
 arch/microblaze/kernel/syscalls/syscall.tbl |   1
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1
 arch/parisc/kernel/syscalls/syscall.tbl     |   1
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1
 arch/s390/kernel/syscalls/syscall.tbl       |   1
 arch/sh/kernel/syscalls/syscall.tbl         |   1
 arch/sparc/kernel/syscalls/syscall.tbl      |   1
 arch/x86/entry/syscalls/syscall_32.tbl      |   1
 arch/x86/entry/syscalls/syscall_64.tbl      |   1
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1
 fs/Kconfig                                  |   7
 fs/Makefile                                 |   1
 fs/afs/internal.h                           |   1
 fs/afs/super.c                              | 218 +++++++
 fs/d_path.c                                 |   2
 fs/ext4/Makefile                            |   1
 fs/ext4/ext4.h                              |   6
 fs/ext4/fsinfo.c                            |  45 +
 fs/ext4/super.c                             |   3
 fs/fsinfo.c                                 | 725 ++++++++++++++++++++++
 fs/internal.h                               |  14
 fs/mount.h                                  |   3
 fs/mount_notify.c                           |   2
 fs/namespace.c                              | 389 ++++++++++++
 fs/nfs/Makefile                             |   1
 fs/nfs/fsinfo.c                             | 230 +++++++
 fs/nfs/internal.h                           |   6
 fs/nfs/nfs4super.c                          |   3
 fs/nfs/super.c                              |   3
 include/linux/fs.h                          |   4
 include/linux/fsinfo.h                      | 111 +++
 include/linux/syscalls.h                    |   4
 include/uapi/asm-generic/unistd.h           |   4
 include/uapi/linux/fsinfo.h                 | 371 +++++++++++
 include/uapi/linux/mount.h                  |  10
 include/uapi/linux/windows.h                |  35 +
 kernel/sys_ni.c                             |   1
 samples/vfs/Makefile                        |   7
 samples/vfs/test-fsinfo.c                   | 891 +++++++++++++++++++++++++++
 samples/vfs/test-mntinfo.c                  | 279 ++++++++
 49 files changed, 3962 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/filesystems/fsinfo.rst
 create mode 100644 fs/ext4/fsinfo.c
 create mode 100644 fs/fsinfo.c
 create mode 100644 fs/nfs/fsinfo.c
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/windows.h
 create mode 100644 samples/vfs/test-fsinfo.c
 create mode 100644 samples/vfs/test-mntinfo.c
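The cover letter above mentions a core helper that scans a list of
attributes and calls the appropriate get function.  A simplified userspace
sketch of that table-driven dispatch pattern follows; the struct layouts
and names here are illustrative only, not the actual kernel definitions
from fs/fsinfo.c:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for the kernel-side context and table entry. */
struct fsinfo_context {
	unsigned int	requested_id;	/* attribute ID being queried */
	char		buffer[256];	/* core-managed result buffer */
};

struct fsinfo_attribute {
	unsigned int	attr_id;
	int		(*get)(struct fsinfo_context *ctx);
};

/* Example per-filesystem ->get() handler. */
static int get_volume_name(struct fsinfo_context *ctx)
{
	strcpy(ctx->buffer, "example-vol");
	return (int)strlen(ctx->buffer);	/* bytes of data produced */
}

/*
 * Core-style helper: scan a {}-terminated table for the requested ID and
 * call its handler.  A filesystem can call this over several tables.
 */
static int fsinfo_get_attribute(struct fsinfo_context *ctx,
				const struct fsinfo_attribute *table)
{
	const struct fsinfo_attribute *a;

	for (a = table; a->get; a++)
		if (a->attr_id == ctx->requested_id)
			return a->get(ctx);
	return -EOPNOTSUPP;	/* attribute not implemented here */
}

static const struct fsinfo_attribute example_attributes[] = {
	{ 0x001, get_volume_name },
	{}
};
```

The per-filesystem code stays declarative (a table of IDs and handlers),
while buffering and extensibility live entirely in the core helper.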
[PATCH 12/13] fsinfo: Example support for Ext4 [ver #19]
From: David Howells <dhowells@redhat.com>
Date: 2020-03-18 15:09 UTC
To: torvalds, viro
Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, dhowells, raven, mszeredi,
    christian, jannh, darrick.wong, kzak, jlayton, linux-api,
    linux-fsdevel, linux-security-module, linux-kernel

Add the ability to list some Ext4 volume timestamps as an example.

Is this useful for ext4?  Is there anything else that could be useful?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
---
 fs/ext4/Makefile            |  1 +
 fs/ext4/ext4.h              |  6 ++++++
 fs/ext4/fsinfo.c            | 45 +++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c             |  3 +++
 include/uapi/linux/fsinfo.h | 16 +++++++++++++++
 samples/vfs/test-fsinfo.c   | 35 +++++++++++++++++++++++++++++++++
 6 files changed, 106 insertions(+)
 create mode 100644 fs/ext4/fsinfo.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 4ccb3c9189d8..71d5b460c7c7 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -16,3 +16,4 @@ ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
 ext4-inode-test-objs += inode-test.o
 obj-$(CONFIG_EXT4_KUNIT_TESTS) += ext4-inode-test.o
 ext4-$(CONFIG_FS_VERITY) += verity.o
+ext4-$(CONFIG_FSINFO) += fsinfo.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a052052..f0304aa107f8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -42,6 +42,7 @@
 #include <linux/fscrypt.h>
 #include <linux/fsverity.h>
+#include <linux/fsinfo.h>
 #include <linux/compiler.h>
@@ -3190,6 +3191,11 @@ extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+/* fsinfo.c */
+#ifdef CONFIG_FSINFO
+extern int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx);
+#endif
+
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
 extern int ext4_find_inline_data_nolock(struct inode *inode);
diff --git a/fs/ext4/fsinfo.c b/fs/ext4/fsinfo.c
new file mode 100644
index 000000000000..785f82a74dc9
--- /dev/null
+++ b/fs/ext4/fsinfo.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information for ext4
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/mount.h>
+#include "ext4.h"
+
+static int ext4_fsinfo_get_volume_name(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+
+	memcpy(ctx->buffer, es->s_volume_name, sizeof(es->s_volume_name));
+	return strlen(ctx->buffer);
+}
+
+static int ext4_fsinfo_get_timestamps(struct path *path, struct fsinfo_context *ctx)
+{
+	const struct ext4_sb_info *sbi = EXT4_SB(path->mnt->mnt_sb);
+	const struct ext4_super_block *es = sbi->s_es;
+	struct fsinfo_ext4_timestamps *ts = ctx->buffer;
+
+#define Z(R,S) R = S | (((u64)S##_hi) << 32)
+	Z(ts->mkfs_time,	es->s_mkfs_time);
+	Z(ts->mount_time,	es->s_mtime);
+	Z(ts->write_time,	es->s_wtime);
+	Z(ts->last_check_time,	es->s_lastcheck);
+	Z(ts->first_error_time,	es->s_first_error_time);
+	Z(ts->last_error_time,	es->s_last_error_time);
+	return sizeof(*ts);
+}
+
+static const struct fsinfo_attribute ext4_fsinfo_attributes[] = {
+	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	ext4_fsinfo_get_volume_name),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_get_timestamps),
+	{}
+};
+
+int ext4_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+	return fsinfo_get_attribute(path, ctx, ext4_fsinfo_attributes);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ff1b764b0c0e..3655fbeab754 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1487,6 +1487,9 @@ static const struct super_operations ext4_sops = {
 	.freeze_fs	= ext4_freeze,
 	.unfreeze_fs	= ext4_unfreeze,
 	.statfs		= ext4_statfs,
+#ifdef CONFIG_FSINFO
+	.fsinfo		= ext4_fsinfo,
+#endif
 	.remount_fs	= ext4_remount,
 	.show_options	= ext4_show_options,
 #ifdef CONFIG_QUOTA
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 150b693a1b5a..4cfb71227eff 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -42,6 +42,8 @@
 #define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES 0x302	/* List of addresses of the Nth server */
+#define FSINFO_ATTR_EXT4_TIMESTAMPS	0x400	/* Ext4 superblock timestamps */
+
 /*
  * Optional fsinfo() parameter structure.
  *
@@ -323,4 +325,18 @@ struct fsinfo_afs_server_address {
 #define FSINFO_ATTR_AFS_SERVER_ADDRESSES__STRUCT struct fsinfo_afs_server_address
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_EXT4_TIMESTAMPS).
+ */
+struct fsinfo_ext4_timestamps {
+	__u64	mkfs_time;
+	__u64	mount_time;
+	__u64	write_time;
+	__u64	last_check_time;
+	__u64	first_error_time;
+	__u64	last_error_time;
+};
+
+#define FSINFO_ATTR_EXT4_TIMESTAMPS__STRUCT struct fsinfo_ext4_timestamps
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index 9f9564f7f73e..6ad1128a3e1d 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -384,6 +384,40 @@ static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
 	printf("family=%u\n", ss->ss_family);
 }
+static char *dump_ext4_time(char *buffer, time_t tim)
+{
+	struct tm tm;
+	int len;
+
+	if (tim == 0)
+		return "-";
+
+	if (!localtime_r(&tim, &tm)) {
+		perror("localtime_r");
+		exit(1);
+	}
+	len = strftime(buffer, 100, "%F %T", &tm);
+	if (len == 0) {
+		perror("strftime");
+		exit(1);
+	}
+	return buffer;
+}
+
+static void dump_ext4_fsinfo_timestamps(void *reply, unsigned int size)
+{
+	struct fsinfo_ext4_timestamps *r = reply;
+	char buffer[100];
+
+	printf("\n");
+	printf("\tmkfs    : %s\n", dump_ext4_time(buffer, r->mkfs_time));
+	printf("\tmount   : %s\n", dump_ext4_time(buffer, r->mount_time));
+	printf("\twrite   : %s\n", dump_ext4_time(buffer, r->write_time));
+	printf("\tfsck    : %s\n", dump_ext4_time(buffer, r->last_check_time));
+	printf("\t1st-err : %s\n", dump_ext4_time(buffer, r->first_error_time));
+	printf("\tlast-err: %s\n", dump_ext4_time(buffer, r->last_error_time));
+}
+
 static void dump_string(void *reply, unsigned int size)
 {
 	char *s = reply, *p;
@@ -471,6 +505,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
 	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	string),
 	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	string),
 	FSINFO_LIST_N	(FSINFO_ATTR_AFS_SERVER_ADDRESSES, afs_fsinfo_server_address),
+	FSINFO_VSTRUCT	(FSINFO_ATTR_EXT4_TIMESTAMPS,	ext4_fsinfo_timestamps),
 	{}
 };
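The Z() macro in the patch above widens ext4's split on-disk timestamps by
combining a 32-bit low word with a high word shifted up by 32 bits.  The
same combination as a standalone userspace sketch (in this example patch
the whole high word is used; note that mainline ext4's extended timestamp
fields encode extra epoch bits rather than a full 32-bit high word):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the Z() macro from the patch above: widen a split
 * on-disk timestamp by OR-ing the low 32 bits with the high word shifted
 * up by 32.  Field names and widths are illustrative.
 */
static uint64_t ext4_combine_time(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

This keeps the low word's 1970-based seconds intact while letting the high
word extend the representable range beyond 2038.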
Re: [PATCH 00/13] VFS: Filesystem information [ver #19]
From: Miklos Szeredi
Date: 2020-03-18 16:05 UTC
To: David Howells
Cc: Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna
    Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Ian
    Kent, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong,
    Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel

On Wed, Mar 18, 2020 at 4:08 PM David Howells <dhowells@redhat.com> wrote:

> ============================
> WHY NOT USE PROCFS OR SYSFS?
> ============================
>
> Why is it better to go with a new system call rather than adding more
> magic stuff to /proc or /sys for each superblock object and each mount
> object?
>
> (1) It can be targeted.  It makes it easy to query directly by path.
>     procfs and sysfs cannot do this easily.
>
> (2) It's more efficient as we can return specific binary data rather than
>     making huge text dumps.  Granted, sysfs and procfs could present the
>     same data, though as lots of little files which have to be
>     individually opened, read, closed and parsed.

Asked this a number of times, but you haven't answered yet: what
application would require such a high efficiency?

Nobody's suggesting we move stat(2) to proc interfaces, and AFAIK nobody
suggested we move /proc/PID/* to a binary syscall interface.  Each one has
its place, and I strongly feel that mount info belongs in the latter
category.  Feel free to prove the opposite.

> (3) We wouldn't have the overhead of open and close (even adding a
>     self-contained readfile() syscall has to do that internally)

Busted: add f_op->readfile() and be done with all that.  For example,
DEFINE_SHOW_ATTRIBUTE() could be trivially moved to that interface.

We could optimize existing proc, sys, etc. interfaces, but it's not been an
issue, apparently.

> (4) Opening a file in procfs or sysfs has a pathwalk overhead for each
>     file accessed.  We can use an integer attribute ID instead (yes, this
>     is similar to ioctl) - but could also use a string ID if that is
>     preferred.
>
> (5) Can easily query cross-namespace if, say, a container manager process
>     is given an fs_context that hasn't yet been mounted into a namespace -
>     or hasn't even been fully created yet.

Works with my patch.

> (6) Don't have to create/delete a bunch of sysfs/procfs nodes each time a
>     mount happens or is removed - and since systemd makes much use of
>     mount namespaces and mount propagation, this will create a lot of
>     nodes.

Not true.

> The argument for doing this through procfs/sysfs/somemagicfs is that
> someone using a shell can just query the magic files using ordinary text
> tools, such as cat - and that has merit - but it doesn't solve the
> query-by-pathname problem.
>
> The suggested way around the query-by-pathname problem is to open the
> target file O_PATH and then look in a magic directory under procfs
> corresponding to the fd number to see a set of attribute files[*] laid
> out.  Bash, however, can't open by O_PATH or O_NOFOLLOW as things
> stand...

Bash doesn't have fsinfo(2) either, so that's not really a good argument.

Implementing a utility to show mount attribute(s) by path is trivial for
the file-based interface, while it would need to be updated for each
extension of fsinfo(2).  The same goes for libc, language bindings, etc.

Thanks,
Miklos
Re: [PATCH 00/13] VFS: Filesystem information [ver #19]
From: Ian Kent
Date: 2020-04-01 5:22 UTC
To: Miklos Szeredi, David Howells
Cc: Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna
    Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust,
    Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel
    Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel

On Wed, 2020-03-18 at 17:05 +0100, Miklos Szeredi wrote:
> On Wed, Mar 18, 2020 at 4:08 PM David Howells <dhowells@redhat.com>
> wrote:
>
> > ============================
> > WHY NOT USE PROCFS OR SYSFS?
> > ============================
> >
> > Why is it better to go with a new system call rather than adding more
> > magic stuff to /proc or /sys for each superblock object and each mount
> > object?
> >
> > (1) It can be targeted.  It makes it easy to query directly by path.
> >     procfs and sysfs cannot do this easily.
> >
> > (2) It's more efficient as we can return specific binary data rather
> >     than making huge text dumps.  Granted, sysfs and procfs could
> >     present the same data, though as lots of little files which have
> >     to be individually opened, read, closed and parsed.
>
> Asked this a number of times, but you haven't answered yet: what
> application would require such a high efficiency?

Umm ... systemd and udisks2 and about 4 others.

A problem I've had with autofs for years is that autofs direct mount maps
of any appreciable size cause several key user space applications to
consume all available CPU while autofs is starting or stopping, which
takes a fair while with a very large mount table.

I saw a couple of applications affected purely because of the large mount
table, but not as badly as starting or stopping autofs.
Maps of 5,000 to 10,000 entries can almost be handled - not uncommon for
heavy autofs users in spite of the problem - but much larger than that and
you've got a serious problem.

There are problems with expiration as well, but that's more an autofs
problem that I need to fix.

To be clear, it's not autofs that needs the improvement (I need to deal
with this in autofs itself); it's the effect that these large mount tables
have on the rest of user space, and that's quite significant.

I can't even think about resolving my autofs problem until this problem is
resolved, and handling very large numbers of mounts as efficiently as
possible must be part of that solution for me - and, I think, for the OS
overall too.

Ian

> Nobody's suggesting we move stat(2) to proc interfaces, and AFAIK
> nobody suggested we move /proc/PID/* to a binary syscall interface.
> Each one has its place, and I strongly feel that mount info belongs in
> the latter category.  Feel free to prove the opposite.
>
> > (3) We wouldn't have the overhead of open and close (even adding a
> >     self-contained readfile() syscall has to do that internally)
>
> Busted: add f_op->readfile() and be done with all that.  For example
> DEFINE_SHOW_ATTRIBUTE() could be trivially moved to that interface.
>
> We could optimize existing proc, sys, etc. interfaces, but it's not
> been an issue, apparently.
>
> > (4) Opening a file in procfs or sysfs has a pathwalk overhead for
> >     each file accessed.  We can use an integer attribute ID instead
> >     (yes, this is similar to ioctl) - but could also use a string ID
> >     if that is preferred.
> >
> > (5) Can easily query cross-namespace if, say, a container manager
> >     process is given an fs_context that hasn't yet been mounted into
> >     a namespace - or hasn't even been fully created yet.
>
> Works with my patch.
> > > (6) Don't have to create/delete a bunch of sysfs/procfs nodes each > > time a > > mount happens or is removed - and since systemd makes much use > > of > > mount namespaces and mount propagation, this will create a lot > > of > > nodes. > > Not true. > > > The argument for doing this through procfs/sysfs/somemagicfs is > > that > > someone using a shell can just query the magic files using ordinary > > text > > tools, such as cat - and that has merit - but it doesn't solve the > > query-by-pathname problem. > > > > The suggested way around the query-by-pathname problem is to open > > the > > target file O_PATH and then look in a magic directory under procfs > > corresponding to the fd number to see a set of attribute files[*] > > laid out. > > Bash, however, can't open by O_PATH or O_NOFOLLOW as things > > stand... > > Bash doesn't have fsinfo(2) either, so that's not really a good > argument. > > Implementing a utility to show mount attribute(s) by path is trivial > for the file based interface, while it would need to be updated for > each extension of fsinfo(2). Same goes for libc, language bindings, > etc. > > Thanks, > Miklos ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-04-01 5:22 ` Ian Kent @ 2020-04-01 8:18 ` Miklos Szeredi 2020-04-01 8:27 ` David Howells 1 sibling, 0 replies; 13+ messages in thread From: Miklos Szeredi @ 2020-04-01 8:18 UTC (permalink / raw) To: Ian Kent Cc: David Howells, Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel On Wed, Apr 1, 2020 at 7:22 AM Ian Kent <raven@themaw.net> wrote: > > On Wed, 2020-03-18 at 17:05 +0100, Miklos Szeredi wrote: > > On Wed, Mar 18, 2020 at 4:08 PM David Howells <dhowells@redhat.com> > > wrote: > > > > > ============================ > > > WHY NOT USE PROCFS OR SYSFS? > > > ============================ > > > > > > Why is it better to go with a new system call rather than adding > > > more magic > > > stuff to /proc or /sysfs for each superblock object and each mount > > > object? > > > > > > (1) It can be targetted. It makes it easy to query directly by > > > path. > > > procfs and sysfs cannot do this easily. > > > > > > (2) It's more efficient as we can return specific binary data > > > rather than > > > making huge text dumps. Granted, sysfs and procfs could > > > present the > > > same data, though as lots of little files which have to be > > > individually opened, read, closed and parsed. > > > > Asked this a number of times, but you haven't answered yet: what > > application would require such a high efficiency? > > Umm ... systemd and udisks2 and about 4 others. > > A problem I've had with autofs for years is using autofs direct mount > maps of any appreciable size cause several key user space applications > to consume all available CPU while autofs is starting or stopping which > takes a fair while with a very large mount table. 
> I saw a couple of applications affected purely because of the large
> mount table but not as badly as starting or stopping autofs.
>
> Maps of 5,000 to 10,000 map entries can almost be handled, not
> uncommon for heavy autofs users in spite of the problem, but much
> larger than that and you've got a serious problem.
>
> There are problems with expiration as well but that's more an autofs
> problem that I need to fix.
>
> To be clear it's not autofs that needs the improvement (I need to
> deal with this in autofs itself) it's the affect that these large
> mount tables have on the rest of the user space and that's quite
> significant.

According to dhowells' measurements, processing 100k mounts would take
a few seconds of system time (that's the time spent by the kernel to
retrieve the data; obviously the userspace processing would add to
that, but that's independent of the kernel patchset).

I think that sort of time spent by the kernel is entirely reasonable
and is probably not worth heavy optimization, since userspace is
probably going to spend as much, if not more, time on each mount entry.

> I can't even think about resolving my autofs problem until this
> problem is resolved and handling very large numbers of mounts
> as efficiently as possible must be part of that solution for me
> and I think for the OS overall too.

The key to that is allowing userspace to retrieve individual mount
entries instead of having to parse the complete mount table on every
change.

Thanks,
Miklos
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-04-01 5:22 ` Ian Kent 2020-04-01 8:18 ` Miklos Szeredi @ 2020-04-01 8:27 ` David Howells 2020-04-01 8:37 ` Miklos Szeredi 1 sibling, 1 reply; 13+ messages in thread From: David Howells @ 2020-04-01 8:27 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Ian Kent, Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel Miklos Szeredi <miklos@szeredi.hu> wrote: > According to dhowell's measurements processing 100k mounts would take > about a few seconds of system time (that's the time spent by the > kernel to retrieve the data, But the inefficiency of mountfs - at least as currently implemented - scales up with the number of individual values you want to retrieve, both in terms of memory usage and time taken. With fsinfo(), I've tried to batch values together where it makes sense - and there's no lingering memory overhead - no extra inodes, dentries and files required. David ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-04-01 8:27 ` David Howells @ 2020-04-01 8:37 ` Miklos Szeredi 2020-04-01 12:35 ` Miklos Szeredi ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Miklos Szeredi @ 2020-04-01 8:37 UTC (permalink / raw) To: David Howells Cc: Ian Kent, Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel On Wed, Apr 1, 2020 at 10:27 AM David Howells <dhowells@redhat.com> wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > According to dhowell's measurements processing 100k mounts would take > > about a few seconds of system time (that's the time spent by the > > kernel to retrieve the data, > > But the inefficiency of mountfs - at least as currently implemented - scales > up with the number of individual values you want to retrieve, both in terms of > memory usage and time taken. I've taken that into account when guesstimating a "few seconds per 100k entries". My guess is that there's probably an order of magnitude difference between the performance of a fs based interface and a binary syscall based interface. That could be reduced somewhat with a readfile(2) type API. But the point is: this does not matter. Whether it's .5s or 5s is completely irrelevant, as neither is going to take down the system, and userspace processing is probably going to take as much, if not more time. And remember, we are talking about stopping and starting the automount daemon, which is something that happens, but it should not happen often by any measure. > With fsinfo(), I've tried to batch values together where it makes sense - and > there's no lingering memory overhead - no extra inodes, dentries and files > required. 
The dentries, inodes and files in your test are single use (except the root dentry) and can be made ephemeral if that turns out to be better. My guess is that dentries belonging to individual attributes should be deleted on final put, while the dentries belonging to the mount directory can be reclaimed normally. Thanks, Miklos ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-04-01 8:37 ` Miklos Szeredi @ 2020-04-01 12:35 ` Miklos Szeredi 2020-04-01 15:51 ` David Howells 2020-04-02 1:38 ` Ian Kent 2 siblings, 0 replies; 13+ messages in thread From: Miklos Szeredi @ 2020-04-01 12:35 UTC (permalink / raw) To: David Howells Cc: Ian Kent, Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1618 bytes --] On Wed, Apr 1, 2020 at 10:37 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Wed, Apr 1, 2020 at 10:27 AM David Howells <dhowells@redhat.com> wrote: > > > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > According to dhowell's measurements processing 100k mounts would take > > > about a few seconds of system time (that's the time spent by the > > > kernel to retrieve the data, > > > > But the inefficiency of mountfs - at least as currently implemented - scales > > up with the number of individual values you want to retrieve, both in terms of > > memory usage and time taken. > > I've taken that into account when guesstimating a "few seconds per > 100k entries". My guess is that there's probably an order of > magnitude difference between the performance of a fs based interface > and a binary syscall based interface. That could be reduced somewhat > with a readfile(2) type API. And to show that I'm not completely off base, attached a patch that adds a limited readfile(2) syscall and uses it in the p2 method. 
Results are promising: ./test-fsinfo-perf /tmp/a 30000 --- make mounts --- --- test fsinfo by path --- sum(mnt_id) = 930000 --- test fsinfo by mnt_id --- sum(mnt_id) = 930000 --- test /proc/fdinfo --- sum(mnt_id) = 930000 --- test mountfs --- sum(mnt_id) = 930000 For 30000 mounts, f= 146400us f2= 136766us p= 1406569us p2= 221669us; p=9.6*f p=10.3*f2 p=6.3*p2 --- umount --- This is about a 2 fold increase in speed compared to open + read + close. Is someone still worried about performance, or can we move on to more interesting parts of the design? Thanks, Miklos [-- Attachment #2: fsmount-readfile.patch --] [-- Type: text/x-patch, Size: 6326 bytes --] Index: linux/fs/mountfs/super.c =================================================================== --- linux.orig/fs/mountfs/super.c 2020-04-01 14:21:24.609955072 +0200 +++ linux/fs/mountfs/super.c 2020-04-01 14:21:42.426151545 +0200 @@ -51,10 +51,11 @@ static bool mountfs_entry_visible(struct return visible; } + static int mountfs_attr_show(struct seq_file *sf, void *v) { const char *name = sf->file->f_path.dentry->d_name.name; - struct mountfs_entry *entry = sf->private; + struct mountfs_entry *entry = file_inode(sf->file)->i_private; struct mount *mnt; struct vfsmount *m; struct super_block *sb; @@ -140,12 +141,40 @@ static int mountfs_attr_show(struct seq_ return err; } +ssize_t mountfs_attr_readfile(struct file *file, char __user *buf, size_t size) +{ + struct seq_file m = { .size = PAGE_SIZE, .file = file }; + ssize_t ret; + +retry: + m.buf = kvmalloc(m.size, GFP_KERNEL); + if (!m.buf) + return -ENOMEM; + + ret = mountfs_attr_show(&m, NULL); + if (!ret) { + if (m.count == m.size) { + kvfree(m.buf); + m.size <<= 1; + m.count = 0; + goto retry; + } + ret = min(m.count, size); + if (copy_to_user(buf, m.buf, ret)) + ret = -EFAULT; + } + + kvfree(m.buf); + return ret; +} + static int mountfs_attr_open(struct inode *inode, struct file *file) { - return single_open(file, mountfs_attr_show, inode->i_private); + return 
single_open(file, mountfs_attr_show, NULL); } static const struct file_operations mountfs_attr_fops = { + .readfile = mountfs_attr_readfile, .open = mountfs_attr_open, .read = seq_read, .llseek = seq_lseek, Index: linux/samples/vfs/test-fsinfo-perf.c =================================================================== --- linux.orig/samples/vfs/test-fsinfo-perf.c 2020-04-01 14:21:24.609955072 +0200 +++ linux/samples/vfs/test-fsinfo-perf.c 2020-04-01 14:21:42.426151545 +0200 @@ -172,6 +172,12 @@ static void get_id_by_proc(int ix, const //printf("[%u] %u\n", ix, x); } +static long readfile(int dfd, const char *name, char *buffer, size_t size, + int flags) +{ + return syscall(__NR_readfile, dfd, name, buffer, size, flags); +} + static void get_id_by_fsinfo_2(void) { struct fsinfo_mount_topology t; @@ -300,11 +306,8 @@ static void get_id_by_mountfs(void) } sprintf(procfile, "%u/parent", mnt_id); - fd = openat(mntfd, procfile, O_RDONLY); - ERR(fd, procfile); - len = read(fd, buffer, sizeof(buffer) - 1); - ERR(len, "read/parent"); - close(fd); + len = readfile(mntfd, procfile, buffer, sizeof(buffer), 0); + ERR(len, "readfile/parent"); if (len > 0 && buffer[len - 1] == '\n') len--; buffer[len] = 0; @@ -319,11 +322,8 @@ static void get_id_by_mountfs(void) sum_check += x; sprintf(procfile, "%u/counter", mnt_id); - fd = openat(mntfd, procfile, O_RDONLY); - ERR(fd, procfile); - len = read(fd, buffer, sizeof(buffer) - 1); - ERR(len, "read/counter"); - close(fd); + len = readfile(mntfd, procfile, buffer, sizeof(buffer) - 1, 0); + ERR(len, "readfile/counter"); if (len > 0 && buffer[len - 1] == '\n') len--; buffer[len] = 0; Index: linux/arch/x86/entry/syscalls/syscall_64.tbl =================================================================== --- linux.orig/arch/x86/entry/syscalls/syscall_64.tbl 2020-04-01 14:21:37.284094840 +0200 +++ linux/arch/x86/entry/syscalls/syscall_64.tbl 2020-04-01 14:21:42.412151390 +0200 @@ -362,6 +362,7 @@ 439 common watch_mount __x64_sys_watch_mount 440 
common watch_sb __x64_sys_watch_sb 441 common fsinfo __x64_sys_fsinfo +442 common readfile __x64_sys_readfile # # x32-specific system call numbers start at 512 to avoid cache impact Index: linux/fs/open.c =================================================================== --- linux.orig/fs/open.c 2020-04-01 14:21:37.284094840 +0200 +++ linux/fs/open.c 2020-04-01 14:21:42.424151523 +0200 @@ -1340,3 +1340,25 @@ int stream_open(struct inode *inode, str } EXPORT_SYMBOL(stream_open); + +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename, + char __user *, buffer, size_t, bufsize, int, flags) +{ + ssize_t ret; + struct file file = {}; + + if (flags) + return -EINVAL; + + ret = user_path_at(dfd, filename, 0, &file.f_path); + if (!ret) { + file.f_inode = file.f_path.dentry->d_inode; + file.f_op = file.f_inode->i_fop; + ret = -EOPNOTSUPP; + if (file.f_op->readfile) + ret = file.f_op->readfile(&file, buffer, bufsize); + path_put(&file.f_path); + } + + return ret; +} Index: linux/include/linux/syscalls.h =================================================================== --- linux.orig/include/linux/syscalls.h 2020-04-01 14:21:37.284094840 +0200 +++ linux/include/linux/syscalls.h 2020-04-01 14:21:42.413151401 +0200 @@ -1011,6 +1011,8 @@ asmlinkage long sys_watch_sb(int dfd, co asmlinkage long sys_fsinfo(int dfd, const char __user *pathname, struct fsinfo_params __user *params, size_t params_size, void __user *result_buffer, size_t result_buf_size); +asmlinkage long sys_readfile(int dfd, const char __user *filename, + char __user *buffer, size_t bufsize, int flags); /* * Architecture-specific system calls Index: linux/include/uapi/asm-generic/unistd.h =================================================================== --- linux.orig/include/uapi/asm-generic/unistd.h 2020-04-01 14:21:37.284094840 +0200 +++ linux/include/uapi/asm-generic/unistd.h 2020-04-01 14:21:42.413151401 +0200 @@ -861,9 +861,11 @@ __SYSCALL(__NR_watch_mount, sys_watch_mo 
__SYSCALL(__NR_watch_sb, sys_watch_sb) #define __NR_fsinfo 441 __SYSCALL(__NR_fsinfo, sys_fsinfo) +#define __NR_readfile 442 +__SYSCALL(__NR_readfile, sys_readfile) #undef __NR_syscalls -#define __NR_syscalls 442 +#define __NR_syscalls 443 /* * 32 bit systems traditionally used different Index: linux/include/linux/fs.h =================================================================== --- linux.orig/include/linux/fs.h 2020-04-01 14:21:19.144894804 +0200 +++ linux/include/linux/fs.h 2020-04-01 14:21:42.425151534 +0200 @@ -1868,6 +1868,7 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + ssize_t (*readfile)(struct file *, char __user *, size_t); } __randomize_layout; struct inode_operations { ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19]
  2020-04-01  8:37 ` Miklos Szeredi
  2020-04-01 12:35   ` Miklos Szeredi
@ 2020-04-01 15:51   ` David Howells
  2020-04-02  1:38   ` Ian Kent
  2 siblings, 0 replies; 13+ messages in thread
From: David Howells @ 2020-04-01 15:51 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, Ian Kent, Linus Torvalds, Al Viro, Linux NFS list,
	Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API,
	linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner,
	Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton,
	linux-fsdevel, LSM, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> For 30000 mounts, f= 146400us f2= 136766us p= 1406569us p2=
> 221669us; p=9.6*f p=10.3*f2 p=6.3*p2

	f =  146400us
	f2=  136766us
	p = 1406569us  <--- Order of magnitude slower
	p2=  221669us

And more memory used because it's added a whole bunch of inodes and
dentries to the cache.  For each mount that's a pair for each dir and a
pair for each file within the dir.  So for the two files my test is
reading, for 30000 mounts, that's 90000 dentries and 90000 inodes in
mountfs alone.

	(gdb) p sizeof(struct dentry)
	$1 = 216
	(gdb) p sizeof(struct inode)
	$2 = 696
	(gdb) p (216+696)*30000*3/1024/1024
	$3 = 78

so ~78 MiB of RAM added to the caches in an extreme case.  We're seeing
customers with 10000+ mounts - that would be ~26 MiB, just to read two
values from each mount.

I presume you're not going through /proc/fdinfo each time as that would
add another d+i per mount - for over 100 MiB added to the caches for
30000 mounts.

David
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-04-01 8:37 ` Miklos Szeredi 2020-04-01 12:35 ` Miklos Szeredi 2020-04-01 15:51 ` David Howells @ 2020-04-02 1:38 ` Ian Kent 2020-04-02 14:14 ` Karel Zak 2 siblings, 1 reply; 13+ messages in thread From: Ian Kent @ 2020-04-02 1:38 UTC (permalink / raw) To: Miklos Szeredi, David Howells Cc: Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel On Wed, 2020-04-01 at 10:37 +0200, Miklos Szeredi wrote: > On Wed, Apr 1, 2020 at 10:27 AM David Howells <dhowells@redhat.com> > wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > According to dhowell's measurements processing 100k mounts would > > > take > > > about a few seconds of system time (that's the time spent by the > > > kernel to retrieve the data, > > > > But the inefficiency of mountfs - at least as currently implemented > > - scales > > up with the number of individual values you want to retrieve, both > > in terms of > > memory usage and time taken. > > I've taken that into account when guesstimating a "few seconds per > 100k entries". My guess is that there's probably an order of > magnitude difference between the performance of a fs based interface > and a binary syscall based interface. That could be reduced somewhat > with a readfile(2) type API. > > But the point is: this does not matter. Whether it's .5s or 5s is > completely irrelevant, as neither is going to take down the system, > and userspace processing is probably going to take as much, if not > more time. And remember, we are talking about stopping and starting > the automount daemon, which is something that happens, but it should > not happen often by any measure. Yes, but don't forget, I'm reporting what I saw when testing during development. 
From previous discussion we know systemd (and probably other apps like
udisks2 et al.) gets notified on mount and umount activity, so it's not
going to be just starting and stopping autofs that's a problem with
very large mount tables.

To get a feel for the real difference we'd need to make the libmount
changes for both, then compare the two and check behaviour.  The mount
and umount lookup case that Karel (and I) talked about should be
sufficient.

The biggest problem I had with fsinfo() when I was working with earlier
series was getting fs-specific options, in particular the need to use
the sb op ->fsinfo().  With this latest series David has made that part
of the generic code, and your patch also covers it.  So the thing that
was holding me up is done, and we should be getting on with the
libmount improvements; we need to settle this.

I prefer the system call interface, and I'm not offering justification
for that other than a general dislike (and on occasion outright
frustration) of pretty much every proc implementation I have had to
look at.

> 
> > With fsinfo(), I've tried to batch values together where it makes
> > sense - and there's no lingering memory overhead - no extra inodes,
> > dentries and files required.
> 
> The dentries, inodes and files in your test are single use (except
> the root dentry) and can be made ephemeral if that turns out to be
> better.  My guess is that dentries belonging to individual attributes
> should be deleted on final put, while the dentries belonging to the
> mount directory can be reclaimed normally.
> 
> Thanks,
> Miklos
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19]
  2020-04-02  1:38 ` Ian Kent
@ 2020-04-02 14:14   ` Karel Zak
  0 siblings, 0 replies; 13+ messages in thread
From: Karel Zak @ 2020-04-02 14:14 UTC (permalink / raw)
To: Ian Kent
Cc: Miklos Szeredi, David Howells, Linus Torvalds, Al Viro,
	Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o,
	Linux API, linux-ext4, Trond Myklebust, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Jeff Layton,
	linux-fsdevel, LSM, linux-kernel

On Thu, Apr 02, 2020 at 09:38:20AM +0800, Ian Kent wrote:
> I prefer the system call interface and I'm not offering justification
> for that other than a general dislike (and on occasion outright
> frustration) of pretty much every proc implementation I have had to
> look at.

Frankly, I'm modest - what about having both interfaces in the kernel,
fsinfo() as well as mountfs?  It's nothing unusual; block devices, for
example, have attributes accessible via /sys as well as via ioctl().

I can imagine that for complex or performance-sensitive tasks it's
better to use fsinfo(), but in other, simple use-cases (for example,
converting a mount point to a device name in shell) it's better to read
/proc/.../<attr>.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-03-18 15:08 [PATCH 00/13] VFS: Filesystem information [ver #19] David Howells 2020-03-18 15:09 ` [PATCH 12/13] fsinfo: Example support for Ext4 " David Howells 2020-03-18 16:05 ` [PATCH 00/13] VFS: Filesystem information " Miklos Szeredi @ 2020-03-19 10:37 ` David Howells 2020-03-19 12:36 ` Miklos Szeredi 2 siblings, 1 reply; 13+ messages in thread From: David Howells @ 2020-03-19 10:37 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel Miklos Szeredi <miklos@szeredi.hu> wrote: > > (2) It's more efficient as we can return specific binary data rather than > > making huge text dumps. Granted, sysfs and procfs could present the > > same data, though as lots of little files which have to be > > individually opened, read, closed and parsed. > > Asked this a number of times, but you haven't answered yet: what > application would require such a high efficiency? Low efficiency means more time doing this when that time could be spent doing other things - or even putting the CPU in a powersaving state. Using an open/read/close render-to-text-and-parse interface *will* be slower and less efficient as there are more things you have to do to use it. Then consider doing a walk over all the mounts in the case where there are 10000 of them - we have issues with /proc/mounts for such. fsinfo() will end up doing a lot less work. > I strongly feel that mount info belongs in the latter category I feel strongly that a lot of stuff done through /proc or /sys shouldn't be. 
Yes, it's nice that you can explore it with cat and poke it with echo,
but it has a number of problems: security, atomicity, efficiency and
providing a round-the-back way to pin stuff if not done right.

> > (3) We wouldn't have the overhead of open and close (even adding a
> >     self-contained readfile() syscall has to do that internally
> 
> Busted: add f_op->readfile() and be done with all that.  For example
> DEFINE_SHOW_ATTRIBUTE() could be trivially moved to that interface.

Look at your example.  "f_op->".  That's "file->f_op->" I presume.

You would have to make it "i_op->" to avoid the open and the close -
and for things like procfs and sysfs, that's probably entirely
reasonable - but bear in mind that you still have to apply all the LSM
file security controls, just in case the backing filesystem is, say,
ext4 rather than procfs.

> We could optimize existing proc, sys, etc. interfaces, but it's not
> been an issue, apparently.

You can't get rid of or change many of the existing interfaces.  A lot
of them are effectively indirect system calls and are, as such, part of
the fixed UAPI.  You'd have to add a parallel optimised set.

> > (6) Don't have to create/delete a bunch of sysfs/procfs nodes each
> >     time a mount happens or is removed - and since systemd makes
> >     much use of mount namespaces and mount propagation, this will
> >     create a lot of nodes.
> 
> Not true.

This may not be true if you roll your own special filesystem.  It *is*
true if you do it in procfs or sysfs.  The files don't exist if you
don't create nodes or attribute tables for them.

> > The argument for doing this through procfs/sysfs/somemagicfs is
> > that someone using a shell can just query the magic files using
> > ordinary text tools, such as cat - and that has merit - but it
> > doesn't solve the query-by-pathname problem.
> > 
> > The suggested way around the query-by-pathname problem is to open
> > the target file O_PATH and then look in a magic directory under
> > procfs corresponding to the fd number to see a set of attribute
> > files[*] laid out.  Bash, however, can't open by O_PATH or
> > O_NOFOLLOW as things stand...
> 
> Bash doesn't have fsinfo(2) either, so that's not really a good
> argument.

I never claimed that fsinfo() could be accessed directly from the
shell.  For your proposal, you claimed "immediately usable from all
programming languages, including scripts".

> Implementing a utility to show mount attribute(s) by path is trivial
> for the file based interface, while it would need to be updated for
> each extension of fsinfo(2).  Same goes for libc, language bindings,
> etc.

That's not precisely true.  If you aren't using an extension to an
fsinfo() attribute, you wouldn't need to change anything[*].  If you
want to use an extension - *even* through a file based interface - you
*would* have to change your code and your parser.

And, no, extending an fsinfo() attribute would not require any changes
to libc unless libc is using that attribute[*] and wants to access the
extension.

[*] I assume that in C/C++ at least, you'd use linux/fsinfo.h rather
    than some libc version.

[*] statfs() could be emulated this way, but I'm not sure what else
    libc specifically is going to look at.  This is more aimed at
    libmount amongst other things.

David
* Re: [PATCH 00/13] VFS: Filesystem information [ver #19] 2020-03-19 10:37 ` David Howells @ 2020-03-19 12:36 ` Miklos Szeredi 0 siblings, 0 replies; 13+ messages in thread From: Miklos Szeredi @ 2020-03-19 12:36 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Al Viro, Linux NFS list, Andreas Dilger, Anna Schumaker, Theodore Ts'o, Linux API, linux-ext4, Trond Myklebust, Ian Kent, Miklos Szeredi, Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak, Jeff Layton, linux-fsdevel, LSM, linux-kernel On Thu, Mar 19, 2020 at 11:37 AM David Howells <dhowells@redhat.com> wrote: > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > (2) It's more efficient as we can return specific binary data rather than > > > making huge text dumps. Granted, sysfs and procfs could present the > > > same data, though as lots of little files which have to be > > > individually opened, read, closed and parsed. > > > > Asked this a number of times, but you haven't answered yet: what > > application would require such a high efficiency? > > Low efficiency means more time doing this when that time could be spent doing > other things - or even putting the CPU in a powersaving state. Using an > open/read/close render-to-text-and-parse interface *will* be slower and less > efficient as there are more things you have to do to use it. > > Then consider doing a walk over all the mounts in the case where there are > 10000 of them - we have issues with /proc/mounts for such. fsinfo() will end > up doing a lot less work. Current /proc/mounts problems arise from the fact that mount info can only be queried for the whole namespace, and hence changes related to a single mount will require rescanning the complete mount list. If mount info can be queried for individual mounts, then the need to scan the complete list will be rare. That's *the* point of this change. 
> > > (3) We wouldn't have the overhead of open and close (even adding a > > > self-contained readfile() syscall has to do that internally > > > > Busted: add f_op->readfile() and be done with all that. For example > > DEFINE_SHOW_ATTRIBUTE() could be trivially moved to that interface. > > Look at your example. "f_op->". That's "file->f_op->" I presume. > > You would have to make it "i_op->" to avoid the open and the close - and for > things like procfs and sysfs, that's probably entirely reasonable - but bear > in mind that you still have to apply all the LSM file security controls, just > in case the backing filesystem is, say, ext4 rather than procfs. > > > We could optimize existing proc, sys, etc. interfaces, but it's not > > been an issue, apparently. > > You can't get rid of or change many of the existing interfaces. A lot of them > are effectively indirect system calls and are, as such, part of the fixed > UAPI. You'd have to add a parallel optimised set. Sure. We already have the single_open() internal API that is basically a ->readfile() wrapper. Moving this up to the f_op level (no, it's not an i_op, and yes, we do need struct file, but it can be simply allocated on the stack) is a trivial optimization that would let a readfile(2) syscall access that level. No new complexity in that case. Same generally goes for seq_file: seq_readfile() is trivial to implement without messing with current implementation or any existing APIs. > > > > (6) Don't have to create/delete a bunch of sysfs/procfs nodes each time a > > > mount happens or is removed - and since systemd makes much use of > > > mount namespaces and mount propagation, this will create a lot of > > > nodes. > > > > Not true. > > This may not be true if you roll your own special filesystem. It *is* true if > you do it in procfs or sysfs. The files don't exist if you don't create nodes > or attribute tables for them. That's one of the reasons why I opted to roll my own. 
But the ideas therein could be applied to kernfs, if found to be generally
useful.  Nothing magic about that.

> > > The argument for doing this through procfs/sysfs/somemagicfs is that
> > > someone using a shell can just query the magic files using ordinary text
> > > tools, such as cat - and that has merit - but it doesn't solve the
> > > query-by-pathname problem.
> > >
> > > The suggested way around the query-by-pathname problem is to open the
> > > target file O_PATH and then look in a magic directory under procfs
> > > corresponding to the fd number to see a set of attribute files[*] laid
> > > out.  Bash, however, can't open by O_PATH or O_NOFOLLOW as things
> > > stand...
> >
> > Bash doesn't have fsinfo(2) either, so that's not really a good argument.
>
> I never claimed that fsinfo() could be accessed directly from the shell.
> For your proposal, you claimed "immediately usable from all programming
> languages, including scripts".

You are right.  Note, however, that only special files need the O_PATH
handling; regular files and directories can be opened by the shell without
side effects.

In any case, I think neither of us is going to convince the other, so I
guess it's up to Al and Linus to make a decision.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2020-04-02 14:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-18 15:08 [PATCH 00/13] VFS: Filesystem information [ver #19] David Howells
2020-03-18 15:09 ` [PATCH 12/13] fsinfo: Example support for Ext4 " David Howells
2020-03-18 16:05 ` [PATCH 00/13] VFS: Filesystem information " Miklos Szeredi
2020-04-01  5:22   ` Ian Kent
2020-04-01  8:18     ` Miklos Szeredi
2020-04-01  8:27     ` David Howells
2020-04-01  8:37       ` Miklos Szeredi
2020-04-01 12:35         ` Miklos Szeredi
2020-04-01 15:51         ` David Howells
2020-04-02  1:38           ` Ian Kent
2020-04-02 14:14             ` Karel Zak
2020-03-19 10:37 ` David Howells
2020-03-19 12:36   ` Miklos Szeredi