From: David Howells <dhowells@redhat.com>
To: viro@zeniv.linux.org.uk
Cc: dhowells@redhat.com, raven@themaw.net, linux-fsdevel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-block@vger.kernel.org,
    keyrings@vger.kernel.org, linux-security-module@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
Date: Tue, 28 May 2019 17:01:47 +0100
Message-ID: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk>

Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Block layer events, such as I/O errors.

 (4) Key/keyring events, such as creating, linking and removal of keys.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to be
a system performance problem.  To further aid this, the fsinfo() syscall on
which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

David
---
David Howells (7):
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      block: Add block layer notifications
      Add sample notification program


 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  311 +++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3
 arch/x86/entry/syscalls/syscall_64.tbl |    3
 block/Kconfig                          |    9
 block/Makefile                         |    1
 block/blk-core.c                       |   28 +
 block/blk-notify.c                     |   83 +++
 drivers/misc/Kconfig                   |   13
 drivers/misc/Makefile                  |    1
 drivers/misc/watch_queue.c             |  877 ++++++++++++++++++++++++++++++++
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1
 fs/fsinfo.c                            |   12
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  178 ++++++
 fs/namespace.c                         |    9
 fs/super.c                             |  116 ++++
 include/linux/blkdev.h                 |   10
 include/linux/dcache.h                 |    1
 include/linux/fs.h                     |   78 +++
 include/linux/key.h                    |    4
 include/linux/lsm_hooks.h              |   15 +
 include/linux/security.h               |   14 +
 include/linux/syscalls.h               |    5
 include/linux/watch_queue.h            |   86 +++
 include/uapi/linux/fsinfo.h            |   10
 include/uapi/linux/keyctl.h            |    1
 include/uapi/linux/watch_queue.h       |  185 +++++++
 kernel/sys_ni.c                        |    6
 mm/interval_tree.c                     |    2
 mm/memory.c                            |    1
 samples/Kconfig                        |    6
 samples/Makefile                       |    1
 samples/vfs/test-fsinfo.c              |   13
 samples/watch_queue/Makefile           |    9
 samples/watch_queue/watch_test.c       |  284 ++++++++++
 security/keys/Kconfig                  |   10
 security/keys/compat.c                 |    2
 security/keys/gc.c                     |    5
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   88 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4
 security/security.c                    |    9
 46 files changed, 2652 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 block/blk-notify.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c
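
Design decisions (2), (4) and (8) in the cover letter can be illustrated with a
small single-threaded model of such a ring.  This is only a sketch: struct note,
struct ring, ring_post() and ring_read() are hypothetical names, the record
layout is simplified and is not the one defined in
include/uapi/linux/watch_queue.h, and the memory barriers a real shared mapping
would need are omitted.  The producer never blocks and flags an overrun when a
record does not fit; the consumer validates the length and advances only the
tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All names and the layout below are hypothetical, for illustration only. */
#define RING_SLOTS 16u                /* ring size in 8-byte slots, power of 2 */
#define RING_MASK  (RING_SLOTS - 1u)

struct note {                         /* record header, exactly one slot */
	uint32_t type;                /* event source, e.g. mount or key */
	uint8_t  subtype;             /* event within that source */
	uint8_t  tag;                 /* caller-chosen tag, 0..255 */
	uint16_t nslots;              /* record length in slots, header included */
};
static_assert(sizeof(struct note) == sizeof(uint64_t), "header must fill a slot");

struct ring {
	uint32_t head;                /* producer index, free-running */
	uint32_t tail;                /* consumer index, free-running */
	bool     overrun;             /* set when a record had to be dropped */
	uint64_t slot[RING_SLOTS];
};

/* Producer side: never waits; marks the ring overrun if there is no room. */
static bool ring_post(struct ring *r, const struct note *hdr,
		      const uint64_t *payload)
{
	uint32_t used = r->head - r->tail;

	if (hdr->nslots == 0 || hdr->nslots > RING_SLOTS - used) {
		r->overrun = true;
		return false;
	}
	memcpy(&r->slot[r->head & RING_MASK], hdr, sizeof(*hdr));
	for (uint32_t i = 1; i < hdr->nslots; i++)
		r->slot[(r->head + i) & RING_MASK] = payload[i - 1];
	r->head += hdr->nslots;       /* publish the record */
	return true;
}

/* Consumer side: copies one record out and advances only the tail. */
static bool ring_read(struct ring *r, struct note *hdr,
		      uint64_t *payload, uint32_t max_payload)
{
	uint32_t avail = r->head - r->tail;

	if (avail == 0)
		return false;         /* ring is empty */
	memcpy(hdr, &r->slot[r->tail & RING_MASK], sizeof(*hdr));
	if (hdr->nslots == 0 || hdr->nslots > avail ||
	    hdr->nslots - 1u > max_payload)
		return false;         /* corrupt length or caller buffer too small */
	for (uint32_t i = 1; i < hdr->nslots; i++)
		payload[i - 1] = r->slot[(r->tail + i) & RING_MASK];
	r->tail += hdr->nslots;       /* release the space to the producer */
	return true;
}
```

Because each record carries its own length, records from different sources can
be of different sizes and still be walked in order, and the tag lets a consumer
tell watchpoints apart without parsing the payload.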