From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: David Howells <dhowells@redhat.com>,
Al Viro <viro@zeniv.linux.org.uk>, Ian Kent <raven@themaw.net>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-api@vger.kernel.org,
linux-block <linux-block@vger.kernel.org>,
keyrings@vger.kernel.org,
LSM List <linux-security-module@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
Jan Kara <jack@suse.cz>
Subject: Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
Date: Wed, 29 May 2019 16:25:04 +0200
Message-ID: <20190529142504.GC32147@quack2.suse.cz>
In-Reply-To: <CAOQ4uxjC1M7jwjd9zSaSa6UW2dbEjc+ZbFSo7j9F1YHAQxQ8LQ@mail.gmail.com>

On Wed 29-05-19 09:33:35, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> >
> >
> > Hi Al,
> >
> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> >
> > (1) Mount topology events, such as mounting, unmounting, mount expiry,
> > mount reconfiguration.
> >
> > (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> > errors (not complete yet).
> >
> > (3) Block layer events, such as I/O errors.
> >
> > (4) Key/keyring events, such as creating, linking and removal of keys.
> >
> > One of the reasons for this is so that we can remove the issue of processes
> > having to repeatedly and regularly scan /proc/mounts, which has proven to
> > be a system performance problem. To further aid this, the fsinfo() syscall
> > on which this patch series depends, provides a way to access superblock and
> > mount information in binary form without the need to parse /proc/mounts.
> >
> >
> > Design decisions:
> >
> > (1) A misc chardev is used to create and open a ring buffer:
> >
> >     fd = open("/dev/watch_queue", O_RDWR);
> >
> > which is then configured and mmap'd into userspace:
> >
> >     ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> >     ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> >     buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> >                MAP_SHARED, fd, 0);
> >
> > The fd cannot be read or written (though there is a facility to use
> > write to inject records for debugging) and userspace just pulls data
> > directly out of the buffer.
> >
> > (2) The ring index pointers are stored inside the ring and are thus
> > accessible to userspace. Userspace should only update the tail
> > pointer and never the head pointer or risk breaking the buffer. The
> > kernel checks that the pointers appear valid before trying to use
> > them. A 'skip' record is maintained around the pointers.
> >
> > (3) poll() can be used to wait for data to appear in the buffer.
> >
> > (4) Records in the buffer are binary, typed and have a length so that they
> > can be of varying size.
> >
> > This means that multiple heterogeneous sources can share a common
> > buffer. Tags may be specified when a watchpoint is created to help
> > distinguish the sources.
> >
> > (5) The queue is reusable as there are 16 million types available, of
> > which I've used 4, so there is scope for others to be used.
> >
> > (6) Records are filterable as types have up to 256 subtypes that can be
> > individually filtered. Other filtration is also available.
> >
> > (7) Each time the buffer is opened, a new buffer is created - this means
> > that there's no interference between watchers.
> >
> > (8) When recording a notification, the kernel will not sleep, but will
> > rather mark a queue as overrun if there's insufficient space, thereby
> > avoiding userspace causing the kernel to hang.
> >
> > (9) The 'watchpoint' should be specific where possible, meaning that you
> > specify the object that you want to watch.
> >
> > (10) The buffer is created and then watchpoints are attached to it, using
> > one of:
> >
> >     keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> >     mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> >     sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> >
> > where in all three cases, fd indicates the queue and the number after
> > is a tag between 0 and 255.
> >
> > (11) The watch must be removed if either the watch buffer is destroyed or
> > the watched object is destroyed.
> >
> >
> > Things I want to avoid:
> >
> > (1) Introducing features that make the core VFS dependent on the network
> > stack or networking namespaces (i.e. usage of netlink).
> >
> > (2) Dumping all this stuff into dmesg and having a daemon that sits there
> > parsing the output and distributing it, as this then puts the
> > responsibility for security into userspace and makes handling
> > namespaces tricky. Further, dmesg might not exist or might be
> > inaccessible inside a container.
> >
> > (3) Letting users see events they shouldn't be able to see.
> >
> >
> > Further things that could be considered:
> >
> > (1) Adding a keyctl call to allow a watch on a keyring to be extended to
> > "children" of that keyring, such that the watch is removed from the
> > child if it is unlinked from the keyring.
> >
> > (2) Adding a global superblock event queue.
> >
> > (3) Propagating watches to child superblocks over automounts.
> >
>
> David,
>
> I am interested to know how you envision filesystem notifications would
> look with this interface.
>
> fanotify can certainly benefit from providing a ring buffer interface to read
> events.
>
> From what I have seen, a common practice of users is to monitor mounts
> (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> It'd be good if those users can use a single watch mechanism/API for
> watching the mount namespace and filesystem events within mounts.
>
> A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> It provides users with two completely different mechanisms to watch error
> and filesystem events. That is generally not a good thing to have.
>
> I am not asking that you implement fs_notify() before merging sb_notify()
> and I understand that you have a use case for sb_notify().
> I am asking that you show me the path towards a unified API (what a
> typical program would look like), so that we know before merging your
> new API that it could be extended to accommodate fsnotify events
> where the final result will look wholesome to users.

Are you sure we want to combine notification about file changes etc. with
administrator-type notifications about the filesystem? To me these two
sound like rather different (although sometimes related) things.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR