From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E24ECC28CC1 for ; Wed, 29 May 2019 14:25:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id BBD9123A41 for ; Wed, 29 May 2019 14:25:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726701AbfE2OZH (ORCPT ); Wed, 29 May 2019 10:25:07 -0400 Received: from mx2.suse.de ([195.135.220.15]:41902 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726118AbfE2OZH (ORCPT ); Wed, 29 May 2019 10:25:07 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 10E91AF42; Wed, 29 May 2019 14:25:05 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 8E7301E3F53; Wed, 29 May 2019 16:25:04 +0200 (CEST) Date: Wed, 29 May 2019 16:25:04 +0200 From: Jan Kara To: Amir Goldstein Cc: David Howells , Al Viro , Ian Kent , linux-fsdevel , linux-api@vger.kernel.org, linux-block , keyrings@vger.kernel.org, LSM List , linux-kernel , Jan Kara Subject: Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications Message-ID: <20190529142504.GC32147@quack2.suse.cz> References: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Wed 29-05-19 09:33:35, Amir Goldstein wrote: > On Tue, May 28, 2019 at 7:03 PM David Howells wrote: > > > > > > Hi Al, > > > > Here's a set of patches to add a general variable-length notification queue > > concept and to add sources of events for: > > > > (1) Mount topology events, such as mounting, unmounting, mount expiry, > > mount reconfiguration. > > > > (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O > > errors (not complete yet). > > > > (3) Block layer events, such as I/O errors. > > > > (4) Key/keyring events, such as creating, linking and removal of keys. > > > > One of the reasons for this is so that we can remove the issue of processes > > having to repeatedly and regularly scan /proc/mounts, which has proven to > > be a system performance problem. To further aid this, the fsinfo() syscall > > on which this patch series depends, provides a way to access superblock and > > mount information in binary form without the need to parse /proc/mounts. > > > > > > Design decisions: > > > > (1) A misc chardev is used to create and open a ring buffer: > > > > fd = open("/dev/watch_queue", O_RDWR); > > > > which is then configured and mmap'd into userspace: > > > > ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE); > > ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter); > > buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE, > > MAP_SHARED, fd, 0); > > > > The fd cannot be read or written (though there is a facility to use > > write to inject records for debugging) and userspace just pulls data > > directly out of the buffer. > > > > (2) The ring index pointers are stored inside the ring and are thus > > accessible to userspace. Userspace should only update the tail > > pointer and never the head pointer or risk breaking the buffer. The > > kernel checks that the pointers appear valid before trying to use > > them. A 'skip' record is maintained around the pointers. > > > > (3) poll() can be used to wait for data to appear in the buffer. > > > > (4) Records in the buffer are binary, typed and have a length so that they > > can be of varying size. > > > > This means that multiple heterogeneous sources can share a common > > buffer. Tags may be specified when a watchpoint is created to help > > distinguish the sources. > > > > (5) The queue is reusable as there are 16 million types available, of > > which I've used 4, so there is scope for others to be used. > > > > (6) Records are filterable as types have up to 256 subtypes that can be > > individually filtered. Other filtration is also available. > > > > (7) Each time the buffer is opened, a new buffer is created - this means > > that there's no interference between watchers. > > > > (8) When recording a notification, the kernel will not sleep, but will > > rather mark a queue as overrun if there's insufficient space, thereby > > avoiding userspace causing the kernel to hang. > > > > (9) The 'watchpoint' should be specific where possible, meaning that you > > specify the object that you want to watch. > > > > (10) The buffer is created and then watchpoints are attached to it, using > > one of: > > > > keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01); > > mount_notify(AT_FDCWD, "/", 0, fd, 0x02); > > sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03); > > > > where in all three cases, fd indicates the queue and the number after > > is a tag between 0 and 255. > > > > (11) The watch must be removed if either the watch buffer is destroyed or > > the watched object is destroyed. > > > > > > Things I want to avoid: > > > > (1) Introducing features that make the core VFS dependent on the network > > stack or networking namespaces (ie. usage of netlink). > > > > (2) Dumping all this stuff into dmesg and having a daemon that sits there > > parsing the output and distributing it as this then puts the > > responsibility for security into userspace and makes handling > > namespaces tricky. Further, dmesg might not exist or might be > > inaccessible inside a container. > > > > (3) Letting users see events they shouldn't be able to see. > > > > > > Further things that could be considered: > > > > (1) Adding a keyctl call to allow a watch on a keyring to be extended to > > "children" of that keyring, such that the watch is removed from the > > child if it is unlinked from the keyring. > > > > (2) Adding global superblock event queue. > > > > (3) Propagating watches to child superblock over automounts. > > > > David, > > I am interested to know how you envision filesystem notifications would > look with this interface. > > fanotify can certainly benefit from providing a ring buffer interface to read > events. > > From what I have seen, a common practice of users is to monitor mounts > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically. > It'd be good if those users can use a single watch mechanism/API for > watching the mount namespace and filesystem events within mounts. > > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM. > It provides users with two complete different mechanisms to watch error > and filesystem events. That is generally not a good thing to have. > > I am not asking that you implement fs_notify() before merging sb_notify() > and I understand that you have a use case for sb_notify(). > I am asking that you show me the path towards a unified API (how a > typical program would look like), so that we know before merging your > new API that it could be extended to accommodate fsnotify events > where the final result will look wholesome to users. Are you sure we want to combine notification about file changes etc. with administrator-type notifications about the filesystem? To me these two sound like rather different (although sometimes related) things. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Date: Wed, 29 May 2019 14:25:04 +0000 Subject: Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications Message-Id: <20190529142504.GC32147@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit List-Id: References: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk> In-Reply-To: To: Amir Goldstein Cc: David Howells , Al Viro , Ian Kent , linux-fsdevel , linux-api@vger.kernel.org, linux-block , keyrings@vger.kernel.org, LSM List , linux-kernel , Jan Kara On Wed 29-05-19 09:33:35, Amir Goldstein wrote: > On Tue, May 28, 2019 at 7:03 PM David Howells wrote: > > > > > > Hi Al, > > > > Here's a set of patches to add a general variable-length notification queue > > concept and to add sources of events for: > > > > (1) Mount topology events, such as mounting, unmounting, mount expiry, > > mount reconfiguration. > > > > (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O > > errors (not complete yet). > > > > (3) Block layer events, such as I/O errors. > > > > (4) Key/keyring events, such as creating, linking and removal of keys. > > > > One of the reasons for this is so that we can remove the issue of processes > > having to repeatedly and regularly scan /proc/mounts, which has proven to > > be a system performance problem. To further aid this, the fsinfo() syscall > > on which this patch series depends, provides a way to access superblock and > > mount information in binary form without the need to parse /proc/mounts. > > > > > > Design decisions: > > > > (1) A misc chardev is used to create and open a ring buffer: > > > > fd = open("/dev/watch_queue", O_RDWR); > > > > which is then configured and mmap'd into userspace: > > > > ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE); > > ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter); > > buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE, > > MAP_SHARED, fd, 0); > > > > The fd cannot be read or written (though there is a facility to use > > write to inject records for debugging) and userspace just pulls data > > directly out of the buffer. > > > > (2) The ring index pointers are stored inside the ring and are thus > > accessible to userspace. Userspace should only update the tail > > pointer and never the head pointer or risk breaking the buffer. The > > kernel checks that the pointers appear valid before trying to use > > them. A 'skip' record is maintained around the pointers. > > > > (3) poll() can be used to wait for data to appear in the buffer. > > > > (4) Records in the buffer are binary, typed and have a length so that they > > can be of varying size. > > > > This means that multiple heterogeneous sources can share a common > > buffer. Tags may be specified when a watchpoint is created to help > > distinguish the sources. > > > > (5) The queue is reusable as there are 16 million types available, of > > which I've used 4, so there is scope for others to be used. > > > > (6) Records are filterable as types have up to 256 subtypes that can be > > individually filtered. Other filtration is also available. > > > > (7) Each time the buffer is opened, a new buffer is created - this means > > that there's no interference between watchers. > > > > (8) When recording a notification, the kernel will not sleep, but will > > rather mark a queue as overrun if there's insufficient space, thereby > > avoiding userspace causing the kernel to hang. > > > > (9) The 'watchpoint' should be specific where possible, meaning that you > > specify the object that you want to watch. > > > > (10) The buffer is created and then watchpoints are attached to it, using > > one of: > > > > keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01); > > mount_notify(AT_FDCWD, "/", 0, fd, 0x02); > > sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03); > > > > where in all three cases, fd indicates the queue and the number after > > is a tag between 0 and 255. > > > > (11) The watch must be removed if either the watch buffer is destroyed or > > the watched object is destroyed. > > > > > > Things I want to avoid: > > > > (1) Introducing features that make the core VFS dependent on the network > > stack or networking namespaces (ie. usage of netlink). > > > > (2) Dumping all this stuff into dmesg and having a daemon that sits there > > parsing the output and distributing it as this then puts the > > responsibility for security into userspace and makes handling > > namespaces tricky. Further, dmesg might not exist or might be > > inaccessible inside a container. > > > > (3) Letting users see events they shouldn't be able to see. > > > > > > Further things that could be considered: > > > > (1) Adding a keyctl call to allow a watch on a keyring to be extended to > > "children" of that keyring, such that the watch is removed from the > > child if it is unlinked from the keyring. > > > > (2) Adding global superblock event queue. > > > > (3) Propagating watches to child superblock over automounts. > > > > David, > > I am interested to know how you envision filesystem notifications would > look with this interface. > > fanotify can certainly benefit from providing a ring buffer interface to read > events. > > From what I have seen, a common practice of users is to monitor mounts > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically. > It'd be good if those users can use a single watch mechanism/API for > watching the mount namespace and filesystem events within mounts. > > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM. > It provides users with two complete different mechanisms to watch error > and filesystem events. That is generally not a good thing to have. > > I am not asking that you implement fs_notify() before merging sb_notify() > and I understand that you have a use case for sb_notify(). > I am asking that you show me the path towards a unified API (how a > typical program would look like), so that we know before merging your > new API that it could be extended to accommodate fsnotify events > where the final result will look wholesome to users. Are you sure we want to combine notification about file changes etc. with administrator-type notifications about the filesystem? To me these two sound like rather different (although sometimes related) things. Honza -- Jan Kara SUSE Labs, CR