linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
@ 2019-05-28 16:01 David Howells
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
                   ` (10 more replies)
  0 siblings, 11 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:01 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel


Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Block layer events, such as I/O errors.

 (4) Key/keyring events, such as creating, linking and removal of keys.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

David
---
David Howells (7):
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      block: Add block layer notifications
      Add sample notification program


 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  311 +++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/Makefile                         |    1 
 block/blk-core.c                       |   28 +
 block/blk-notify.c                     |   83 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  877 ++++++++++++++++++++++++++++++++
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  178 ++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  116 ++++
 include/linux/blkdev.h                 |   10 
 include/linux/dcache.h                 |    1 
 include/linux/fs.h                     |   78 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   15 +
 include/linux/security.h               |   14 +
 include/linux/syscalls.h               |    5 
 include/linux/watch_queue.h            |   86 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  185 +++++++
 kernel/sys_ni.c                        |    6 
 mm/interval_tree.c                     |    2 
 mm/memory.c                            |    1 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  284 ++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   88 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |    9 
 46 files changed, 2652 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 block/blk-notify.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
@ 2019-05-28 16:01 ` David Howells
  2019-05-28 16:26   ` Greg KH
                     ` (3 more replies)
  2019-05-28 16:02 ` [PATCH 2/7] keys: Add a notification facility David Howells
                   ` (9 subsequent siblings)
  10 siblings, 4 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:01 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Implement a misc device that implements a general notification queue as a
ring buffer that can be mmap()'d from userspace.

The way this is done is:

 (1) An application opens the device and indicates the size of the ring
     buffer that it wants to reserve in pages (this can only be set once):

	fd = open("/dev/watch_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_NR_PAGES, nr_of_pages);

 (2) The application should then map the pages that the device has
     reserved.  Each instance of the device created by open() allocates
     separate pages so that maps of different fds don't interfere with one
     another.  Multiple mmap() calls on the same fd, however, will all work
     together.

	page_size = sysconf(_SC_PAGESIZE);
	mapping_size = nr_of_pages * page_size;
	char *buf = mmap(NULL, mapping_size, PROT_READ|PROT_WRITE,
			 MAP_SHARED, fd, 0);

The ring is divided into 8-byte slots.  Entries written into the ring are
variable size and can use between 1 and 63 slots.  A special entry is
maintained in the first two slots of the ring that contains the head and
tail pointers.  This is skipped when the ring wraps round.  Note that
multislot entries, therefore, aren't allowed to be broken over the end of
the ring, but instead "skip" entries are inserted to pad out the buffer.

Each entry has a 1-slot header that describes it:

	struct watch_notification {
		__u32	type:24;
		__u32	subtype:8;
		__u32	info;
	};

The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer, type-specific flags and meta
flags, such as an overrun indicator.

Supplementary data, such as the key ID that generated an event, are
attached in additional slots.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/watch_queue.rst    |  311 +++++++++++++
 drivers/misc/Kconfig             |   13 +
 drivers/misc/Makefile            |    1 
 drivers/misc/watch_queue.c       |  877 ++++++++++++++++++++++++++++++++++++++
 include/linux/lsm_hooks.h        |   15 +
 include/linux/security.h         |   14 +
 include/linux/watch_queue.h      |   86 ++++
 include/uapi/linux/watch_queue.h |   82 ++++
 mm/interval_tree.c               |    2 
 mm/memory.c                      |    1 
 security/security.c              |    9 
 11 files changed, 1411 insertions(+)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h

diff --git a/Documentation/watch_queue.rst b/Documentation/watch_queue.rst
new file mode 100644
index 000000000000..01fe937092d6
--- /dev/null
+++ b/Documentation/watch_queue.rst
@@ -0,0 +1,311 @@
+============================
+Mappable notifications queue
+============================
+
+This is a misc device that acts as a mapped ring buffer by which userspace can
+receive notifications from the kernel.  This is can be used in conjunction
+with::
+
+  * Key/keyring notifications
+
+  * Mount topology change notifications
+
+  * Superblock event notifications
+
+
+The notifications buffers can be enabled by:
+
+	"Device Drivers"/"Misc devices"/"Mappable notification queue"
+	(CONFIG_WATCH_QUEUE)
+
+This document has the following sections:
+
+.. contents:: :local:
+
+
+Overview
+========
+
+This facility appears as a misc device file that is opened and then mapped and
+polled.  Each time it is opened, it creates a new buffer specific to the
+returned file descriptor.  Then, when the opening process sets watches, it
+indicates that particular buffer it wants notifications from that watch to be
+written into.  Note that there are no read() and write() methods (except for
+debugging).  The user is expected to access the ring directly and to use poll
+to wait for new data.
+
+If a watch is in place, notifications are only written into the buffer if the
+filter criteria are passed and if there's sufficient space available in the
+ring.  If neither of those is so, a notification will be discarded.  In the
+latter case, an overrun indicator will also be set.
+
+Note that when producing a notification, the kernel does not wait for the
+consumers to collect it, but rather just continues on.  This means that
+notifications can be generated whilst spinlocks are held and also protects the
+kernel from being held up indefinitely by a userspace malfunction.
+
+As far as the ring goes, the head index belongs to the kernel and the tail
+index belongs to userspace.  The kernel will refuse to write anything if the
+tail index becomes invalid.  Userspace *must* use appropriate memory barriers
+between reading or updating the tail index and reading the ring.
+
+
+Record Structure
+================
+
+Notification records in the ring may occupy a variable number of slots within
+the buffer, beginning with a 1-slot header::
+
+	struct watch_notification {
+		__u16	type;
+		__u16	subtype;
+		__u32	info;
+	};
+
+"type" indicates the source of the notification record and "subtype" indicates
+the type of record from that source (see the Watch Sources section below).  The
+type may also be "WATCH_TYPE_META".  This is a special record type generated
+internally by the watch queue driver itself.  There are two subtypes, one of
+which indicates records that should be just skipped (padding or metadata):
+
+    * WATCH_META_SKIP_NOTIFICATION
+    * WATCH_META_REMOVAL_NOTIFICATION
+
+The former indicates a record that should just be skipped and the latter
+indicates that an object on which a watchpoint was installed was removed or
+destroyed.
+
+"info" indicates a bunch of things, including:
+
+  * The length of the record (mask with WATCH_INFO_LENGTH).  This indicates the
+    size of the record, which may be between 1 and 63 slots.  Note that this is
+    placed appropriately within the info value so that no shifting is required
+    to convert number of occupied slots to byte length.
+
+  * The watchpoint ID (mask with WATCH_INFO_ID).  This indicates that caller's
+    ID of the watchpoint, which may be between 0 and 255.  Multiple watchpoints
+    may share a queue, and this provides a means to distinguish them.
+
+  * A buffer overrun flag (WATCH_INFO_OVERRUN flag).  If this is set in a
+    notification record, some of the preceding records were discarded.
+
+  * An ENOMEM-loss flag (WATCH_INFO_ENOMEM flag).  This is set to indicate that
+    an event was lost to ENOMEM.
+
+  * A recursive-change flag (WATCH_INFO_RECURSIVE flag).  This is set to
+    indicate that the change that happened was recursive - for instance
+    changing the attributes on an entire mount subtree.
+
+  * An exact-match flag (WATCH_INFO_IN_SUBTREE flag).  This is set if the event
+    didn't happen exactly at the watchpoint, but rather somewhere in the
+    subtree thereunder.
+
+  * Some type-specific flags (WATCH_INFO_TYPE_FLAGS).  These are set by the
+    notification producer to indicate some meaning to the kernel.
+
+Everything in info apart from the length can be used for filtering.
+
+
+Ring Structure
+==============
+
+The ring is divided into 8-byte slots.  The caller uses an ioctl() to set the
+size of the ring after opening and this must be a power-of-2 multiple of the
+system page size (so that the mask can be used with AND).
+
+The head and tail indices are stored in the first two slots in the ring, which
+are marked out as a skippable entry::
+
+	struct watch_queue_buffer {
+		union {
+			struct {
+				struct watch_notification watch;
+				volatile __u32	head;
+				volatile __u32	tail;
+				__u32		mask;
+			} meta;
+			struct watch_notification slots[0];
+		};
+	};
+
+In "meta.watch", type will be set to WATCH_TYPE_META and subtype to
+WATCH_META_SKIP_NOTIFICATION so that anyone processing the buffer will just
+skip this record.  Also, because this record is here, records cannot wrap round
+the end of the buffer, so a skippable padding element will be inserted at the
+end of the buffer if needed.  Thus the contents of a notification record in the
+buffer are always contiguous.
+
+"meta.mask" is an AND'able mask to turn the index counters into slots array
+indices.
+
+The buffer is empty if "meta.head" == "meta.tail".
+
+[!] NOTE that the ring indices "meta.head" and "meta.tail" are indices into
+"slots[]" not byte offsets into the buffer.
+
+[!] NOTE that userspace must never change the head pointer.  This belongs to
+the kernel and will be updated by that.  The kernel will never change the tail
+pointer.
+
+[!] NOTE that userspace must never AND-off the tail pointer before updating it,
+but should just keep adding to it and letting it wrap naturally.  The value
+*should* be masked off when used as an index into slots[].
+
+[!] NOTE that if the distance between head and tail becomes too great, the
+kernel will assume the buffer is full and write no more until the issue is
+resolved.
+
+
+Watch Sources
+=============
+
+Any particular buffer can be fed from multiple sources.  Sources include:
+
+  * WATCH_TYPE_MOUNT_NOTIFY
+
+    Notifications of this type indicate mount tree topology changes and mount
+    attribute changes.  A watchpoint can be set on a particular file or
+    directory and notifications from the path subtree rooted at that point will
+    be intercepted.
+
+  * WATCH_TYPE_SB_NOTIFY
+
+    Notifications of this type indicate superblock events, such as quota limits
+    being hit, I/O errors being produced or network server loss/reconnection.
+    Watchpoints of this type are set directly on superblocks.
+
+  * WATCH_TYPE_KEY_NOTIFY
+
+    Notifications of this type indicate changes to keys and keyrings, including
+    the changes of keyring contents or the attributes of keys.
+
+    See Documentation/security/keys/core.rst for more information.
+
+  * WATCH_TYPE_BLOCK_NOTIFY
+
+    Notifications of this type indicate block layer events, such as I/O errors
+    or temporary link loss.  Watchpoints of this type are set on a global
+    queue.
+
+
+Configuring Watchpoints
+=======================
+
+When a watchpoint is set up, the caller assigns an ID and can set filtering
+parameters.  The following structure is filled out and passed to the
+watchpoint creation system call::
+
+	struct watch_notification_filter {
+		__u64	subtype_filter[4];
+		__u32	info_filter;
+		__u32	info_mask;
+		__u32	info_id;
+		__u32	__reserved;
+	};
+
+"subtype_filter" is a bitmask indicating the subtypes that are of interest.  In
+this version of the structure, only the first 256 subtypes are supported.  Bit
+0 of subtype_filter[0] corresponds to subtype 0, bit 1 to subtype 1, and so on.
+
+"info_filter" and "info_mask" act as a filter on the info field of the
+notification record.  The notification is only written into the buffer if::
+
+	(watch.info & info_mask) == info_filter
+
+This can be used, for example, to ignore events that are not exactly on the
+watched point in a mount tree by specifying WATCH_INFO_IN_SUBTREE must be 0.
+
+"info_id" is OR'd into watch.info.  This indicates the watchpoint ID in the top
+8 bits.  All bits outside of WATCH_INFO_ID must be 0.
+
+"__reserved" must be 0.
+
+If the pointer to this structure is NULL, this indicates to the system call
+that the watchpoint should be removed.
+
+
+Polling
+=======
+
+The file descriptor that holds the buffer may be used with poll() and similar.
+POLLIN and POLLRDNORM are set if the buffer indices differ.  POLLERR is set if
+the buffer indices are further apart than the size of the buffer.  Wake-up
+events are only generated if the buffer is transitioned from an empty state.
+
+
+Example
+=======
+
+A buffer is created with something like the following::
+
+	fd = open("/dev/watch_queue", O_RDWR);
+
+	#define BUF_SIZE 4
+	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
+
+	page_size = sysconf(_SC_PAGESIZE);
+	buf = mmap(NULL, BUF_SIZE * page_size,
+		   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+It can then be set to receive mount topology change notifications, keyring
+change notifications and superblock notifications::
+
+	memset(&filter, 0, sizeof(filter));
+	filter.subtype_filter[0] = ~0ULL;
+	filter.info_mask	 = WATCH_INFO_IN_SUBTREE;
+	filter.info_filter	 = 0;
+	filter.info_id		 = 0x01000000;
+
+	keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fd, &filter);
+
+	mount_notify(AT_FDCWD, "/", 0, fd, &filter);
+
+	sb_notify(AT_FDCWD, "/", 0, fd, &filter);
+
+The notifications can then be consumed by something like the following::
+
+	extern void saw_mount_change(struct watch_notification *n);
+	extern void saw_key_change(struct watch_notification *n);
+
+	static int consumer(int fd, struct watch_queue_buffer *buf)
+	{
+		struct watch_notification *n;
+		struct pollfd p[1];
+		unsigned int head, tail, mask = buf->meta.mask;
+
+		for (;;) {
+			p[0].fd = fd;
+			p[0].events = POLLIN | POLLERR;
+			p[0].revents = 0;
+
+			if (poll(p, 1, -1) == -1 || p[0].revents & POLLERR)
+				goto went_wrong;
+
+			while (head = _atomic_load_acquire(buf->meta.head),
+			       tail = buf->meta.tail,
+			       tail != head
+			       ) {
+				n = &buf->slots[tail & mask];
+				if ((n->info & WATCH_INFO_LENGTH) == 0)
+					goto went_wrong;
+
+				switch (n->type) {
+				case WATCH_TYPE_MOUNT_NOTIFY:
+					saw_mount_change(n);
+					break;
+				case WATCH_TYPE_KEY_NOTIFY:
+					saw_key_change(n);
+					break;
+				}
+
+				tail += (n->info & WATCH_INFO_LENGTH) >> WATCH_LENGTH_SHIFT;
+				_atomic_store_release(buf->meta.tail, tail);
+			}
+		}
+
+	went_wrong:
+		return 0;
+	}
+
+Note the memory barriers when loading the head pointer and storing the tail
+pointer!
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 6a0365b2332c..19668c0ebe03 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -4,6 +4,19 @@
 
 menu "Misc devices"
 
+config WATCH_QUEUE
+	bool "Mappable notification queue"
+	default n
+	depends on MMU
+	help
+	  This is a general notification queue for the kernel to pass events to
+	  userspace through a mmap()'able ring buffer.  It can be used in
+	  conjunction with watches for mount topology change notifications,
+	  superblock change notifications and key/keyring change notifications.
+
+	  Note that in theory this should work fine with NOMMU, but I'm not
+	  sure how to make that work.
+
 config SENSORS_LIS3LV02D
 	tristate
 	depends on INPUT
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index b9affcdaa3d6..bf16acd9f8cc 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -3,6 +3,7 @@
 # Makefile for misc devices that really don't fit anywhere else.
 #
 
+obj-$(CONFIG_WATCH_QUEUE)	+= watch_queue.o
 obj-$(CONFIG_IBM_ASM)		+= ibmasm/
 obj-$(CONFIG_IBMVMC)		+= ibmvmc.o
 obj-$(CONFIG_AD525X_DPOT)	+= ad525x_dpot.o
diff --git a/drivers/misc/watch_queue.c b/drivers/misc/watch_queue.c
new file mode 100644
index 000000000000..39a09ea15d97
--- /dev/null
+++ b/drivers/misc/watch_queue.c
@@ -0,0 +1,877 @@
+/* User-mappable watch queue
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ *
+ * See Documentation/watch_queue.rst
+ */
+
+#define pr_fmt(fmt) "watchq: " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/printk.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/poll.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include <linux/watch_queue.h>
+
+#define DEBUG_WITH_WRITE /* Allow use of write() to record notifications */
+
+MODULE_DESCRIPTION("Watch queue");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+struct watch_type_filter {
+	enum watch_notification_type type;
+	__u32		subtype_filter[1];	/* Bitmask of subtypes to filter on */
+	__u32		info_filter;		/* Filter on watch_notification::info */
+	__u32		info_mask;		/* Mask of relevant bits in info_filter */
+};
+
+struct watch_filter {
+	union {
+		struct rcu_head	rcu;
+		unsigned long	type_filter[2];	/* Bitmask of accepted types */
+	};
+	u32		nr_filters;		/* Number of filters */
+	struct watch_type_filter filters[];
+};
+
+struct watch_queue {
+	struct rcu_head		rcu;
+	struct address_space	mapping;
+	const struct cred	*cred;		/* Creds of the owner of the queue */
+	struct watch_filter __rcu *filter;
+	wait_queue_head_t	waiters;
+	struct hlist_head	watches;	/* Contributory watches */
+	refcount_t		usage;
+	spinlock_t		lock;
+	bool			defunct;	/* T when queues closed */
+	u8			nr_pages;	/* Size of pages[] */
+	u8			flag_next;	/* Flag to apply to next item */
+#ifdef DEBUG_WITH_WRITE
+	u8			debug;
+#endif
+	u32			size;
+	struct watch_queue_buffer *buffer;	/* Pointer to first record */
+
+	/* The mappable pages.  The zeroth page holds the ring pointers. */
+	struct page		**pages;
+};
+
+/**
+ * post_one_notification - Post an event notification to one queue
+ * @wqueue: The watch queue to add the event to.
+ * @n: The notification record to post.
+ * @cred: The credentials to use in security checks.
+ *
+ * Post a notification of an event into an mmap'd queue and let the user know.
+ * Returns true if successful and false on failure (eg. buffer overrun or
+ * userspace mucked up the ring indices).
+ *
+ *
+ * The size of the notification should be set in n->flags & WATCH_LENGTH and
+ * should be in units of sizeof(*n).
+ */
+static bool post_one_notification(struct watch_queue *wqueue,
+				  struct watch_notification *n,
+				  const struct cred *cred)
+{
+	struct watch_queue_buffer *buf = wqueue->buffer;
+	unsigned int metalen = sizeof(buf->meta) / sizeof(buf->slots[0]);
+	unsigned int size = wqueue->size, mask = size - 1;
+	unsigned int len;
+	unsigned int ring_tail, tail, head, used, segment, h;
+
+	if (!buf)
+		return false;
+
+	len = (n->info & WATCH_INFO_LENGTH) >> WATCH_LENGTH_SHIFT;
+	if (len == 0)
+		return false;
+
+	spin_lock_bh(&wqueue->lock); /* Protect head pointer */
+
+	if (wqueue->defunct ||
+	    security_post_notification(wqueue->cred, cred, n) < 0)
+		goto out;
+
+	ring_tail = READ_ONCE(buf->meta.tail);
+	head = READ_ONCE(buf->meta.head);
+	used = head - ring_tail;
+
+	/* Check to see if userspace mucked up the pointers */
+	if (used >= size)
+		goto overrun;
+	tail = ring_tail & mask;
+	if (tail > 0 && tail < metalen)
+		goto overrun;
+
+	h = head & mask;
+	if (h >= tail) {
+		/* Head is at or after tail in the buffer.  There may then be
+		 * two segments: one to the end of buffer and one at the
+		 * beginning of the buffer between the metadata block and the
+		 * tail pointer.
+		 */
+		segment = size - h;
+		if (len > segment) {
+			/* Not enough space in the post-head segment; we need
+			 * to wrap.  When wrapping, we will have to skip the
+			 * metadata at the beginning of the buffer.
+			 */
+			if (len > tail - metalen)
+				goto overrun;
+
+			/* Fill the space at the end of the page */
+			buf->slots[h].type	= WATCH_TYPE_META;
+			buf->slots[h].subtype	= WATCH_META_SKIP_NOTIFICATION;
+			buf->slots[h].info	= segment << WATCH_LENGTH_SHIFT;
+			head += segment;
+			h = 0;
+			if (h >= tail)
+				goto overrun;
+		}
+	}
+
+	if (h == 0) {
+		/* Reset and skip the header metadata */
+		buf->meta.watch.type = WATCH_TYPE_META;
+		buf->meta.watch.subtype = WATCH_META_SKIP_NOTIFICATION;
+		buf->meta.watch.info = metalen << WATCH_LENGTH_SHIFT;
+		head += metalen;
+		h = metalen;
+		if (h >= tail)
+			goto overrun;
+	}
+
+	if (h < tail) {
+		/* Head is before tail in the buffer.  There may be one segment
+		 * between the two, but we may need to skip the metadata block.
+		 */
+		segment = tail - h;
+		if (len > segment)
+			goto overrun;
+	}
+
+	n->info |= wqueue->flag_next;
+	wqueue->flag_next = 0;
+	memcpy(buf->slots + h, n, len * sizeof(buf->slots[0]));
+	head += len;
+
+	smp_store_release(&buf->meta.head, head);
+	spin_unlock_bh(&wqueue->lock);
+	if (used == 0)
+		wake_up(&wqueue->waiters);
+	return true;
+
+overrun:
+	wqueue->flag_next = WATCH_INFO_OVERRUN;
+out:
+	spin_unlock_bh(&wqueue->lock);
+	return false;
+}
+
+/*
+ * Apply filter rules to a notification.
+ */
+static bool filter_watch_notification(const struct watch_filter *wf,
+				      const struct watch_notification *n)
+{
+	const struct watch_type_filter *wt;
+	int i;
+
+	if (!test_bit(n->type, wf->type_filter))
+		return false;
+
+	for (i = 0; i < wf->nr_filters; i++) {
+		wt = &wf->filters[i];
+		if (n->type == wt->type &&
+		    ((1U << n->subtype) & wt->subtype_filter[0]) &&
+		    (n->info & wt->info_mask) == wt->info_filter)
+			return true;
+	}
+
+	return false; /* If there is a filter, the default is to reject. */
+}
+
+/**
+ * __post_watch_notification - Post an event notification
+ * @wlist: The watch list to post the event to.
+ * @n: The notification record to post.
+ * @cred: The creds of the process that triggered the notification.
+ * @id: The ID to match on the watch.
+ *
+ * Post a notification of an event into a set of watch queues and let the users
+ * know.
+ *
+ * If @n is NULL then WATCH_INFO_LENGTH will be set on the next event posted.
+ *
+ * The size of the notification should be set in n->info & WATCH_INFO_LENGTH and
+ * should be in units of sizeof(*n).
+ */
+void __post_watch_notification(struct watch_list *wlist,
+			       struct watch_notification *n,
+			       const struct cred *cred,
+			       u64 id)
+{
+	const struct watch_filter *wf;
+	struct watch_queue *wqueue;
+	struct watch *watch;
+
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(watch, &wlist->watchers, list_node) {
+		if (watch->id != id)
+			continue;
+		n->info &= ~(WATCH_INFO_ID | WATCH_INFO_OVERRUN);
+		n->info |= watch->info_id;
+
+		wqueue = rcu_dereference(watch->queue);
+		wf = rcu_dereference(wqueue->filter);
+		if (wf && !filter_watch_notification(wf, n))
+			continue;
+
+		post_one_notification(wqueue, n, cred);
+	}
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(__post_watch_notification);
+
+/*
+ * Allow the queue to be polled.
+ */
+static __poll_t watch_queue_poll(struct file *file, poll_table *wait)
+{
+	struct watch_queue *wqueue = file->private_data;
+	struct watch_queue_buffer *buf = wqueue->buffer;
+	unsigned int head, tail;
+	__poll_t mask = 0;
+
+	poll_wait(file, &wqueue->waiters, wait);
+
+	head = READ_ONCE(buf->meta.head);
+	tail = READ_ONCE(buf->meta.tail);
+	if (head != tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (head - tail > wqueue->size)
+		mask |= EPOLLERR;
+	return mask;
+}
+
+static int watch_queue_set_page_dirty(struct page *page)
+{
+	SetPageDirty(page);
+	return 0;
+}
+
+static const struct address_space_operations watch_queue_aops = {
+	.set_page_dirty	= watch_queue_set_page_dirty,
+};
+
+static vm_fault_t watch_queue_fault(struct vm_fault *vmf)
+{
+	struct watch_queue *wqueue = vmf->vma->vm_file->private_data;
+	struct page *page;
+
+	page = wqueue->pages[vmf->pgoff];
+	get_page(page);
+	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
+		put_page(page);
+		return VM_FAULT_RETRY;
+	}
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void watch_queue_map_pages(struct vm_fault *vmf,
+				  pgoff_t start_pgoff, pgoff_t end_pgoff)
+{
+	struct watch_queue *wqueue = vmf->vma->vm_file->private_data;
+	struct page *page;
+
+	rcu_read_lock();
+
+	do {
+		page = wqueue->pages[start_pgoff];
+		if (trylock_page(page)) {
+			vm_fault_t ret;
+			get_page(page);
+			ret = alloc_set_pte(vmf, NULL, page);
+			if (ret != 0)
+				put_page(page);
+
+			unlock_page(page);
+		}
+	} while (++start_pgoff < end_pgoff);
+
+	rcu_read_unlock();
+}
+
+static const struct vm_operations_struct watch_queue_vm_ops = {
+	.fault		= watch_queue_fault,
+	.map_pages	= watch_queue_map_pages,
+};
+
+/*
+ * Map the buffer.
+ */
+static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct watch_queue *wqueue = file->private_data;
+
+	if (vma->vm_pgoff != 0 ||
+	    vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
+	    !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
+		return -EINVAL;
+
+	vma->vm_ops = &watch_queue_vm_ops;
+
+	vma_interval_tree_insert(vma, &wqueue->mapping.i_mmap);
+	return 0;
+}
+
+/*
+ * Allocate the required number of pages.
+ */
+static long watch_queue_set_size(struct watch_queue *wqueue, unsigned long nr_pages)
+{
+	struct watch_queue_buffer *buf;
+	u32 len;
+	int i;
+
+	if (nr_pages == 0 ||
+	    nr_pages > 16 || /* TODO: choose a better hard limit */
+	    !is_power_of_2(nr_pages))
+		return -EINVAL;
+
+	wqueue->pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!wqueue->pages)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		wqueue->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!wqueue->pages[i])
+			goto err_some_pages;
+		wqueue->pages[i]->mapping = &wqueue->mapping;
+		SetPageUptodate(wqueue->pages[i]);
+	}
+
+	buf = vmap(wqueue->pages, nr_pages, VM_MAP, PAGE_SHARED);
+	if (!buf)
+		goto err_some_pages;
+
+	wqueue->buffer = buf;
+	wqueue->nr_pages = nr_pages;
+	wqueue->size = ((nr_pages * PAGE_SIZE) / sizeof(struct watch_notification));
+
+	/* The first four slots in the buffer contain metadata about the ring,
+	 * including the head and tail indices and mask.
+	 */
+	len = sizeof(buf->meta) / sizeof(buf->slots[0]);
+	buf->meta.watch.info	= len << WATCH_LENGTH_SHIFT;
+	buf->meta.watch.type	= WATCH_TYPE_META;
+	buf->meta.watch.subtype	= WATCH_META_SKIP_NOTIFICATION;
+	buf->meta.tail		= len;
+	buf->meta.mask		= wqueue->size - 1;
+	smp_store_release(&buf->meta.head, len);
+	return 0;
+
+err_some_pages:
+	for (i--; i >= 0; i--) {
+		ClearPageUptodate(wqueue->pages[i]);
+		wqueue->pages[i]->mapping = NULL;
+		put_page(wqueue->pages[i]);
+	}
+
+	kfree(wqueue->pages);
+	wqueue->pages = NULL;
+err:
+	return -ENOMEM;
+}
+
+/*
+ * Set the filter on a watch queue.
+ */
+static long watch_queue_set_filter(struct inode *inode,
+				   struct watch_queue *wqueue,
+				   struct watch_notification_filter __user *_filter)
+{
+	struct watch_notification_type_filter *tf;
+	struct watch_notification_filter filter;
+	struct watch_type_filter *q;
+	struct watch_filter *wfilter;
+	int ret, nr_filter = 0, i;
+
+	if (!_filter) {
+		/* Remove the old filter */
+		wfilter = NULL;
+		goto set;
+	}
+
+	/* Grab the user's filter specification */
+	if (copy_from_user(&filter, _filter, sizeof(filter)) != 0)
+		return -EFAULT;
+	if (filter.nr_filters == 0 ||
+	    filter.nr_filters > 16 ||
+	    filter.__reserved != 0)
+		return -EINVAL;
+
+	tf = memdup_user(_filter->filters, filter.nr_filters * sizeof(*tf));
+	if (IS_ERR(tf))
+		return PTR_ERR(tf);
+
+	ret = -EINVAL;
+	for (i = 0; i < filter.nr_filters; i++) {
+		if ((tf[i].info_filter & ~tf[i].info_mask) ||
+		    tf[i].info_mask & WATCH_INFO_LENGTH)
+			goto err_filter;
+		/* Ignore any unknown types */
+		if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
+			continue;
+		nr_filter++;
+	}
+
+	/* Now we need to build the internal filter from only the relevant
+	 * user-specified filters.
+	 */
+	ret = -ENOMEM;
+	wfilter = kzalloc(struct_size(wfilter, filters, nr_filter), GFP_KERNEL);
+	if (!wfilter)
+		goto err_filter;
+	wfilter->nr_filters = nr_filter;
+
+	q = wfilter->filters;
+	for (i = 0; i < filter.nr_filters; i++) {
+		if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
+			continue;
+
+		q->type			= tf[i].type;
+		q->info_filter		= tf[i].info_filter;
+		q->info_mask		= tf[i].info_mask;
+		q->subtype_filter[0]	= tf[i].subtype_filter[0];
+		__set_bit(q->type, wfilter->type_filter);
+		q++;
+	}
+
+	kfree(tf);
+set:
+	rcu_swap_protected(wqueue->filter, wfilter,
+			   lockdep_is_held(&inode->i_rwsem));
+	if (wfilter)
+		kfree_rcu(wfilter, rcu);
+	return 0;
+
+err_filter:
+	kfree(tf);
+	return ret;
+}
+
+/*
+ * Set parameters.
+ */
+static long watch_queue_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct watch_queue *wqueue = file->private_data;
+	struct inode *inode = file_inode(file);
+	long ret;
+
+	switch (cmd) {
+	case IOC_WATCH_QUEUE_SET_SIZE:
+		if (wqueue->buffer)
+			return -EBUSY;
+		inode_lock(inode);
+		ret = watch_queue_set_size(wqueue, arg);
+		inode_unlock(inode);
+		return ret;
+
+	case IOC_WATCH_QUEUE_SET_FILTER:
+		inode_lock(inode);
+		ret = watch_queue_set_filter(
+			inode, wqueue,
+			(struct watch_notification_filter __user *)arg);
+		inode_unlock(inode);
+		return ret;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/*
+ * Open the file.
+ */
+static int watch_queue_open(struct inode *inode, struct file *file)
+{
+	struct watch_queue *wqueue;
+
+	wqueue = kzalloc(sizeof(*wqueue), GFP_KERNEL);
+	if (!wqueue)
+		return -ENOMEM;
+
+	wqueue->mapping.a_ops = &watch_queue_aops;
+	wqueue->mapping.i_mmap = RB_ROOT_CACHED;
+	init_rwsem(&wqueue->mapping.i_mmap_rwsem);
+	spin_lock_init(&wqueue->mapping.private_lock);
+
+	refcount_set(&wqueue->usage, 1);
+	spin_lock_init(&wqueue->lock);
+	init_waitqueue_head(&wqueue->waiters);
+	wqueue->cred = get_cred(file->f_cred);
+
+	file->private_data = wqueue;
+	return 0;
+}
+
+/**
+ * put_watch_queue - Dispose of a ref on a watchqueue.
+ * @wqueue: The watch queue to unref.
+ */
+void put_watch_queue(struct watch_queue *wqueue)
+{
+	if (refcount_dec_and_test(&wqueue->usage))
+		kfree_rcu(wqueue, rcu);
+}
+EXPORT_SYMBOL(put_watch_queue);
+
+static void free_watch(struct rcu_head *rcu)
+{
+	struct watch *watch = container_of(rcu, struct watch, rcu);
+
+	put_watch_queue(rcu_access_pointer(watch->queue));
+}
+
+/*
+ * Discard a watch.
+ */
+static void put_watch(struct watch *watch)
+{
+	if (refcount_dec_and_test(&watch->usage))
+		call_rcu(&watch->rcu, free_watch);
+}
+
+/**
+ * init_watch_queue - Initialise a watch
+ * @watch: The watch to initialise.
+ * @wqueue: The queue to assign.
+ *
+ * Initialise a watch and set the watch queue.
+ */
+void init_watch(struct watch *watch, struct watch_queue *wqueue)
+{
+	refcount_set(&watch->usage, 1);
+	INIT_HLIST_NODE(&watch->list_node);
+	INIT_HLIST_NODE(&watch->queue_node);
+	rcu_assign_pointer(watch->queue, wqueue);
+}
+
+/**
+ * add_watch_to_object - Add a watch on an object to a watch list
+ * @watch: The watch to add
+ * @wlist: The watch list to add to
+ *
+ * @watch->queue must have been set to point to the queue to post notifications
+ * to and the watch list of the object to be watched.
+ *
+ * The caller must pin the queue and the list both and must hold the list
+ * locked against racing watch additions/removals.
+ */
+int add_watch_to_object(struct watch *watch, struct watch_list *wlist)
+{
+	struct watch_queue *wqueue = rcu_access_pointer(watch->queue);
+	struct watch *w;
+
+	hlist_for_each_entry(w, &wlist->watchers, list_node) {
+		if (watch->id == w->id)
+			return -EBUSY;
+	}
+
+	rcu_assign_pointer(watch->watch_list, wlist);
+
+	spin_lock_bh(&wqueue->lock);
+	refcount_inc(&wqueue->usage);
+	hlist_add_head(&watch->queue_node, &wqueue->watches);
+	spin_unlock_bh(&wqueue->lock);
+
+	hlist_add_head(&watch->list_node, &wlist->watchers);
+	return 0;
+}
+EXPORT_SYMBOL(add_watch_to_object);
+
+/**
+ * remove_watch_from_object - Remove a watch or all watches from an object.
+ * @wlist: The watch list to remove from
+ * @wq: The watch queue of interest (ignored if @all is true)
+ * @id: The ID of the watch to remove (ignored if @all is true)
+ * @all: True to remove all objects
+ *
+ * Remove a specific watch or all watches from an object.  A notification is
+ * sent to the watcher to tell them that this happened.
+ */
+int remove_watch_from_object(struct watch_list *wlist, struct watch_queue *wq,
+			     u64 id, bool all)
+{
+	struct watch_notification n;
+	struct watch_queue *wqueue;
+	struct watch *watch;
+	int ret = -EBADSLT;
+
+	rcu_read_lock();
+
+again:
+	spin_lock(&wlist->lock);
+	hlist_for_each_entry(watch, &wlist->watchers, list_node) {
+		if (all ||
+		    (watch->id == id && rcu_access_pointer(watch->queue) == wq))
+			goto found;
+	}
+	spin_unlock(&wlist->lock);
+	goto out;
+
+found:
+	ret = 0;
+	hlist_del_init_rcu(&watch->list_node);
+	rcu_assign_pointer(watch->watch_list, NULL);
+	spin_unlock(&wlist->lock);
+
+	n.type = WATCH_TYPE_META;
+	n.subtype = WATCH_META_REMOVAL_NOTIFICATION;
+	n.info = watch->info_id | sizeof(n);
+
+	wqueue = rcu_dereference(watch->queue);
+	post_one_notification(wqueue, &n, wq ? wq->cred : NULL);
+
+	/* We don't need the watch list lock for the next bit as RCU is
+	 * protecting everything from being deallocated.
+	 */
+	if (wqueue) {
+		spin_lock_bh(&wqueue->lock);
+
+		if (!hlist_unhashed(&watch->queue_node)) {
+			hlist_del_init_rcu(&watch->queue_node);
+			put_watch(watch);
+		}
+
+		spin_unlock_bh(&wqueue->lock);
+	}
+
+	if (wlist->release_watch) {
+		rcu_read_unlock();
+		wlist->release_watch(wlist, watch);
+		rcu_read_lock();
+	}
+	put_watch(watch);
+
+	if (all && !hlist_empty(&wlist->watchers))
+		goto again;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(remove_watch_from_object);
+
+/*
+ * Remove all the watches that are contributory to a queue.  This will
+ * potentially race with removal of the watches by the destruction of the
+ * objects being watched or the distribution of notifications.
+ */
+static void watch_queue_clear(struct watch_queue *wqueue)
+{
+	struct watch_list *wlist;
+	struct watch *watch;
+	bool release;
+
+	rcu_read_lock();
+	spin_lock_bh(&wqueue->lock);
+
+	/* Prevent new additions and prevent notifications from happening */
+	wqueue->defunct = true;
+
+	while (!hlist_empty(&wqueue->watches)) {
+		watch = hlist_entry(wqueue->watches.first, struct watch, queue_node);
+		hlist_del_init_rcu(&watch->queue_node);
+		spin_unlock_bh(&wqueue->lock);
+
+		/* We can't do the next bit under the queue lock as we need to
+		 * get the list lock - which would cause a deadlock if someone
+		 * was removing from the opposite direction at the same time or
+		 * posting a notification.
+		 */
+		wlist = rcu_dereference(watch->watch_list);
+		if (wlist) {
+			spin_lock(&wlist->lock);
+
+			release = !hlist_unhashed(&watch->list_node);
+			if (release) {
+				hlist_del_init_rcu(&watch->list_node);
+				rcu_assign_pointer(watch->watch_list, NULL);
+			}
+
+			spin_unlock(&wlist->lock);
+
+			if (release) {
+				if (wlist->release_watch) {
+					rcu_read_unlock();
+					/* This might need to call dput(), so
+					 * we have to drop all the locks.
+					 */
+					wlist->release_watch(wlist, watch);
+					rcu_read_lock();
+				}
+				put_watch(watch);
+			}
+		}
+
+		put_watch(watch);
+		spin_lock_bh(&wqueue->lock);
+	}
+
+	spin_unlock_bh(&wqueue->lock);
+	rcu_read_unlock();
+}
+
+/*
+ * Release the file.
+ */
+static int watch_queue_release(struct inode *inode, struct file *file)
+{
+	struct watch_filter *wfilter;
+	struct watch_queue *wqueue = file->private_data;
+	int i, pgref;
+
+	watch_queue_clear(wqueue);
+
+	if (wqueue->pages && wqueue->pages[0])
+		WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
+
+	if (wqueue->buffer)
+		vfree(wqueue->buffer);
+	for (i = 0; i < wqueue->nr_pages; i++) {
+		ClearPageUptodate(wqueue->pages[i]);
+		wqueue->pages[i]->mapping = NULL;
+		pgref = page_ref_count(wqueue->pages[i]);
+		WARN(pgref != 1,
+		     "FREE PAGE[%d] refcount %d\n", i, page_ref_count(wqueue->pages[i]));
+		__free_page(wqueue->pages[i]);
+	}
+
+	wfilter = rcu_access_pointer(wqueue->filter);
+	if (wfilter)
+		kfree_rcu(wfilter, rcu);
+	kfree(wqueue->pages);
+	put_cred(wqueue->cred);
+	put_watch_queue(wqueue);
+	return 0;
+}
+
+#ifdef DEBUG_WITH_WRITE
+static ssize_t watch_queue_write(struct file *file,
+				 const char __user *_buf, size_t len, loff_t *pos)
+{
+	struct watch_notification *n;
+	struct watch_queue *wqueue = file->private_data;
+	ssize_t ret;
+
+	if (!wqueue->buffer)
+		return -ENOBUFS;
+
+	if (len & ~WATCH_INFO_LENGTH || len == 0 || !_buf)
+		return -EINVAL;
+
+	n = memdup_user(_buf, len);
+	if (IS_ERR(n))
+		return PTR_ERR(n);
+
+	ret = -EINVAL;
+	if ((n->info & WATCH_INFO_LENGTH) != len)
+		goto error;
+	n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
+
+	if (post_one_notification(wqueue, n, file->f_cred))
+		wqueue->debug = 0;
+	else
+		wqueue->debug++;
+	ret = len;
+	if (wqueue->debug > 20)
+		ret = -EIO;
+
+error:
+	kfree(n);
+	return ret;
+}
+#endif
+
+static const struct file_operations watch_queue_fops = {
+	.owner		= THIS_MODULE,
+	.open		= watch_queue_open,
+	.release	= watch_queue_release,
+	.unlocked_ioctl	= watch_queue_ioctl,
+	.poll		= watch_queue_poll,
+	.mmap		= watch_queue_mmap,
+#ifdef DEBUG_WITH_WRITE
+	.write		= watch_queue_write,
+#endif
+	.llseek		= no_llseek,
+};
+
+/**
+ * get_watch_queue - Get a watch queue from its file descriptor.
+ * @fd: The fd to query.
+ */
+struct watch_queue *get_watch_queue(int fd)
+{
+	struct watch_queue *wqueue = ERR_PTR(-EBADF);
+	struct fd f;
+
+	f = fdget(fd);
+	if (f.file) {
+		wqueue = ERR_PTR(-EINVAL);
+		if (f.file->f_op == &watch_queue_fops) {
+			wqueue = f.file->private_data;
+			refcount_inc(&wqueue->usage);
+		}
+		fdput(f);
+	}
+
+	return wqueue;
+}
+EXPORT_SYMBOL(get_watch_queue);
+
+static struct miscdevice watch_queue_dev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "watch_queue",
+	.fops	= &watch_queue_fops,
+	.mode	= 0666,
+};
+
+static int __init watch_queue_init(void)
+{
+	int ret;
+
+	ret = misc_register(&watch_queue_dev);
+	if (ret < 0)
+		pr_err("Failed to register %d\n", ret);
+	return ret;
+}
+fs_initcall(watch_queue_init);
+
+static void __exit watch_queue_exit(void)
+{
+	misc_deregister(&watch_queue_dev);
+}
+module_exit(watch_queue_exit);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 2474c3f785ca..2f72ea80d4fe 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1420,6 +1420,13 @@
  *	@ctx is a pointer in which to place the allocated security context.
  *	@ctxlen points to the place to put the length of @ctx.
  *
+ * @post_notification:
+ *	Check to see if a watch notification can be posted to a particular
+ *	queue.
+ *	@q_cred: The credentials of the target watch queue.
+ *	@cred: The event-triggerer's credentials
+ *	@n: The notification being posted
+ *
  * Security hooks for using the eBPF maps and programs functionalities through
  * eBPF syscalls.
  *
@@ -1698,6 +1705,11 @@ union security_list_options {
 	int (*inode_notifysecctx)(struct inode *inode, void *ctx, u32 ctxlen);
 	int (*inode_setsecctx)(struct dentry *dentry, void *ctx, u32 ctxlen);
 	int (*inode_getsecctx)(struct inode *inode, void **ctx, u32 *ctxlen);
+#ifdef CONFIG_WATCH_QUEUE
+	int (*post_notification)(const struct cred *q_cred,
+				 const struct cred *cred,
+				 struct watch_notification *n);
+#endif
 
 #ifdef CONFIG_SECURITY_NETWORK
 	int (*unix_stream_connect)(struct sock *sock, struct sock *other,
@@ -1977,6 +1989,9 @@ struct security_hook_heads {
 	struct hlist_head inode_notifysecctx;
 	struct hlist_head inode_setsecctx;
 	struct hlist_head inode_getsecctx;
+#ifdef CONFIG_WATCH_QUEUE
+	struct hlist_head post_notification;
+#endif
 #ifdef CONFIG_SECURITY_NETWORK
 	struct hlist_head unix_stream_connect;
 	struct hlist_head unix_may_send;
diff --git a/include/linux/security.h b/include/linux/security.h
index 23c8b602c0ab..1df8d55de8da 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -58,6 +58,7 @@ struct fs_context;
 struct fs_parameter;
 enum fs_value_type;
 struct fsinfo_kparams;
+struct watch_notification;
 
 /* Default (no) options for the capable function */
 #define CAP_OPT_NONE 0x0
@@ -396,6 +397,11 @@ void security_inode_invalidate_secctx(struct inode *inode);
 int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
 int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
 int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
+#ifdef CONFIG_WATCH_QUEUE
+int security_post_notification(const struct cred *q_cred,
+			       const struct cred *cred,
+			       struct watch_notification *n);
+#endif
 #else /* CONFIG_SECURITY */
 
 static inline int call_lsm_notifier(enum lsm_event event, void *data)
@@ -1215,6 +1221,14 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
 {
 	return -EOPNOTSUPP;
 }
+#ifdef CONFIG_WATCH_QUEUE
+static inline int security_post_notification(const struct cred *q_cred,
+					     const struct cred *cred,
+					     struct watch_notification *n)
+{
+	return 0;
+}
+#endif
 #endif	/* CONFIG_SECURITY */
 
 #ifdef CONFIG_SECURITY_NETWORK
diff --git a/include/linux/watch_queue.h b/include/linux/watch_queue.h
new file mode 100644
index 000000000000..f200b68c799e
--- /dev/null
+++ b/include/linux/watch_queue.h
@@ -0,0 +1,86 @@
+/* User-mappable watch queue
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ *
+ * See Documentation/watch_queue.rst
+ */
+
+#ifndef _LINUX_WATCH_QUEUE_H
+#define _LINUX_WATCH_QUEUE_H
+
+#include <uapi/linux/watch_queue.h>
+
+#ifdef CONFIG_WATCH_QUEUE
+
+struct watch_queue;
+
+/*
+ * Representation of a watch on an object.
+ */
+struct watch {
+	union {
+		struct rcu_head	rcu;
+		u32		info_id;	/* ID to be OR'd in to info field */
+	};
+	struct watch_queue __rcu *queue;	/* Queue to post events to */
+	struct hlist_node	queue_node;	/* Link in queue->watches */
+	struct watch_list __rcu	*watch_list;
+	struct hlist_node	list_node;	/* Link in watch_list->watchers */
+	void			*private;	/* Private data for the watched object */
+	u64			id;		/* Internal identifier */
+	refcount_t		usage;
+};
+
+/*
+ * List of watches on an object.
+ */
+struct watch_list {
+	struct rcu_head		rcu;
+	struct hlist_head	watchers;
+	void (*release_watch)(struct watch_list *, struct watch *);
+	spinlock_t		lock;
+};
+
+extern void __post_watch_notification(struct watch_list *,
+				      struct watch_notification *,
+				      const struct cred *,
+				      u64);
+extern struct watch_queue *get_watch_queue(int);
+extern void put_watch_queue(struct watch_queue *);
+extern void put_watch_list(struct watch_list *);
+extern void init_watch(struct watch *, struct watch_queue *);
+extern int add_watch_to_object(struct watch *, struct watch_list *);
+extern int remove_watch_from_object(struct watch_list *, struct watch_queue *, u64, bool);
+
+static inline void init_watch_list(struct watch_list *wlist)
+{
+	INIT_HLIST_HEAD(&wlist->watchers);
+	spin_lock_init(&wlist->lock);
+}
+
+static inline void post_watch_notification(struct watch_list *wlist,
+					   struct watch_notification *n,
+					   const struct cred *cred,
+					   u64 id)
+{
+	if (unlikely(wlist))
+		__post_watch_notification(wlist, n, cred, id);
+}
+
+static inline void remove_watch_list(struct watch_list *wlist)
+{
+	if (wlist) {
+		remove_watch_from_object(wlist, NULL, 0, true);
+		kfree_rcu(wlist, rcu);
+	}
+}
+
+#endif
+
+#endif /* _LINUX_WATCH_QUEUE_H */
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
new file mode 100644
index 000000000000..01746982c2ba
--- /dev/null
+++ b/include/uapi/linux/watch_queue.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_WATCH_QUEUE_H
+#define _UAPI_LINUX_WATCH_QUEUE_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOC_WATCH_QUEUE_SET_SIZE	_IO('s', 0x01)	/* Set the size in pages */
+#define IOC_WATCH_QUEUE_SET_FILTER	_IO('s', 0x02)	/* Set the filter */
+
+enum watch_notification_type {
+	WATCH_TYPE_META		= 0,	/* Special record */
+	WATCH_TYPE_MOUNT_NOTIFY	= 1,	/* Mount notification record */
+	WATCH_TYPE_SB_NOTIFY	= 2,	/* Superblock notification */
+	WATCH_TYPE_KEY_NOTIFY	= 3,	/* Key/keyring change notification */
+	WATCH_TYPE_BLOCK_NOTIFY	= 4,	/* Block layer notifications */
+#define WATCH_TYPE___NR 5
+};
+
+enum watch_meta_notification_subtype {
+	WATCH_META_SKIP_NOTIFICATION	= 0,	/* Just skip this record */
+	WATCH_META_REMOVAL_NOTIFICATION	= 1,	/* Watched object was removed */
+};
+
+/*
+ * Notification record
+ */
+struct watch_notification {
+	__u32			type:24;	/* enum watch_notification_type */
+	__u32			subtype:8;	/* Type-specific subtype (filterable) */
+	__u32			info;
+#define WATCH_INFO_OVERRUN	0x00000001	/* Event(s) lost due to overrun */
+#define WATCH_INFO_ENOMEM	0x00000002	/* Event(s) lost due to ENOMEM */
+#define WATCH_INFO_RECURSIVE	0x00000004	/* Change was recursive */
+#define WATCH_INFO_LENGTH	0x000001f8	/* Length of record / sizeof(watch_notification) */
+#define WATCH_INFO_IN_SUBTREE	0x00000200	/* Change was not at watched root */
+#define WATCH_INFO_TYPE_FLAGS	0x00ff0000	/* Type-specific flags */
+#define WATCH_INFO_FLAG_0	0x00010000
+#define WATCH_INFO_FLAG_1	0x00020000
+#define WATCH_INFO_FLAG_2	0x00040000
+#define WATCH_INFO_FLAG_3	0x00080000
+#define WATCH_INFO_FLAG_4	0x00100000
+#define WATCH_INFO_FLAG_5	0x00200000
+#define WATCH_INFO_FLAG_6	0x00400000
+#define WATCH_INFO_FLAG_7	0x00800000
+#define WATCH_INFO_ID		0xff000000	/* ID of watchpoint */
+};
+
+#define WATCH_LENGTH_SHIFT	3
+
+struct watch_queue_buffer {
+	union {
+		/* The first few entries are special, containing the
+		 * ring management variables.
+		 */
+		struct {
+			struct watch_notification watch; /* WATCH_TYPE_SKIP */
+			volatile __u32	head;		/* Ring head index */
+			volatile __u32	tail;		/* Ring tail index */
+			__u32		mask;		/* Ring index mask */
+		} meta;
+		struct watch_notification slots[0];
+	};
+};
+
+/*
+ * Notification filtering rules (IOC_WATCH_QUEUE_SET_FILTER).
+ */
+struct watch_notification_type_filter {
+	__u32	type;			/* Type to apply filter to */
+	__u32	info_filter;		/* Filter on watch_notification::info */
+	__u32	info_mask;		/* Mask of relevant bits in info_filter */
+	__u32	subtype_filter[8];	/* Bitmask of subtypes to filter on */
+};
+
+struct watch_notification_filter {
+	__u32	nr_filters;		/* Number of filters */
+	__u32	__reserved;		/* Must be 0 */
+	struct watch_notification_type_filter filters[];
+};
+
+#endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/mm/interval_tree.c b/mm/interval_tree.c
index 27ddfd29112a..9a53ddf4bd62 100644
--- a/mm/interval_tree.c
+++ b/mm/interval_tree.c
@@ -25,6 +25,8 @@ INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb,
 		     unsigned long, shared.rb_subtree_last,
 		     vma_start_pgoff, vma_last_pgoff,, vma_interval_tree)
 
+EXPORT_SYMBOL_GPL(vma_interval_tree_insert);
+
 /* Insert node immediately after prev in the interval tree */
 void vma_interval_tree_insert_after(struct vm_area_struct *node,
 				    struct vm_area_struct *prev,
diff --git a/mm/memory.c b/mm/memory.c
index 96f1d473c89a..9f2fa2138287 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3360,6 +3360,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(alloc_set_pte);
 
 
 /**
diff --git a/security/security.c b/security/security.c
index 3af886e8fced..af758dc71e24 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1929,6 +1929,15 @@ int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen)
 }
 EXPORT_SYMBOL(security_inode_getsecctx);
 
+#ifdef CONFIG_WATCH_QUEUE
+int security_post_notification(const struct cred *q_cred,
+			       const struct cred *cred,
+			       struct watch_notification *n)
+{
+	return call_int_hook(post_notification, 0, q_cred, cred, n);
+}
+#endif
+
 #ifdef CONFIG_SECURITY_NETWORK
 
 int security_unix_stream_connect(struct sock *sock, struct sock *other, struct sock *newsk)


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 2/7] keys: Add a notification facility
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Add a key/keyring change notification facility whereby notifications about
changes in key and keyring content and attributes can be received.

Firstly, an event queue needs to be created:

	fd = open("/dev/event_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);

then a notification can be set up to report notifications via that queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_KEY_NOTIFY,
				.subtype_filter[0] = UINT_MAX,
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);

After that, records will be placed into the queue when events occur in
which keys are changed in some way.  Records are of the following format:

	struct key_notification {
		struct watch_notification watch;
		__u32	key_id;
		__u32	aux;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_KEY_NOTIFY.

	n->watch.subtype will indicate the type of event, such as
	NOTIFY_KEY_REVOKED.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the second argument to
	keyctl_watch_key(), shifted.

	n->key will be the ID of the affected key.

	n->aux will hold subtype-dependent information, such as the key
	being linked into the keyring specified by n->key in the case of
	NOTIFY_KEY_LINKED.

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.  Note also that
the queue can be shared between multiple notifications of various types.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/security/keys/core.rst |   58 ++++++++++++++++++++++
 include/linux/key.h                  |    4 ++
 include/uapi/linux/keyctl.h          |    1 
 include/uapi/linux/watch_queue.h     |   25 ++++++++++
 security/keys/Kconfig                |   10 ++++
 security/keys/compat.c               |    2 +
 security/keys/gc.c                   |    5 ++
 security/keys/internal.h             |   30 +++++++++++-
 security/keys/key.c                  |   37 +++++++++-----
 security/keys/keyctl.c               |   88 +++++++++++++++++++++++++++++++++-
 security/keys/keyring.c              |   17 +++++--
 security/keys/request_key.c          |    4 +-
 12 files changed, 257 insertions(+), 24 deletions(-)

diff --git a/Documentation/security/keys/core.rst b/Documentation/security/keys/core.rst
index 9521c4207f01..05ef58c753f3 100644
--- a/Documentation/security/keys/core.rst
+++ b/Documentation/security/keys/core.rst
@@ -808,6 +808,7 @@ The keyctl syscall functions are:
      A process must have search permission on the key for this function to be
      successful.
 
+
   *  Compute a Diffie-Hellman shared secret or public key::
 
 	long keyctl(KEYCTL_DH_COMPUTE, struct keyctl_dh_params *params,
@@ -1001,6 +1002,63 @@ The keyctl syscall functions are:
      written into the output buffer.  Verification returns 0 on success.
 
 
+  *  Watch a key or keyring for changes::
+
+	long keyctl(KEYCTL_WATCH_KEY, key_serial_t key, int queue_fd,
+		    const struct watch_notification_filter *filter);
+
+     This will set or remove a watch for changes on the specified key or
+     keyring.
+
+     "key" is the ID of the key to be watched.
+
+     "queue_fd" is a file descriptor referring to an open "/dev/watch_queue"
+     which manages the buffer into which notifications will be delivered.
+
+     "filter" is either NULL to remove a watch or a filter specification to
+     indicate what events are required from the key.
+
+     See Documentation/watch_queue.rst for more information.
+
+     Note that only one watch may be emplaced for any particular { key,
+     queue_fd } combination.
+
+     Notification records look like::
+
+	struct key_notification {
+		struct watch_notification watch;
+		__u32	key_id;
+		__u32	aux;
+	};
+
+     In this, watch::type will be "WATCH_TYPE_KEY_NOTIFY" and subtype will be
+     one of::
+
+	NOTIFY_KEY_INSTANTIATED
+	NOTIFY_KEY_UPDATED
+	NOTIFY_KEY_LINKED
+	NOTIFY_KEY_UNLINKED
+	NOTIFY_KEY_CLEARED
+	NOTIFY_KEY_REVOKED
+	NOTIFY_KEY_INVALIDATED
+	NOTIFY_KEY_SETATTR
+
+     Where these indicate a key being instantiated/rejected, updated, a link
+     being made in a keyring, a link being removed from a keyring, a keyring
+     being cleared, a key being revoked, a key being invalidated or a key
+     having one of its attributes changed (user, group, perm, timeout,
+     restriction).
+
+     If a watched key is deleted, a basic watch_notification will be issued
+     with "type" set to WATCH_TYPE_META and "subtype" set to
+     watch_meta_removal_notification.  The watchpoint ID will be set in the
+     "info" field.
+
+     This needs to be configured by enabling:
+
+	"Provide key/keyring change notifications" (KEY_NOTIFICATIONS)
+
+
 Kernel Services
 ===============
 
diff --git a/include/linux/key.h b/include/linux/key.h
index 7099985e35a9..f1c43852c0c6 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -159,6 +159,9 @@ struct key {
 		struct list_head graveyard_link;
 		struct rb_node	serial_node;
 	};
+#ifdef CONFIG_KEY_NOTIFICATIONS
+	struct watch_list	*watchers;	/* Entities watching this key for changes */
+#endif
 	struct rw_semaphore	sem;		/* change vs change sem */
 	struct key_user		*user;		/* owner of this key */
 	void			*security;	/* security data for this key */
@@ -193,6 +196,7 @@ struct key {
 #define KEY_FLAG_ROOT_CAN_INVAL	7	/* set if key can be invalidated by root without permission */
 #define KEY_FLAG_KEEP		8	/* set if key should not be removed */
 #define KEY_FLAG_UID_KEYRING	9	/* set if key is a user or user session keyring */
+#define KEY_FLAG_SET_WATCH_PROXY 10	/* Set if watch_proxy should be set on added keys */
 
 	/* the key type and key description string
 	 * - the desc is used to match a key against search criteria
diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h
index f45ee0f69c0c..e9e7da849619 100644
--- a/include/uapi/linux/keyctl.h
+++ b/include/uapi/linux/keyctl.h
@@ -67,6 +67,7 @@
 #define KEYCTL_PKEY_SIGN		27	/* Create a public key signature */
 #define KEYCTL_PKEY_VERIFY		28	/* Verify a public key signature */
 #define KEYCTL_RESTRICT_KEYRING		29	/* Restrict keys allowed to link to a keyring */
+#define KEYCTL_WATCH_KEY		30	/* Watch a key or ring of keys for changes */
 
 /* keyctl structures */
 struct keyctl_dh_params {
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 01746982c2ba..e3bb35a480ae 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -79,4 +79,29 @@ struct watch_notification_filter {
 	struct watch_notification_type_filter filters[];
 };
 
+/*
+ * Type of key/keyring change notification.
+ */
+enum key_notification_subtype {
+	NOTIFY_KEY_INSTANTIATED	= 0, /* Key was instantiated (aux is error code) */
+	NOTIFY_KEY_UPDATED	= 1, /* Key was updated */
+	NOTIFY_KEY_LINKED	= 2, /* Key (aux) was added to watched keyring */
+	NOTIFY_KEY_UNLINKED	= 3, /* Key (aux) was removed from watched keyring */
+	NOTIFY_KEY_CLEARED	= 4, /* Keyring was cleared */
+	NOTIFY_KEY_REVOKED	= 5, /* Key was revoked */
+	NOTIFY_KEY_INVALIDATED	= 6, /* Key was invalidated */
+	NOTIFY_KEY_SETATTR	= 7, /* Key's attributes got changed */
+};
+
+/*
+ * Key/keyring notification record.
+ * - watch.type = WATCH_TYPE_KEY_NOTIFY
+ * - watch.subtype = enum key_notification_type
+ */
+struct key_notification {
+	struct watch_notification watch;
+	__u32	key_id;		/* The key/keyring affected */
+	__u32	aux;		/* Per-type auxiliary data */
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/security/keys/Kconfig b/security/keys/Kconfig
index 6462e6654ccf..fbe064fa0a17 100644
--- a/security/keys/Kconfig
+++ b/security/keys/Kconfig
@@ -101,3 +101,13 @@ config KEY_DH_OPERATIONS
 	 in the kernel.
 
 	 If you are unsure as to whether this is required, answer N.
+
+config KEY_NOTIFICATIONS
+	bool "Provide key/keyring change notifications"
+	depends on KEYS
+	select WATCH_QUEUE
+	help
+	  This option provides support for getting change notifications on keys
+	  and keyrings on which the caller has View permission.  This makes use
+	  of the /dev/watch_queue misc device to handle the notification
+	  buffer and provides KEYCTL_WATCH_KEY to enable/disable watches.
diff --git a/security/keys/compat.c b/security/keys/compat.c
index 9482df601dc3..021d8e1c9233 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -158,6 +158,8 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option,
 	case KEYCTL_PKEY_VERIFY:
 		return keyctl_pkey_verify(compat_ptr(arg2), compat_ptr(arg3),
 					  compat_ptr(arg4), compat_ptr(arg5));
+	case KEYCTL_WATCH_KEY:
+		return keyctl_watch_key(arg2, arg3, arg4);
 
 	default:
 		return -EOPNOTSUPP;
diff --git a/security/keys/gc.c b/security/keys/gc.c
index 634e96b380e8..b685b9a85a9e 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -135,6 +135,11 @@ static noinline void key_gc_unused_keys(struct list_head *keys)
 		kdebug("- %u", key->serial);
 		key_check(key);
 
+#ifdef CONFIG_KEY_NOTIFICATIONS
+		remove_watch_list(key->watchers);
+		key->watchers = NULL;
+#endif
+
 		/* Throw away the key data if the key is instantiated */
 		if (state == KEY_IS_POSITIVE && key->type->destroy)
 			key->type->destroy(key);
diff --git a/security/keys/internal.h b/security/keys/internal.h
index 8f533c81aa8d..a7ac0f823ade 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -19,6 +19,7 @@
 #include <linux/task_work.h>
 #include <linux/keyctl.h>
 #include <linux/refcount.h>
+#include <linux/watch_queue.h>
 #include <linux/compat.h>
 
 struct iovec;
@@ -97,7 +98,8 @@ extern int __key_link_begin(struct key *keyring,
 			    const struct keyring_index_key *index_key,
 			    struct assoc_array_edit **_edit);
 extern int __key_link_check_live_key(struct key *keyring, struct key *key);
-extern void __key_link(struct key *key, struct assoc_array_edit **_edit);
+extern void __key_link(struct key *keyring, struct key *key,
+		       struct assoc_array_edit **_edit);
 extern void __key_link_end(struct key *keyring,
 			   const struct keyring_index_key *index_key,
 			   struct assoc_array_edit *edit);
@@ -178,6 +180,23 @@ extern int key_task_permission(const key_ref_t key_ref,
 			       const struct cred *cred,
 			       key_perm_t perm);
 
+static inline void notify_key(struct key *key,
+			      enum key_notification_subtype subtype, u32 aux)
+{
+#ifdef CONFIG_KEY_NOTIFICATIONS
+	struct key_notification n = {
+		.watch.type	= WATCH_TYPE_KEY_NOTIFY,
+		.watch.subtype	= subtype,
+		.watch.info	= sizeof(n),
+		.key_id		= key_serial(key),
+		.aux		= aux,
+	};
+
+	post_watch_notification(key->watchers, &n.watch, current_cred(),
+				n.key_id);
+#endif
+}
+
 /*
  * Check to see whether permission is granted to use a key in the desired way.
  */
@@ -324,6 +343,15 @@ static inline long keyctl_pkey_e_d_s(int op,
 }
 #endif
 
+#ifdef CONFIG_KEY_NOTIFICATIONS
+extern long keyctl_watch_key(key_serial_t, int, int);
+#else
+static inline long keyctl_watch_key(key_serial_t key_id, int watch_fd, int watch_id)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 /*
  * Debugging key validation
  */
diff --git a/security/keys/key.c b/security/keys/key.c
index 696f1c092c50..9d9f94992470 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -412,6 +412,7 @@ static void mark_key_instantiated(struct key *key, int reject_error)
 	 */
 	smp_store_release(&key->state,
 			  (reject_error < 0) ? reject_error : KEY_IS_POSITIVE);
+	notify_key(key, NOTIFY_KEY_INSTANTIATED, reject_error);
 }
 
 /*
@@ -454,7 +455,7 @@ static int __key_instantiate_and_link(struct key *key,
 				if (test_bit(KEY_FLAG_KEEP, &keyring->flags))
 					set_bit(KEY_FLAG_KEEP, &key->flags);
 
-				__key_link(key, _edit);
+				__key_link(keyring, key, _edit);
 			}
 
 			/* disable the authorisation key */
@@ -603,7 +604,7 @@ int key_reject_and_link(struct key *key,
 
 		/* and link it into the destination keyring */
 		if (keyring && link_ret == 0)
-			__key_link(key, &edit);
+			__key_link(keyring, key, &edit);
 
 		/* disable the authorisation key */
 		if (authkey)
@@ -756,9 +757,11 @@ static inline key_ref_t __key_update(key_ref_t key_ref,
 	down_write(&key->sem);
 
 	ret = key->type->update(key, prep);
-	if (ret == 0)
+	if (ret == 0) {
 		/* Updating a negative key positively instantiates it */
 		mark_key_instantiated(key, 0);
+		notify_key(key, NOTIFY_KEY_UPDATED, 0);
+	}
 
 	up_write(&key->sem);
 
@@ -999,9 +1002,11 @@ int key_update(key_ref_t key_ref, const void *payload, size_t plen)
 	down_write(&key->sem);
 
 	ret = key->type->update(key, &prep);
-	if (ret == 0)
+	if (ret == 0) {
 		/* Updating a negative key positively instantiates it */
 		mark_key_instantiated(key, 0);
+		notify_key(key, NOTIFY_KEY_UPDATED, 0);
+	}
 
 	up_write(&key->sem);
 
@@ -1033,15 +1038,17 @@ void key_revoke(struct key *key)
 	 *   instantiated
 	 */
 	down_write_nested(&key->sem, 1);
-	if (!test_and_set_bit(KEY_FLAG_REVOKED, &key->flags) &&
-	    key->type->revoke)
-		key->type->revoke(key);
-
-	/* set the death time to no more than the expiry time */
-	time = ktime_get_real_seconds();
-	if (key->revoked_at == 0 || key->revoked_at > time) {
-		key->revoked_at = time;
-		key_schedule_gc(key->revoked_at + key_gc_delay);
+	if (!test_and_set_bit(KEY_FLAG_REVOKED, &key->flags)) {
+		notify_key(key, NOTIFY_KEY_REVOKED, 0);
+		if (key->type->revoke)
+			key->type->revoke(key);
+
+		/* set the death time to no more than the expiry time */
+		time = ktime_get_real_seconds();
+		if (key->revoked_at == 0 || key->revoked_at > time) {
+			key->revoked_at = time;
+			key_schedule_gc(key->revoked_at + key_gc_delay);
+		}
 	}
 
 	up_write(&key->sem);
@@ -1063,8 +1070,10 @@ void key_invalidate(struct key *key)
 
 	if (!test_bit(KEY_FLAG_INVALIDATED, &key->flags)) {
 		down_write_nested(&key->sem, 1);
-		if (!test_and_set_bit(KEY_FLAG_INVALIDATED, &key->flags))
+		if (!test_and_set_bit(KEY_FLAG_INVALIDATED, &key->flags)) {
+			notify_key(key, NOTIFY_KEY_INVALIDATED, 0);
 			key_schedule_gc_links();
+		}
 		up_write(&key->sem);
 	}
 }
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 3e4053a217c3..cc2e6feafbc7 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -914,6 +914,7 @@ long keyctl_chown_key(key_serial_t id, uid_t user, gid_t group)
 	if (group != (gid_t) -1)
 		key->gid = gid;
 
+	notify_key(key, NOTIFY_KEY_SETATTR, 0);
 	ret = 0;
 
 error_put:
@@ -964,6 +965,7 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	/* if we're not the sysadmin, we can only change a key that we own */
 	if (capable(CAP_SYS_ADMIN) || uid_eq(key->uid, current_fsuid())) {
 		key->perm = perm;
+		notify_key(key, NOTIFY_KEY_SETATTR, 0);
 		ret = 0;
 	}
 
@@ -1355,10 +1357,12 @@ long keyctl_set_timeout(key_serial_t id, unsigned timeout)
 okay:
 	key = key_ref_to_ptr(key_ref);
 	ret = 0;
-	if (test_bit(KEY_FLAG_KEEP, &key->flags))
+	if (test_bit(KEY_FLAG_KEEP, &key->flags)) {
 		ret = -EPERM;
-	else
+	} else {
 		key_set_timeout(key, timeout);
+		notify_key(key, NOTIFY_KEY_SETATTR, 0);
+	}
 	key_put(key);
 
 error:
@@ -1631,6 +1635,83 @@ long keyctl_restrict_keyring(key_serial_t id, const char __user *_type,
 	return ret;
 }
 
+#ifdef CONFIG_KEY_NOTIFICATIONS
+/*
+ * Watch for changes to a key.
+ *
+ * The caller must have View permission to watch a key or keyring.
+ */
+long keyctl_watch_key(key_serial_t id, int watch_queue_fd, int watch_id)
+{
+	struct watch_queue *wqueue;
+	struct watch_list *wlist = NULL;
+	struct watch *watch;
+	struct key *key;
+	key_ref_t key_ref;
+	long ret = -ENOMEM;
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+
+	key_ref = lookup_user_key(id, KEY_LOOKUP_CREATE, KEY_NEED_VIEW);
+	if (IS_ERR(key_ref))
+		return PTR_ERR(key_ref);
+	key = key_ref_to_ptr(key_ref);
+
+	wqueue = get_watch_queue(watch_queue_fd);
+	if (IS_ERR(wqueue)) {
+		ret = PTR_ERR(wqueue);
+		goto err_key;
+	}
+
+	if (watch_id >= 0) {
+		if (!key->watchers) {
+			wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+			if (!wlist)
+				goto err_wqueue;
+			INIT_HLIST_HEAD(&wlist->watchers);
+			spin_lock_init(&wlist->lock);
+		}
+
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wlist;
+
+		init_watch(watch, wqueue);
+		watch->id	= key->serial;
+		watch->info_id	= (u32)watch_id << 24;
+
+		down_write(&key->sem);
+		if (!key->watchers) {
+			key->watchers = wlist;
+			wlist = NULL;
+		}
+
+		ret = add_watch_to_object(watch, key->watchers);
+		up_write(&key->sem);
+
+		if (ret < 0)
+			kfree(watch);
+	} else if (key->watchers) {
+		down_write(&key->sem);
+		ret = remove_watch_from_object(key->watchers,
+					       wqueue, key_serial(key),
+					       false);
+		up_write(&key->sem);
+	} else {
+		ret = -EBADSLT;
+	}
+
+err_wlist:
+	kfree(wlist);
+err_wqueue:
+	put_watch_queue(wqueue);
+err_key:
+	key_put(key);
+	return ret;
+}
+#endif /* CONFIG_KEY_NOTIFICATIONS */
+
 /*
  * The key control system call
  */
@@ -1771,6 +1852,9 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			(const void __user *)arg4,
 			(const void __user *)arg5);
 
+	case KEYCTL_WATCH_KEY:
+		return keyctl_watch_key((key_serial_t)arg2, (int)arg3, (int)arg4);
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index e14f09e3a4b0..f0f9ab3c5587 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -1018,12 +1018,14 @@ int keyring_restrict(key_ref_t keyring_ref, const char *type,
 	down_write(&keyring->sem);
 	down_write(&keyring_serialise_restrict_sem);
 
-	if (keyring->restrict_link)
+	if (keyring->restrict_link) {
 		ret = -EEXIST;
-	else if (keyring_detect_restriction_cycle(keyring, restrict_link))
+	} else if (keyring_detect_restriction_cycle(keyring, restrict_link)) {
 		ret = -EDEADLK;
-	else
+	} else {
 		keyring->restrict_link = restrict_link;
+		notify_key(keyring, NOTIFY_KEY_SETATTR, 0);
+	}
 
 	up_write(&keyring_serialise_restrict_sem);
 	up_write(&keyring->sem);
@@ -1286,12 +1288,14 @@ int __key_link_check_live_key(struct key *keyring, struct key *key)
  * holds at most one link to any given key of a particular type+description
  * combination.
  */
-void __key_link(struct key *key, struct assoc_array_edit **_edit)
+void __key_link(struct key *keyring, struct key *key,
+		struct assoc_array_edit **_edit)
 {
 	__key_get(key);
 	assoc_array_insert_set_object(*_edit, keyring_key_to_ptr(key));
 	assoc_array_apply_edit(*_edit);
 	*_edit = NULL;
+	notify_key(keyring, NOTIFY_KEY_LINKED, key_serial(key));
 }
 
 /*
@@ -1369,7 +1373,7 @@ int key_link(struct key *keyring, struct key *key)
 		if (ret == 0)
 			ret = __key_link_check_live_key(keyring, key);
 		if (ret == 0)
-			__key_link(key, &edit);
+			__key_link(keyring, key, &edit);
 		__key_link_end(keyring, &key->index_key, edit);
 	}
 
@@ -1398,6 +1402,7 @@ EXPORT_SYMBOL(key_link);
 int key_unlink(struct key *keyring, struct key *key)
 {
 	struct assoc_array_edit *edit;
+	key_serial_t target = key_serial(key);
 	int ret;
 
 	key_check(keyring);
@@ -1419,6 +1424,7 @@ int key_unlink(struct key *keyring, struct key *key)
 		goto error;
 
 	assoc_array_apply_edit(edit);
+	notify_key(keyring, NOTIFY_KEY_UNLINKED, target);
 	key_payload_reserve(keyring, keyring->datalen - KEYQUOTA_LINK_BYTES);
 	ret = 0;
 
@@ -1452,6 +1458,7 @@ int keyring_clear(struct key *keyring)
 	} else {
 		if (edit)
 			assoc_array_apply_edit(edit);
+		notify_key(keyring, NOTIFY_KEY_CLEARED, 0);
 		key_payload_reserve(keyring, 0);
 		ret = 0;
 	}
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 75d87f9e0f49..5f474d0e8620 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -387,7 +387,7 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 		goto key_already_present;
 
 	if (dest_keyring)
-		__key_link(key, &edit);
+		__key_link(dest_keyring, key, &edit);
 
 	mutex_unlock(&key_construction_mutex);
 	if (dest_keyring)
@@ -406,7 +406,7 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 	if (dest_keyring) {
 		ret = __key_link_check_live_key(dest_keyring, key);
 		if (ret == 0)
-			__key_link(key, &edit);
+			__key_link(dest_keyring, key, &edit);
 		__key_link_end(dest_keyring, &ctx->index_key, edit);
 		if (ret < 0)
 			goto link_check_failed;


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
  2019-05-28 16:02 ` [PATCH 2/7] keys: Add a notification facility David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 20:06   ` Jann Horn
                     ` (4 more replies)
  2019-05-28 16:02 ` [PATCH 4/7] vfs: Add superblock notifications David Howells
                   ` (7 subsequent siblings)
  10 siblings, 5 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Add a mount notification facility whereby notifications about changes in
mount topology and configuration can be received.  Note that this only
covers vfsmount topology changes and not superblock events.  A separate
facility will be added for that.

Firstly, an event queue needs to be created:

	fd = open("/dev/event_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);

then a notification can be set up to report notifications via that queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_MOUNT_NOTIFY,
				.subtype_filter[0] = UINT_MAX,
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);

In this case, it would let me monitor the mount topology subtree rooted at
"/" for events.  Mount notifications propagate up the tree towards the
root, so a watch will catch all of the events happening in the subtree
rooted at the watch.

After setting the watch, records will be placed into the queue when, for
example, as superblock switches between read-write and read-only.  Records
are of the following format:

	struct mount_notification {
		struct watch_notification watch;
		__u32	triggered_on;
		__u32	changed_mount;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_MOUNT_NOTIFY.

	n->watch.subtype will indicate the type of event, such as
	NOTIFY_MOUNT_NEW_MOUNT.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the fifth argument to
	mount_notify(), shifted.

	n->watch.info & WATCH_INFO_FLAG_0 will be used for
	NOTIFY_MOUNT_READONLY, being set if the superblock becomes R/O, and
	being cleared otherwise, and for NOTIFY_MOUNT_NEW_MOUNT, being set
	if the new mount is a submount (e.g. an automount).

	n->triggered_on indicates the ID of the mount on which the watch
	was installed.

	n->changed_mount indicates the ID of the mount that was affected.

The mount IDs can be retrieved with the fsinfo() syscall, using the
fsinfo_mount_info and fsinfo_mount_child attributes.  There are
notification counters there too for when a buffer overrun occurs, thereby
allowing the mount tree to be quickly rescanned.

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.  Note also that
the queue can be shared between multiple notifications of various types.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Kconfig                             |    9 ++
 fs/Makefile                            |    1 
 fs/mount.h                             |   33 ++++--
 fs/mount_notify.c                      |  178 ++++++++++++++++++++++++++++++++
 fs/namespace.c                         |    9 +-
 include/linux/dcache.h                 |    1 
 include/linux/syscalls.h               |    2 
 include/uapi/linux/watch_queue.h       |   24 ++++
 kernel/sys_ni.c                        |    3 +
 11 files changed, 248 insertions(+), 14 deletions(-)
 create mode 100644 fs/mount_notify.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 03decae51513..a8416a9a0ccb 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -439,3 +439,4 @@
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
+435	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ea63df9a1020..ea052a94eb97 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -356,6 +356,7 @@
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
 434	common	fsinfo			__x64_sys_fsinfo
+435	common	mount_notify		__x64_sys_mount_notify
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Kconfig b/fs/Kconfig
index 9e7d2f2c0111..a26bbe27a791 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -121,6 +121,15 @@ source "fs/crypto/Kconfig"
 
 source "fs/notify/Kconfig"
 
+config MOUNT_NOTIFICATIONS
+	bool "Mount topology change notifications"
+	select WATCH_QUEUE
+	help
+	  This option provides support for getting change notifications on the
+	  mount tree topology.  This makes use of the /dev/watch_queue misc
+	  device to handle the notification buffer and provides the
+	  mount_notify() system call to enable/disable watchpoints.
+
 source "fs/quota/Kconfig"
 
 source "fs/autofs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 26eaeae4b9a1..c6a71daf2464 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -131,3 +131,4 @@ obj-$(CONFIG_F2FS_FS)		+= f2fs/
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_MOUNT_NOTIFICATIONS) += mount_notify.o
diff --git a/fs/mount.h b/fs/mount.h
index 47795802f78e..a95b805d00d8 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@
 #include <linux/poll.h>
 #include <linux/ns_common.h>
 #include <linux/fs_pin.h>
+#include <linux/watch_queue.h>
 
 struct mnt_namespace {
 	atomic_t		count;
@@ -67,9 +68,13 @@ struct mount {
 	int mnt_id;			/* mount identifier */
 	int mnt_group_id;		/* peer group identifier */
 	int mnt_expiry_mark;		/* true if marked for expiry */
+	int mnt_nr_watchers;		/* The number of subtree watches tracking this */
 	struct hlist_head mnt_pins;
 	struct fs_pin mnt_umount;
 	struct dentry *mnt_ex_mountpoint;
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
+#endif
 	atomic_t mnt_notify_counter;	/* Number of notifications generated */
 } __randomize_layout;
 
@@ -153,18 +158,8 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
 	return ns->seq == 0;
 }
 
-/*
- * Type of mount topology change notification.
- */
-enum mount_notification_subtype {
-	NOTIFY_MOUNT_NEW_MOUNT	= 0, /* New mount added */
-	NOTIFY_MOUNT_UNMOUNT	= 1, /* Mount removed manually */
-	NOTIFY_MOUNT_EXPIRY	= 2, /* Automount expired */
-	NOTIFY_MOUNT_READONLY	= 3, /* Mount R/O state changed */
-	NOTIFY_MOUNT_SETATTR	= 4, /* Mount attributes changed */
-	NOTIFY_MOUNT_MOVE_FROM	= 5, /* Mount moved from here */
-	NOTIFY_MOUNT_MOVE_TO	= 6, /* Mount moved to here (compare op_id) */
-};
+extern void post_mount_notification(struct mount *changed,
+				    struct mount_notification *notify);
 
 static inline void notify_mount(struct mount *changed,
 				struct mount *aux,
@@ -172,4 +167,18 @@ static inline void notify_mount(struct mount *changed,
 				u32 info_flags)
 {
 	atomic_inc(&changed->mnt_notify_counter);
+
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	{
+		struct mount_notification n = {
+			.watch.type	= WATCH_TYPE_MOUNT_NOTIFY,
+			.watch.subtype	= subtype,
+			.watch.info	= info_flags | sizeof(n),
+			.triggered_on	= changed->mnt_id,
+			.changed_mount	= aux ? aux->mnt_id : 0,
+		};
+
+		post_mount_notification(changed, &n);
+	}
+#endif
 }
diff --git a/fs/mount_notify.c b/fs/mount_notify.c
new file mode 100644
index 000000000000..6c7f323dbd4f
--- /dev/null
+++ b/fs/mount_notify.c
@@ -0,0 +1,178 @@
+/* Provide mount topology/attribute change notifications.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include "mount.h"
+
+/*
+ * Post mount notifications to all watches going rootwards along the tree.
+ *
+ * Must be called with the mount_lock held.
+ */
+void post_mount_notification(struct mount *changed,
+			     struct mount_notification *notify)
+{
+	const struct cred *cred = current_cred();
+	struct path cursor;
+	struct mount *mnt;
+	unsigned seq;
+
+	seq = 0;
+	rcu_read_lock();
+restart:
+	cursor.mnt = &changed->mnt;
+	cursor.dentry = changed->mnt.mnt_root;
+	mnt = real_mount(cursor.mnt);
+	notify->watch.info &= ~WATCH_INFO_IN_SUBTREE;
+
+	read_seqbegin_or_lock(&rename_lock, &seq);
+	for (;;) {
+		if (mnt->mnt_watchers &&
+		    !hlist_empty(&mnt->mnt_watchers->watchers)) {
+			if (cursor.dentry->d_flags & DCACHE_MOUNT_WATCH)
+				post_watch_notification(mnt->mnt_watchers,
+							&notify->watch, cred,
+							(unsigned long)cursor.dentry);
+		} else {
+			cursor.dentry = mnt->mnt.mnt_root;
+		}
+		notify->watch.info |= WATCH_INFO_IN_SUBTREE;
+
+		if (cursor.dentry == cursor.mnt->mnt_root ||
+		    IS_ROOT(cursor.dentry)) {
+			struct mount *parent = READ_ONCE(mnt->mnt_parent);
+
+			/* Escaped? */
+			if (cursor.dentry != cursor.mnt->mnt_root)
+				break;
+
+			/* Global root? */
+			if (mnt != parent) {
+				cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
+				mnt = parent;
+				cursor.mnt = &mnt->mnt;
+				continue;
+			}
+			break;
+		}
+
+		cursor.dentry = cursor.dentry->d_parent;
+	}
+
+	if (need_seqretry(&rename_lock, seq)) {
+		seq = 1;
+		goto restart;
+	}
+
+	done_seqretry(&rename_lock, seq);
+	rcu_read_unlock();
+}
+
+static void release_mount_watch(struct watch_list *wlist, struct watch *watch)
+{
+	struct vfsmount *mnt = watch->private;
+	struct dentry *dentry = (struct dentry *)(unsigned long)watch->id;
+
+	dput(dentry);
+	mntput(mnt);
+}
+
+/**
+ * sys_mount_notify - Watch for mount topology/attribute changes
+ * @dfd: Base directory to pathwalk from or fd referring to mount.
+ * @filename: Path to mount to place the watch upon
+ * @at_flags: Pathwalk control flags
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE5(mount_notify,
+		int, dfd,
+		const char __user *, filename,
+		unsigned int, at_flags,
+		int, watch_fd,
+		int, watch_id)
+{
+	struct watch_queue *wqueue;
+	struct watch_list *wlist = NULL;
+	struct watch *watch;
+	struct mount *m;
+	struct path path;
+	int ret;
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+
+	ret = user_path_at(dfd, filename, at_flags, &path);
+	if (ret)
+		return ret;
+
+	wqueue = get_watch_queue(watch_fd);
+	if (IS_ERR(wqueue))
+		goto err_path;
+
+	m = real_mount(path.mnt);
+
+	if (watch_id >= 0) {
+		if (!m->mnt_watchers) {
+			wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+			if (!wlist)
+				goto err_wqueue;
+			INIT_HLIST_HEAD(&wlist->watchers);
+			spin_lock_init(&wlist->lock);
+			wlist->release_watch = release_mount_watch;
+		}
+
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wlist;
+
+		init_watch(watch, wqueue);
+		watch->id		= (unsigned long)path.dentry;
+		watch->private		= path.mnt;
+		watch->info_id		= (u32)watch_id << 24;
+
+		down_write(&m->mnt.mnt_sb->s_umount);
+		if (!m->mnt_watchers) {
+			m->mnt_watchers = wlist;
+			wlist = NULL;
+		}
+
+		ret = add_watch_to_object(watch, m->mnt_watchers);
+		if (ret == 0) {
+			spin_lock(&path.dentry->d_lock);
+			path.dentry->d_flags |= DCACHE_MOUNT_WATCH;
+			spin_unlock(&path.dentry->d_lock);
+			path_get(&path);
+		}
+		up_write(&m->mnt.mnt_sb->s_umount);
+		if (ret < 0)
+			kfree(watch);
+	} else if (m->mnt_watchers) {
+		down_write(&m->mnt.mnt_sb->s_umount);
+		ret = remove_watch_from_object(m->mnt_watchers, wqueue,
+					       (unsigned long)path.dentry,
+					       false);
+		up_write(&m->mnt.mnt_sb->s_umount);
+	} else {
+		ret = -EBADSLT;
+	}
+
+err_wlist:
+	kfree(wlist);
+err_wqueue:
+	put_watch_queue(wqueue);
+err_path:
+	path_put(&path);
+	return ret;
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index ae03066b2d9b..de778b2e8ec4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -515,7 +515,8 @@ static int mnt_make_readonly(struct mount *mnt)
 	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
 	unlock_mount_hash();
 	if (ret == 0)
-		notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY, 0x10000);
+		notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY,
+			     WATCH_INFO_FLAG_0);
 	return ret;
 }
 
@@ -1478,6 +1479,10 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+		if (p->mnt_watchers)
+			remove_watch_list(p->mnt_watchers);
+#endif
 		ns = p->mnt_ns;
 		if (ns) {
 			ns->mounts--;
@@ -2115,7 +2120,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
 		notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_NEW_MOUNT,
 			     source_mnt->mnt.mnt_sb->s_flags & SB_SUBMOUNT ?
-			     0x10000 : 0);
+			     WATCH_INFO_FLAG_0 : 0);
 		commit_tree(source_mnt);
 	}
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 361305ddd75e..5db8e244d9a0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -217,6 +217,7 @@ struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_MOUNT_WATCH		0x80000000 /* There's a mount watch here */
 
 extern seqlock_t rename_lock;
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 217d25b62b4f..7c2b66175f3c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1001,6 +1001,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 asmlinkage long sys_fsinfo(int dfd, const char __user *path,
 			   struct fsinfo_params __user *params,
 			   void __user *buffer, size_t buf_size);
+asmlinkage long sys_mount_notify(int dfd, const char __user *path,
+				 unsigned int at_flags, int watch_fd, int watch_id);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index e3bb35a480ae..388b4141bcee 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -104,4 +104,28 @@ struct key_notification {
 	__u32	aux;		/* Per-type auxiliary data */
 };
 
+/*
+ * Type of mount topology change notification.
+ */
+enum mount_notification_subtype {
+	NOTIFY_MOUNT_NEW_MOUNT	= 0, /* New mount added */
+	NOTIFY_MOUNT_UNMOUNT	= 1, /* Mount removed manually */
+	NOTIFY_MOUNT_EXPIRY	= 2, /* Automount expired */
+	NOTIFY_MOUNT_READONLY	= 3, /* Mount R/O state changed */
+	NOTIFY_MOUNT_SETATTR	= 4, /* Mount attributes changed */
+	NOTIFY_MOUNT_MOVE_FROM	= 5, /* Mount moved from here */
+	NOTIFY_MOUNT_MOVE_TO	= 6, /* Mount moved to here (compare op_id) */
+};
+
+/*
+ * Mount topology/configuration change notification record.
+ * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
+ * - watch.subtype = enum mount_notification_subtype
+ */
+struct mount_notification {
+	struct watch_notification watch; /* WATCH_TYPE_MOUNT_NOTIFY */
+	__u32	triggered_on;		/* The mount that the notify was on */
+	__u32	changed_mount;		/* The mount that got changed */
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1d9d76cae1e..97b025e7863c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -88,6 +88,9 @@ COND_SYSCALL(ioprio_get);
 /* fs/locks.c */
 COND_SYSCALL(flock);
 
+/* fs/mount_notify.c */
+COND_SYSCALL(mount_notify);
+
 /* fs/namei.c */
 
 /* fs/namespace.c */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 4/7] vfs: Add superblock notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (2 preceding siblings ...)
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 20:27   ` Jann Horn
  2019-05-29 12:58   ` David Howells
  2019-05-28 16:02 ` [PATCH 5/7] fsinfo: Export superblock notification counter David Howells
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Add a superblock event notification facility whereby notifications about
superblock events, such as I/O errors (EIO), quota limits being hit
(EDQUOT) and running out of space (ENOSPC) can be reported to a monitoring
process asynchronously.  Note that this does not cover vfsmount topology
changes.  mount_notify() is used for that.

Firstly, an event queue needs to be created:

	fd = open("/dev/event_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);

then a notification can be set up to report notifications via that queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_SB_NOTIFY,
				.subtype_filter[0] = UINT_MAX,
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	sb_notify(AT_FDCWD, "/home/dhowells", 0, fd, 0x03);

In this case, it would let me monitor my own homedir for events.  After
setting the watch, records will be placed into the queue when, for example,
as superblock switches between read-write and read-only.  Records are of
the following format:

	struct superblock_notification {
		struct watch_notification watch;
		__u64	sb_id;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_SB_NOTIFY.

	n->watch.subtype will indicate the type of event, such as
	NOTIFY_SUPERBLOCK_READONLY.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the fifth argument to
	sb_notify(), shifted.

	n->watch.info & WATCH_INFO_FLAG_0 will be used for
	NOTIFY_SUPERBLOCK_READONLY, being set if the superblock becomes
	R/O, and being cleared otherwise.

	n->sb_id will be the ID of the superblock, as can be retrieved with
	the fsinfo() syscall, as part of the fsinfo_sb_notifications
	attribute in the the watch_id field.

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.  Note also that
the queue can be shared between multiple notifications of various types.

[*] QUESTION: Does this want to be per-sb, per-mount_namespace,
    per-some-new-notify-ns or per-system?  Or do multiple options make
    sense?

[*] QUESTION: I've done it this way so that anyone could theoretically
    monitor the superblock of any filesystem they can pathwalk to, but do
    we need other security controls?

[*] QUESTION: Should the LSM be able to filter the events a queue can
    receive?  For instance the opener of the queue would grant that queue
    subject creds (by ->f_cred) that could be used to govern what events
    could be seen, assuming the target superblock to have some object
    creds, based on, say, the mounter.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Kconfig                             |   12 +++
 fs/super.c                             |  115 ++++++++++++++++++++++++++++++++
 include/linux/fs.h                     |   77 +++++++++++++++++++++
 include/linux/syscalls.h               |    2 +
 include/uapi/linux/watch_queue.h       |   26 +++++++
 kernel/sys_ni.c                        |    3 +
 8 files changed, 237 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index a8416a9a0ccb..429416ce60e1 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
 435	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
+436	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ea052a94eb97..4ae146e472db 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			__x64_sys_fspick
 434	common	fsinfo			__x64_sys_fsinfo
 435	common	mount_notify		__x64_sys_mount_notify
+436	common	sb_notify		__x64_sys_sb_notify
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Kconfig b/fs/Kconfig
index a26bbe27a791..fc0fa4b35f3c 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -130,6 +130,18 @@ config MOUNT_NOTIFICATIONS
 	  device to handle the notification buffer and provides the
 	  mount_notify() system call to enable/disable watchpoints.
 
+config SB_NOTIFICATIONS
+	bool "Superblock event notifications"
+	select WATCH_QUEUE
+	help
+	  This option provides support for receiving superblock event
+	  notifications.  This makes use of the /dev/watch_queue misc device to
+	  handle the notification buffer and provides the sb_notify() system
+	  call to enable/disable watches.
+
+	  Events can include things like changing between R/W and R/O, EIO
+	  generation, ENOSPC generation and EDQUOT generation.
+
 source "fs/quota/Kconfig"
 
 source "fs/autofs/Kconfig"
diff --git a/fs/super.c b/fs/super.c
index 61819e8e5469..991d69d9dbed 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,8 @@
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
+#include <linux/syscalls.h>
+#include <linux/namei.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
@@ -350,6 +352,10 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
+#ifdef CONFIG_SB_NOTIFICATIONS
+		if (s->s_watchers)
+			remove_watch_list(s->s_watchers);
+#endif
 		cleancache_invalidate_fs(s);
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
@@ -990,6 +996,8 @@ int reconfigure_super(struct fs_context *fc)
 	/* Needs to be ordered wrt mnt_is_readonly() */
 	smp_wmb();
 	sb->s_readonly_remount = 0;
+	notify_sb(sb, NOTIFY_SUPERBLOCK_READONLY,
+		  remount_ro ? WATCH_INFO_FLAG_0 : 0);
 
 	/*
 	 * Some filesystems modify their metadata via some other path than the
@@ -1808,3 +1816,110 @@ int thaw_super(struct super_block *sb)
 	return thaw_super_locked(sb);
 }
 EXPORT_SYMBOL(thaw_super);
+
+#ifdef CONFIG_SB_NOTIFICATIONS
+/*
+ * Post superblock notifications.
+ */
+void post_sb_notification(struct super_block *s, struct superblock_notification *n)
+{
+	post_watch_notification(s->s_watchers, &n->watch, current_cred(),
+				s->s_unique_id);
+}
+
+static void release_sb_watch(struct watch_list *wlist, struct watch *watch)
+{
+	struct super_block *s = watch->private;
+
+	put_super(s);
+}
+
+/**
+ * sys_sb_notify - Watch for superblock events.
+ * @dfd: Base directory to pathwalk from or fd referring to superblock.
+ * @filename: Path to superblock to place the watch upon
+ * @at_flags: Pathwalk control flags
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE5(sb_notify,
+		int, dfd,
+		const char __user *, filename,
+		unsigned int, at_flags,
+		int, watch_fd,
+		int, watch_id)
+{
+	struct watch_queue *wqueue;
+	struct super_block *s;
+	struct watch_list *wlist = NULL;
+	struct watch *watch;
+	struct path path;
+	int ret;
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+
+	ret = user_path_at(dfd, filename, at_flags, &path);
+	if (ret)
+		return ret;
+
+	wqueue = get_watch_queue(watch_fd);
+	if (IS_ERR(wqueue))
+		goto err_path;
+
+	s = path.dentry->d_sb;
+	if (watch_id >= 0) {
+		if (!s->s_watchers) {
+			wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+			if (!wlist)
+				goto err_wqueue;
+			INIT_HLIST_HEAD(&wlist->watchers);
+			spin_lock_init(&wlist->lock);
+			wlist->release_watch = release_sb_watch;
+		}
+
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wlist;
+
+		init_watch(watch, wqueue);
+		watch->id		= s->s_unique_id;
+		watch->private		= s;
+		watch->info_id		= (u32)watch_id << 24;
+
+		down_write(&s->s_umount);
+		ret = -EIO;
+		if (atomic_read(&s->s_active)) {
+			if (!s->s_watchers) {
+				s->s_watchers = wlist;
+				wlist = NULL;
+			}
+
+			ret = add_watch_to_object(watch, s->s_watchers);
+			if (ret == 0) {
+				spin_lock(&sb_lock);
+				s->s_count++;
+				spin_unlock(&sb_lock);
+			}
+		}
+		up_write(&s->s_umount);
+		if (ret < 0)
+			kfree(watch);
+	} else if (s->s_watchers) {
+		down_write(&s->s_umount);
+		ret = remove_watch_from_object(s->s_watchers, wqueue,
+					       s->s_unique_id, false);
+		up_write(&s->s_umount);
+	} else {
+		ret = -EBADSLT;
+	}
+
+err_wlist:
+	kfree(wlist);
+err_wqueue:
+	put_watch_queue(wqueue);
+err_path:
+	path_put(&path);
+	return ret;
+}
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f1c74596cd77..79ede28f54cc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/watch_queue.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1530,6 +1531,10 @@ struct super_block {
 
 	/* Superblock event notifications */
 	u64			s_unique_id;
+
+#ifdef CONFIG_SB_NOTIFICATIONS
+	struct watch_list	*s_watchers;
+#endif
 } __randomize_layout;
 
 /* Helper functions so that in most cases filesystems will
@@ -3530,4 +3535,76 @@ static inline struct sock *io_uring_get_socket(struct file *file)
 }
 #endif
 
+extern void post_sb_notification(struct super_block *, struct superblock_notification *);
+
+/**
+ * notify_sb: Post simple superblock notification.
+ * @s: The superblock the notification is about.
+ * @subtype: The type of notification.
+ * @info: WATCH_INFO_FLAG_* flags to be set in the record.
+ */
+static inline void notify_sb(struct super_block *s,
+			     enum superblock_notification_type subtype,
+			     u32 info)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	if (unlikely(s->s_watchers)) {
+		struct superblock_notification n = {
+			.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.watch.subtype	= subtype,
+			.watch.info	= sizeof(n) | info,
+			.sb_id		= s->s_unique_id,
+		};
+
+		post_sb_notification(s, &n);
+	}
+			     
+#endif
+}
+
+/**
+ * sb_error: Post superblock error notification.
+ * @s: The superblock the notification is about.
+ * @error: The error number to be recorded.
+ */
+static inline int sb_error(struct super_block *s, int error)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	if (unlikely(s->s_watchers)) {
+		struct superblock_error_notification n = {
+			.s.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.s.watch.subtype = NOTIFY_SUPERBLOCK_ERROR,
+			.s.watch.info	= sizeof(n),
+			.s.sb_id	= s->s_unique_id,
+			.error_number	= error,
+			.error_cookie	= 0,
+		};
+
+		post_sb_notification(s, &n.s);
+	}
+#endif
+	return error;
+}
+
+/**
+ * sb_EDQUOT: Post superblock quota overrun notification.
+ * @s: The superblock the notification is about.
+ */
+static inline int sb_EQDUOT(struct super_block *s)
+{
+#ifdef CONFIG_SB_NOTIFICATIONS
+	if (unlikely(s->s_watchers)) {
+		struct superblock_notification n = {
+			.watch.type	= WATCH_TYPE_SB_NOTIFY,
+			.watch.subtype	= NOTIFY_SUPERBLOCK_EDQUOT,
+			.watch.info	= sizeof(n),
+			.sb_id		= s->s_unique_id,
+		};
+
+		post_sb_notification(s, &n);
+	}
+#endif
+	return -EDQUOT;
+}
+
 #endif /* _LINUX_FS_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7c2b66175f3c..204a6dbcc34a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1003,6 +1003,8 @@ asmlinkage long sys_fsinfo(int dfd, const char __user *path,
 			   void __user *buffer, size_t buf_size);
 asmlinkage long sys_mount_notify(int dfd, const char __user *path,
 				 unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_sb_notify(int dfd, const char __user *path,
+			      unsigned int at_flags, int watch_fd, int watch_id);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 388b4141bcee..126afcc98cc6 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -128,4 +128,30 @@ struct mount_notification {
 	__u32	changed_mount;		/* The mount that got changed */
 };
 
+/*
+ * Type of superblock notification.
+ */
+enum superblock_notification_type {
+	NOTIFY_SUPERBLOCK_READONLY	= 0, /* Filesystem toggled between R/O and R/W */
+	NOTIFY_SUPERBLOCK_ERROR		= 1, /* Error in filesystem or blockdev */
+	NOTIFY_SUPERBLOCK_EDQUOT	= 2, /* EDQUOT notification */
+	NOTIFY_SUPERBLOCK_NETWORK	= 3, /* Network status change */
+};
+
+/*
+ * Superblock notification record.
+ * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
+ * - watch.subtype = enum superblock_notification_subtype
+ */
+struct superblock_notification {
+	struct watch_notification watch; /* WATCH_TYPE_SB_NOTIFY */
+	__u64	sb_id;			/* 64-bit superblock ID [fsinfo_ids::f_sb_id] */
+};
+
+struct superblock_error_notification {
+	struct superblock_notification s; /* subtype = notify_superblock_error */
+	__u32	error_number;
+	__u32	error_cookie;
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 97b025e7863c..565d1e3d1bed 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -108,6 +108,9 @@ COND_SYSCALL(quotactl);
 
 /* fs/read_write.c */
 
+/* fs/sb_notify.c */
+COND_SYSCALL(sb_notify);
+
 /* fs/sendfile.c */
 
 /* fs/select.c */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 5/7] fsinfo: Export superblock notification counter
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (3 preceding siblings ...)
  2019-05-28 16:02 ` [PATCH 4/7] vfs: Add superblock notifications David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 16:02 ` [PATCH 6/7] block: Add block layer notifications David Howells
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Provide an fsinfo attribute to export the superblock notification counter
so that it can be polled in the case of a notification buffer overrun.
This is accessed with:

	struct fsinfo_params params = {
		.request = FSINFO_ATTR_SB_NOTIFICATIONS,
	};

and returns a structure that looks like:

	struct fsinfo_sb_notifications {
		__u64	watch_id;
		__u32	notify_counter;
		__u32	__reserved[1];
	};

Where watch_id is a number uniquely identifying the superblock in
notification records and notify_counter is incremented for each
superblock notification posted.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fsinfo.c                      |   12 ++++++++++++
 fs/super.c                       |    1 +
 include/linux/fs.h               |    1 +
 include/uapi/linux/fsinfo.h      |   10 ++++++++++
 include/uapi/linux/watch_queue.h |    2 +-
 samples/vfs/test-fsinfo.c        |   13 +++++++++++++
 6 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/fsinfo.c b/fs/fsinfo.c
index 3ec64d3cba08..1456e26d2f7c 100644
--- a/fs/fsinfo.c
+++ b/fs/fsinfo.c
@@ -284,6 +284,16 @@ static int fsinfo_generic_param_enum(struct file_system_type *f,
 	return sizeof(*p);
 }
 
+static int fsinfo_generic_sb_notifications(struct path *path,
+					   struct fsinfo_sb_notifications *p)
+{
+	struct super_block *sb = path->dentry->d_sb;
+
+	p->watch_id		= sb->s_unique_id;
+	p->notify_counter	= atomic_read(&sb->s_notify_counter);
+	return sizeof(*p);
+}
+
 static void fsinfo_insert_sb_flag_parameters(struct path *path,
 					     struct fsinfo_kparams *params)
 {
@@ -331,6 +341,7 @@ int generic_fsinfo(struct path *path, struct fsinfo_kparams *params)
 	case _genp(MOUNT_DEVNAME,	mount_devname);
 	case _genp(MOUNT_CHILDREN,	mount_children);
 	case _genp(MOUNT_SUBMOUNT,	mount_submount);
+	case _gen(SB_NOTIFICATIONS,	sb_notifications);
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -606,6 +617,7 @@ static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
 	FSINFO_STRING_N		(SERVER_NAME,		server_name),
 	FSINFO_STRUCT_NM	(SERVER_ADDRESS,	server_address),
 	FSINFO_STRING		(CELL_NAME,		cell_name),
+	FSINFO_STRUCT		(SB_NOTIFICATIONS,	sb_notifications),
 };
 
 /**
diff --git a/fs/super.c b/fs/super.c
index 991d69d9dbed..c4bd0d131ef2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1823,6 +1823,7 @@ EXPORT_SYMBOL(thaw_super);
  */
 void post_sb_notification(struct super_block *s, struct superblock_notification *n)
 {
+	atomic_inc(&s->s_notify_counter);
 	post_watch_notification(s->s_watchers, &n->watch, current_cred(),
 				s->s_unique_id);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79ede28f54cc..2c00e292b92b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1535,6 +1535,7 @@ struct super_block {
 #ifdef CONFIG_SB_NOTIFICATIONS
 	struct watch_list	*s_watchers;
 #endif
+	atomic_t		s_notify_counter;
 } __randomize_layout;
 
 /* Helper functions so that in most cases filesystems will
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index 7247088332c2..b4c9446305bb 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -39,6 +39,7 @@ enum fsinfo_attribute {
 	FSINFO_ATTR_SERVER_NAME		= 21,	/* Name of the Nth server (string) */
 	FSINFO_ATTR_SERVER_ADDRESS	= 22,	/* Mth address of the Nth server */
 	FSINFO_ATTR_CELL_NAME		= 23,	/* Cell name (string) */
+	FSINFO_ATTR_SB_NOTIFICATIONS	= 24,	/* sb_notify() information */
 	FSINFO_ATTR__NR
 };
 
@@ -308,4 +309,13 @@ struct fsinfo_server_address {
 	struct __kernel_sockaddr_storage address;
 };
 
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SB_NOTIFICATIONS).
+ */
+struct fsinfo_sb_notifications {
+	__u64		watch_id;	/* Watch ID for superblock. */
+	__u32		notify_counter;	/* Number of notifications. */
+	__u32		__reserved[1];
+};
+
 #endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 126afcc98cc6..3b5770889bba 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -145,7 +145,7 @@ enum superblock_notification_type {
  */
 struct superblock_notification {
 	struct watch_notification watch; /* WATCH_TYPE_SB_NOTIFY */
-	__u64	sb_id;			/* 64-bit superblock ID [fsinfo_ids::f_sb_id] */
+	__u64	sb_id;		/* 64-bit superblock ID [fsinfo_sb_notifications::watch_id] */
 };
 
 struct superblock_error_notification {
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index af29da74559e..0f8f9ded0925 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -90,6 +90,7 @@ static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
 	FSINFO_STRING_N		(SERVER_NAME,		server_name),
 	FSINFO_STRUCT_NM	(SERVER_ADDRESS,	server_address),
 	FSINFO_STRING		(CELL_NAME,		cell_name),
+	FSINFO_STRUCT		(SB_NOTIFICATIONS,	sb_notifications),
 };
 
 #define FSINFO_NAME(X,Y) [FSINFO_ATTR_##X] = #Y
@@ -118,6 +119,7 @@ static const char *fsinfo_attr_names[FSINFO_ATTR__NR] = {
 	FSINFO_NAME		(SERVER_NAME,		server_name),
 	FSINFO_NAME		(SERVER_ADDRESS,	server_address),
 	FSINFO_NAME		(CELL_NAME,		cell_name),
+	FSINFO_NAME		(SB_NOTIFICATIONS,	sb_notifications),
 };
 
 union reply {
@@ -133,6 +135,7 @@ union reply {
 	struct fsinfo_mount_info mount_info;
 	struct fsinfo_mount_child mount_children[1];
 	struct fsinfo_server_address srv_addr;
+	struct fsinfo_sb_notifications sb_notifications;
 };
 
 static void dump_hex(unsigned int *data, int from, int to)
@@ -377,6 +380,15 @@ static void dump_attr_MOUNT_CHILDREN(union reply *r, int size)
 		printf("\t[%u] %8x %8x\n", i++, f->mnt_id, f->notify_counter);
 }
 
+static void dump_attr_SB_NOTIFICATIONS(union reply *r, int size)
+{
+	struct fsinfo_sb_notifications *f = &r->sb_notifications;
+
+	printf("\n");
+	printf("\twatch_id: %llx\n", (unsigned long long)f->watch_id);
+	printf("\tnotifs  : %llx\n", (unsigned long long)f->notify_counter);
+}
+
 /*
  *
  */
@@ -395,6 +407,7 @@ static const dumper_t fsinfo_attr_dumper[FSINFO_ATTR__NR] = {
 	FSINFO_DUMPER(MOUNT_INFO),
 	FSINFO_DUMPER(MOUNT_CHILDREN),
 	FSINFO_DUMPER(SERVER_ADDRESS),
+	FSINFO_DUMPER(SB_NOTIFICATIONS),
 };
 
 static void dump_fsinfo(enum fsinfo_attribute attr,


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 6/7] block: Add block layer notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (4 preceding siblings ...)
  2019-05-28 16:02 ` [PATCH 5/7] fsinfo: Export superblock notification counter David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 20:37   ` Jann Horn
  2019-05-28 16:02 ` [PATCH 7/7] Add sample notification program David Howells
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

Add a block layer notification mechanism whereby notifications about
block-layer events such as I/O errors, can be reported to a monitoring
process asynchronously.

Firstly, an event queue needs to be created:

	fd = open("/dev/event_queue", O_RDWR);
	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, page_size << n);

then a notification can be set up to report block notifications via that
queue:

	struct watch_notification_filter filter = {
		.nr_filters = 1,
		.filters = {
			[0] = {
				.type = WATCH_TYPE_BLOCK_NOTIFY,
				.subtype_filter[0] = UINT_MAX;
			},
		},
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	block_notify(fd, 12);

After that, records will be placed into the queue when, for example, errors
occur on a block device.  Records are of the following format:

	struct block_notification {
		struct watch_notification watch;
		__u64	dev;
		__u64	sector;
	} *n;

Where:

	n->watch.type will be WATCH_TYPE_BLOCK_NOTIFY

	n->watch.subtype will be the type of notification, such as
	NOTIFY_BLOCK_ERROR_CRITICAL_MEDIUM.

	n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
	record.

	n->watch.info & WATCH_INFO_ID will be the second argument to
	block_notify(), shifted.

	n->dev will be the device numbers munged together.

	n->sector will indicate the affected sector (if appropriate for the
	event).

Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 block/Kconfig                          |    9 +++
 block/Makefile                         |    1 
 block/blk-core.c                       |   28 +++++++++++
 block/blk-notify.c                     |   83 ++++++++++++++++++++++++++++++++
 include/linux/blkdev.h                 |   10 ++++
 include/linux/syscalls.h               |    1 
 include/uapi/linux/watch_queue.h       |   28 +++++++++++
 9 files changed, 162 insertions(+)
 create mode 100644 block/blk-notify.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 429416ce60e1..22793f77c5f1 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -441,3 +441,4 @@
 434	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
 435	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
 436	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
+437	i386	block_notify		sys_block_notify		__ia32_sys_block_notify
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4ae146e472db..3f0b82272a9f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -358,6 +358,7 @@
 434	common	fsinfo			__x64_sys_fsinfo
 435	common	mount_notify		__x64_sys_mount_notify
 436	common	sb_notify		__x64_sys_sb_notify
+437	common	block_notify		__x64_sys_block_notify
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/block/Kconfig b/block/Kconfig
index 1b220101a9cb..3b0a0ddb83ef 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -163,6 +163,15 @@ config BLK_SED_OPAL
 	Enabling this option enables users to setup/unlock/lock
 	Locking ranges for SED devices using the Opal protocol.
 
+config BLK_NOTIFICATIONS
+	bool "Block layer event notifications"
+	select WATCH_QUEUE
+	help
+	  This option provides support for getting block layer event
+	  notifications.  This makes use of the /dev/watch_queue misc device to
+	  handle the notification buffer and provides the block_notify() system
+	  call to enable/disable watches.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/Makefile b/block/Makefile
index eee1b4ceecf9..2dca6273f8f3 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
 obj-$(CONFIG_BLK_PM)		+= blk-pm.o
+obj-$(CONFIG_BLK_NOTIFICATIONS)	+= blk-notify.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 419d600e6637..8325e33f0bcc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -144,6 +144,21 @@ static const struct {
 	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
 };
 
+#ifdef CONFIG_BLK_NOTIFICATIONS
+static const enum block_notification_type blk_notifications[] = {
+	[BLK_STS_TIMEOUT]	= NOTIFY_BLOCK_ERROR_TIMEOUT,
+	[BLK_STS_NOSPC]		= NOTIFY_BLOCK_ERROR_NO_SPACE,
+	[BLK_STS_TRANSPORT]	= NOTIFY_BLOCK_ERROR_RECOVERABLE_TRANSPORT,
+	[BLK_STS_TARGET]	= NOTIFY_BLOCK_ERROR_CRITICAL_TARGET,
+	[BLK_STS_NEXUS]		= NOTIFY_BLOCK_ERROR_CRITICAL_NEXUS,
+	[BLK_STS_MEDIUM]	= NOTIFY_BLOCK_ERROR_CRITICAL_MEDIUM,
+	[BLK_STS_PROTECTION]	= NOTIFY_BLOCK_ERROR_PROTECTION,
+	[BLK_STS_RESOURCE]	= NOTIFY_BLOCK_ERROR_KERNEL_RESOURCE,
+	[BLK_STS_DEV_RESOURCE]	= NOTIFY_BLOCK_ERROR_DEVICE_RESOURCE,
+	[BLK_STS_IOERR]		= NOTIFY_BLOCK_ERROR_IO,
+};
+#endif
+
 blk_status_t errno_to_blk_status(int errno)
 {
 	int i;
@@ -179,6 +194,19 @@ static void print_req_error(struct request *req, blk_status_t status)
 				req->rq_disk ?  req->rq_disk->disk_name : "?",
 				(unsigned long long)blk_rq_pos(req),
 				req->cmd_flags);
+
+#ifdef CONFIG_BLK_NOTIFICATIONS
+	if (blk_notifications[idx]) {
+		struct block_notification n = {
+			.watch.type	= WATCH_TYPE_BLOCK_NOTIFY,
+			.watch.subtype	= blk_notifications[idx],
+			.watch.info	= sizeof(n),
+			.dev		= req->rq_disk ? disk_devt(req->rq_disk) : 0,
+			.sector		= blk_rq_pos(req),
+		};
+		post_block_notification(&n);
+	}
+#endif
 }
 
 static void req_bio_endio(struct request *rq, struct bio *bio,
diff --git a/block/blk-notify.c b/block/blk-notify.c
new file mode 100644
index 000000000000..b310aaf37e7c
--- /dev/null
+++ b/block/blk-notify.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Block layer event notifications.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/blkdev.h>
+#include <linux/watch_queue.h>
+#include <linux/syscalls.h>
+#include <linux/init_task.h>
+
+/*
+ * Global queue for watching for block layer events.
+ */
+static struct watch_list blk_watchers = {
+	.watchers	= HLIST_HEAD_INIT,
+	.lock		= __SPIN_LOCK_UNLOCKED(&blk_watchers.lock),
+};
+
+static DEFINE_SPINLOCK(blk_watchers_lock);
+
+/*
+ * Post superblock notifications.
+ *
+ * Note that there's only a global queue to which all events are posted.  Might
+ * want to provide per-dev queues also.
+ */
+void post_block_notification(struct block_notification *n)
+{
+	u64 id = 0; /* Might want to allow dev# here. */
+
+	post_watch_notification(&blk_watchers, &n->watch, &init_cred, id);
+}
+
+/**
+ * sys_block_notify - Watch for superblock events.
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE2(block_notify, int, watch_fd, int, watch_id)
+{
+	struct watch_queue *wqueue;
+	struct watch_list *wlist = &blk_watchers;
+	struct watch *watch;
+	long ret = -ENOMEM;
+	u64 id = 0; /* Might want to allow dev# here. */
+
+	if (watch_id < -1 || watch_id > 0xff)
+		return -EINVAL;
+
+	wqueue = get_watch_queue(watch_fd);
+	if (IS_ERR(wqueue)) {
+		ret = PTR_ERR(wqueue);
+		goto err;
+	}
+
+	if (watch_id >= 0) {
+		watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+		if (!watch)
+			goto err_wqueue;
+
+		init_watch(watch, wqueue);
+		watch->id	= id;
+		watch->info_id	= (u32)watch_id << WATCH_INFO_ID__SHIFT;
+
+		spin_lock(&blk_watchers_lock);
+		ret = add_watch_to_object(watch, wlist);
+		spin_unlock(&blk_watchers_lock);
+		if (ret < 0)
+			kfree(watch);
+	} else {
+		spin_lock(&blk_watchers_lock);
+		ret = remove_watch_from_object(wlist, wqueue, id, false);
+		spin_unlock(&blk_watchers_lock);
+	}
+
+err_wqueue:
+	put_watch_queue(wqueue);
+err:
+	return ret;
+}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1aafeb923e7b..c28f8647a76d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -43,6 +43,7 @@ struct pr_ops;
 struct rq_qos;
 struct blk_queue_stats;
 struct blk_stat_callback;
+struct block_notification;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -1744,6 +1745,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 }
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+#ifdef CONFIG_BLK_NOTIFICATIONS
+extern void post_block_notification(struct block_notification *n);
+#else
+static inline void post_block_notification(struct block_notification *n)
+{
+}
+#endif
+
+
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 204a6dbcc34a..77a9d84f1fbd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,7 @@ asmlinkage long sys_mount_notify(int dfd, const char __user *path,
 				 unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_sb_notify(int dfd, const char __user *path,
 			      unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_block_notify(int watch_fd, int watch_id);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 3b5770889bba..fad276ffa2d0 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -44,6 +44,7 @@ struct watch_notification {
 #define WATCH_INFO_FLAG_6	0x00400000
 #define WATCH_INFO_FLAG_7	0x00800000
 #define WATCH_INFO_ID		0xff000000	/* ID of watchpoint */
+#define WATCH_INFO_ID__SHIFT	24
 };
 
 #define WATCH_LENGTH_SHIFT	3
@@ -154,4 +155,31 @@ struct superblock_error_notification {
 	__u32	error_cookie;
 };
 
+/*
+ * Type of block layer notification.
+ */
+enum block_notification_type {
+	NOTIFY_BLOCK_ERROR_TIMEOUT		= 1, /* Timeout error */
+	NOTIFY_BLOCK_ERROR_NO_SPACE		= 2, /* Critical space allocation error */
+	NOTIFY_BLOCK_ERROR_RECOVERABLE_TRANSPORT = 3, /* Recoverable transport error */
+	NOTIFY_BLOCK_ERROR_CRITICAL_TARGET	= 4, /* Critical target error */
+	NOTIFY_BLOCK_ERROR_CRITICAL_NEXUS	= 5, /* Critical nexus error */
+	NOTIFY_BLOCK_ERROR_CRITICAL_MEDIUM	= 6, /* Critical medium error */
+	NOTIFY_BLOCK_ERROR_PROTECTION		= 7, /* Protection error */
+	NOTIFY_BLOCK_ERROR_KERNEL_RESOURCE	= 8, /* Kernel resource error */
+	NOTIFY_BLOCK_ERROR_DEVICE_RESOURCE	= 9, /* Device resource error */
+	NOTIFY_BLOCK_ERROR_IO			= 10, /* Other I/O error */
+};
+
+/*
+ * Block notification record.
+ * - watch.type = WATCH_TYPE_BLOCK_NOTIFY
+ * - watch.subtype = enum block_notification_type
+ */
+struct block_notification {
+	struct watch_notification watch; /* WATCH_TYPE_SB_NOTIFY */
+	__u64	dev;			/* Device number */
+	__u64	sector;			/* Affected sector */
+};
+
 #endif /* _UAPI_LINUX_WATCH_QUEUE_H */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 7/7] Add sample notification program
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (5 preceding siblings ...)
  2019-05-28 16:02 ` [PATCH 6/7] block: Add block layer notifications David Howells
@ 2019-05-28 16:02 ` David Howells
  2019-05-28 23:58 ` [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications Greg KH
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 16:02 UTC (permalink / raw)
  To: viro
  Cc: dhowells, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

This needs to be linked with -lkeyutils.

It is run like:

	./watch_test

and watches "/" for mount changes and the current session keyring for key
changes:

	# keyctl add user a a @s
	1035096409
	# keyctl unlink 1035096409 @s
	# mount -t tmpfs none /mnt/nfsv3tcp/
	# umount /mnt/nfsv3tcp

producing:

	# ./watch_test
	ptrs h=4 t=2 m=20003
	NOTIFY[00000004-00000002] ty=0003 sy=0002 i=01000010
	KEY 2ffc2e5d change=2[linked] aux=1035096409
	ptrs h=6 t=4 m=20003
	NOTIFY[00000006-00000004] ty=0003 sy=0003 i=01000010
	KEY 2ffc2e5d change=3[unlinked] aux=1035096409
	ptrs h=8 t=6 m=20003
	NOTIFY[00000008-00000006] ty=0001 sy=0000 i=02000010
	MOUNT 00000013 change=0[new_mount] aux=168
	ptrs h=a t=8 m=20003
	NOTIFY[0000000a-00000008] ty=0001 sy=0001 i=02000010
	MOUNT 00000013 change=1[unmount] aux=168

Other events may be produced, such as with a failing disk:

	ptrs h=5 t=2 m=6000004
	NOTIFY[00000005-00000002] ty=0004 sy=0006 i=04000018
	BLOCK 00800050 e=6[critical medium] s=5be8

This corresponds to:

	print_req_error: critical medium error, dev sdf, sector 23528 flags 0

in dmesg.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 samples/Kconfig                  |    6 +
 samples/Makefile                 |    1 
 samples/watch_queue/Makefile     |    9 +
 samples/watch_queue/watch_test.c |  284 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 300 insertions(+)
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

diff --git a/samples/Kconfig b/samples/Kconfig
index 0561a94f6fdb..a2b7a7babee5 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -160,4 +160,10 @@ config SAMPLE_VFS
 	  as mount API and statx().  Note that this is restricted to the x86
 	  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_WATCH_QUEUE
+	bool "Build example /dev/watch_queue notification consumer"
+	help
+	  Build example userspace program to use the new mount_notify(),
+	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
+
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index debf8925f06f..ed3b8bab6e9b 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_SAMPLE_TRACE_PRINTK)	+= trace_printk/
 obj-$(CONFIG_VIDEO_PCI_SKELETON)	+= v4l/
 obj-y					+= vfio-mdev/
 subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
+subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
diff --git a/samples/watch_queue/Makefile b/samples/watch_queue/Makefile
new file mode 100644
index 000000000000..42b694430d0f
--- /dev/null
+++ b/samples/watch_queue/Makefile
@@ -0,0 +1,9 @@
+# List of programs to build
+hostprogs-y := watch_test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_watch_test.o += -I$(objtree)/usr/include
+
+HOSTLOADLIBES_watch_test += -lkeyutils
diff --git a/samples/watch_queue/watch_test.c b/samples/watch_queue/watch_test.c
new file mode 100644
index 000000000000..0bbab492e237
--- /dev/null
+++ b/samples/watch_queue/watch_test.c
@@ -0,0 +1,284 @@
+/* Use /dev/watch_queue to watch for keyring and mount topology changes.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdbool.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
+#include <limits.h>
+#include <linux/watch_queue.h>
+#include <linux/unistd.h>
+#include <linux/keyctl.h>
+
+#ifndef __NR_mount_notify
+#define __NR_mount_notify -1
+#endif
+#ifndef __NR_sb_notify
+#define __NR_sb_notify -1
+#endif
+#ifndef __NR_block_notify
+#define __NR_block_notify -1
+#endif
+#ifndef KEYCTL_WATCH_KEY
+#define KEYCTL_WATCH_KEY -1
+#endif
+
+#define BUF_SIZE 4
+
+static const char *key_subtypes[256] = {
+	[NOTIFY_KEY_INSTANTIATED]	= "instantiated",
+	[NOTIFY_KEY_UPDATED]		= "updated",
+	[NOTIFY_KEY_LINKED]		= "linked",
+	[NOTIFY_KEY_UNLINKED]		= "unlinked",
+	[NOTIFY_KEY_CLEARED]		= "cleared",
+	[NOTIFY_KEY_REVOKED]		= "revoked",
+	[NOTIFY_KEY_INVALIDATED]	= "invalidated",
+	[NOTIFY_KEY_SETATTR]		= "setattr",
+};
+
+static void saw_key_change(struct watch_notification *n)
+{
+	struct key_notification *k = (struct key_notification *)n;
+	unsigned int len = n->info & WATCH_INFO_LENGTH;
+
+	if (len != sizeof(struct key_notification))
+		return;
+
+	printf("KEY %08x change=%u[%s] aux=%u\n",
+	       k->key_id, n->subtype, key_subtypes[n->subtype], k->aux);
+}
+
+static const char *mount_subtypes[256] = {
+	[NOTIFY_MOUNT_NEW_MOUNT]	= "new_mount",
+	[NOTIFY_MOUNT_UNMOUNT]		= "unmount",
+	[NOTIFY_MOUNT_EXPIRY]		= "expiry",
+	[NOTIFY_MOUNT_READONLY]		= "readonly",
+	[NOTIFY_MOUNT_SETATTR]		= "setattr",
+	[NOTIFY_MOUNT_MOVE_FROM]	= "move_from",
+	[NOTIFY_MOUNT_MOVE_TO]		= "move_to",
+};
+
+static long keyctl_watch_key(int key, int watch_fd, int watch_id)
+{
+	return syscall(__NR_keyctl, KEYCTL_WATCH_KEY, key, watch_fd, watch_id);
+}
+
+static void saw_mount_change(struct watch_notification *n)
+{
+	struct mount_notification *m = (struct mount_notification *)n;
+	unsigned int len = n->info & WATCH_INFO_LENGTH;
+
+	if (len != sizeof(struct mount_notification))
+		return;
+
+	printf("MOUNT %08x change=%u[%s] aux=%u\n",
+	       m->triggered_on, n->subtype, mount_subtypes[n->subtype], m->changed_mount);
+}
+
+static const char *super_subtypes[256] = {
+	[NOTIFY_SUPERBLOCK_READONLY]	= "readonly",
+	[NOTIFY_SUPERBLOCK_ERROR]	= "error",
+	[NOTIFY_SUPERBLOCK_EDQUOT]	= "edquot",
+	[NOTIFY_SUPERBLOCK_NETWORK]	= "network",
+};
+
+static void saw_super_change(struct watch_notification *n)
+{
+	struct superblock_notification *s = (struct superblock_notification *)n;
+	unsigned int len = n->info & WATCH_INFO_LENGTH;
+
+	if (len < sizeof(struct superblock_notification))
+		return;
+
+	printf("SUPER %08llx change=%u[%s]\n",
+	       s->sb_id, n->subtype, super_subtypes[n->subtype]);
+}
+
+static const char *block_subtypes[256] = {
+	[NOTIFY_BLOCK_ERROR_TIMEOUT]			= "timeout",
+	[NOTIFY_BLOCK_ERROR_NO_SPACE]			= "critical space allocation",
+	[NOTIFY_BLOCK_ERROR_RECOVERABLE_TRANSPORT]	= "recoverable transport",
+	[NOTIFY_BLOCK_ERROR_CRITICAL_TARGET]		= "critical target",
+	[NOTIFY_BLOCK_ERROR_CRITICAL_NEXUS]		= "critical nexus",
+	[NOTIFY_BLOCK_ERROR_CRITICAL_MEDIUM]		= "critical medium",
+	[NOTIFY_BLOCK_ERROR_PROTECTION]			= "protection",
+	[NOTIFY_BLOCK_ERROR_KERNEL_RESOURCE]		= "kernel resource",
+	[NOTIFY_BLOCK_ERROR_DEVICE_RESOURCE]		= "device resource",
+	[NOTIFY_BLOCK_ERROR_IO]				= "I/O",
+};
+
+static void saw_block_change(struct watch_notification *n)
+{
+	struct block_notification *b = (struct block_notification *)n;
+	unsigned int len = n->info & WATCH_INFO_LENGTH;
+
+	if (len < sizeof(struct block_notification))
+		return;
+
+	printf("BLOCK %08llx e=%u[%s] s=%llx\n",
+	       (unsigned long long)b->dev,
+	       n->subtype, block_subtypes[n->subtype],
+	       (unsigned long long)b->sector);
+}
+
+/*
+ * Consume and display events.
+ */
+static int consumer(int fd, struct watch_queue_buffer *buf)
+{
+	struct watch_notification *n;
+	struct pollfd p[1];
+	unsigned int head, tail, mask = buf->meta.mask;
+
+	for (;;) {
+		p[0].fd = fd;
+		p[0].events = POLLIN | POLLERR;
+		p[0].revents = 0;
+
+		if (poll(p, 1, -1) == -1) {
+			perror("poll");
+			break;
+		}
+
+		printf("ptrs h=%x t=%x m=%x\n",
+		       buf->meta.head, buf->meta.tail, buf->meta.mask);
+
+		while (head = buf->meta.head,
+		       tail = buf->meta.tail,
+		       tail != head
+		       ) {
+			asm ("lfence" : : : "memory" );
+			n = &buf->slots[tail & mask];
+			printf("NOTIFY[%08x-%08x] ty=%04x sy=%04x i=%08x\n",
+			       head, tail, n->type, n->subtype, n->info);
+			if ((n->info & WATCH_INFO_LENGTH) == 0)
+				goto out;
+
+			switch (n->type) {
+			case WATCH_TYPE_META:
+				if (n->subtype == WATCH_META_REMOVAL_NOTIFICATION)
+					printf("REMOVAL of watchpoint %08x\n",
+					       n->info & WATCH_INFO_ID);
+				break;
+			case WATCH_TYPE_MOUNT_NOTIFY:
+				saw_mount_change(n);
+				break;
+			case WATCH_TYPE_SB_NOTIFY:
+				saw_super_change(n);
+				break;
+			case WATCH_TYPE_KEY_NOTIFY:
+				saw_key_change(n);
+				break;
+			case WATCH_TYPE_BLOCK_NOTIFY:
+				saw_block_change(n);
+				break;
+			}
+
+			tail += (n->info & WATCH_INFO_LENGTH) >> WATCH_LENGTH_SHIFT;
+			asm("mfence" ::: "memory");
+			buf->meta.tail = tail;
+		}
+	}
+
+out:
+	return 0;
+}
+
+static struct watch_notification_filter filter = {
+	.nr_filters	= 4,
+	.__reserved	= 0,
+	.filters = {
+		[0] = {
+			.type			= WATCH_TYPE_MOUNT_NOTIFY,
+			// Reject move-from notifications
+			.subtype_filter[0]	= UINT_MAX & ~(1 << NOTIFY_MOUNT_MOVE_FROM),
+		},
+		[1]	= {
+			.type			= WATCH_TYPE_SB_NOTIFY,
+			// Only accept notification of changes to R/O state
+			.subtype_filter[0]	= (1 << NOTIFY_SUPERBLOCK_READONLY),
+			// Only accept notifications of change-to-R/O
+			.info_mask		= WATCH_INFO_FLAG_0,
+			.info_filter		= WATCH_INFO_FLAG_0,
+		},
+		[2]	= {
+			.type			= WATCH_TYPE_KEY_NOTIFY,
+			.subtype_filter[0]	= UINT_MAX,
+		},
+		[3]	= {
+			.type			= WATCH_TYPE_BLOCK_NOTIFY,
+			.subtype_filter[0]	= UINT_MAX,
+		},
+	},
+};
+
+int main(int argc, char **argv)
+{
+	struct watch_queue_buffer *buf;
+	size_t page_size;
+	int fd;
+
+	fd = open("/dev/watch_queue", O_RDWR);
+	if (fd == -1) {
+		perror("/dev/watch_queue");
+		exit(1);
+	}
+
+	if (ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE) == -1) {
+		perror("/dev/watch_queue(size)");
+		exit(1);
+	}
+
+	if (ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter) == -1) {
+		perror("/dev/watch_queue(filter)");
+		exit(1);
+	}
+
+	page_size = sysconf(_SC_PAGESIZE);
+	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, fd, 0);
+	if (buf == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	if (keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01) == -1) {
+		perror("keyctl");
+		exit(1);
+	}
+
+	if (syscall(__NR_mount_notify, AT_FDCWD, "/", 0, fd, 0x02) == -1) {
+		perror("mount_notify");
+		exit(1);
+	}
+
+	if (syscall(__NR_sb_notify, AT_FDCWD, "/mnt", 0, fd, 0x03) == -1) {
+		perror("sb_notify");
+		exit(1);
+	}
+
+	if (syscall(__NR_block_notify, fd, 0x04) == -1) {
+		perror("block_notify");
+		exit(1);
+	}
+
+	return consumer(fd, buf);
+}


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
@ 2019-05-28 16:26   ` Greg KH
  2019-05-28 17:30   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Greg KH @ 2019-05-28 16:26 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

On Tue, May 28, 2019 at 05:01:55PM +0100, David Howells wrote:
> Implement a misc device that implements a general notification queue as a
> ring buffer that can be mmap()'d from userspace.

"general" but just for filesystems, right?  :(

> Each entry has a 1-slot header that describes it:
> 
> 	struct watch_notification {
> 		__u32	type:24;
> 		__u32	subtype:8;
> 		__u32	info;
> 	};

This doesn't match the structure definition in the documentation, so
something is out of sync.

> The type indicates the source (eg. mount tree changes, superblock events,
> keyring changes, block layer events) and the subtype indicates the event
> type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
> indicates a number of things, including the entry length, an ID assigned to
> a watchpoint contributing to this buffer, type-specific flags and meta
> flags, such as an overrun indicator.
> 
> Supplementary data, such as the key ID that generated an event, are
> attached in additional slots.

I'm all for a "generic" event system for the kernel (heck, Solaris has
had one for decades), but it keeps getting shot down every time it comes
up.  What is different about this one?

> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -4,6 +4,19 @@
>  
>  menu "Misc devices"
>  
> +config WATCH_QUEUE
> +	bool "Mappable notification queue"
> +	default n

Nit, not needed.

> +	depends on MMU
> +	help
> +	  This is a general notification queue for the kernel to pass events to
> +	  userspace through a mmap()'able ring buffer.  It can be used in
> +	  conjunction with watches for mount topology change notifications,
> +	  superblock change notifications and key/keyring change notifications.
> +
> +	  Note that in theory this should work fine with NOMMU, but I'm not
> +	  sure how to make that work.
> +
>  config SENSORS_LIS3LV02D
>  	tristate
>  	depends on INPUT
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index b9affcdaa3d6..bf16acd9f8cc 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -3,6 +3,7 @@
>  # Makefile for misc devices that really don't fit anywhere else.
>  #
>  
> +obj-$(CONFIG_WATCH_QUEUE)	+= watch_queue.o
>  obj-$(CONFIG_IBM_ASM)		+= ibmasm/
>  obj-$(CONFIG_IBMVMC)		+= ibmvmc.o
>  obj-$(CONFIG_AD525X_DPOT)	+= ad525x_dpot.o
> diff --git a/drivers/misc/watch_queue.c b/drivers/misc/watch_queue.c
> new file mode 100644
> index 000000000000..39a09ea15d97
> --- /dev/null
> +++ b/drivers/misc/watch_queue.c
> @@ -0,0 +1,877 @@
> +/* User-mappable watch queue
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.

You didn't touch the code this year?

> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.

Please drop the boiler plate text and use a SPDX tag, checkpatch should
have caught this.  I don't want to have to go and change it again.

> + *
> + * See Documentation/watch_queue.rst
> + */
> +
> +#define pr_fmt(fmt) "watchq: " fmt
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/printk.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/pagemap.h>
> +#include <linux/poll.h>
> +#include <linux/uaccess.h>
> +#include <linux/vmalloc.h>
> +#include <linux/file.h>
> +#include <linux/security.h>
> +#include <linux/cred.h>
> +#include <linux/watch_queue.h>
> +
> +#define DEBUG_WITH_WRITE /* Allow use of write() to record notifications */

debugging code left in?

> +
> +MODULE_DESCRIPTION("Watch queue");
> +MODULE_AUTHOR("Red Hat, Inc.");
> +MODULE_LICENSE("GPL");
> +
> +struct watch_type_filter {
> +	enum watch_notification_type type;
> +	__u32		subtype_filter[1];	/* Bitmask of subtypes to filter on */
> +	__u32		info_filter;		/* Filter on watch_notification::info */
> +	__u32		info_mask;		/* Mask of relevant bits in info_filter */
> +};
> +
> +struct watch_filter {
> +	union {
> +		struct rcu_head	rcu;
> +		unsigned long	type_filter[2];	/* Bitmask of accepted types */
> +	};
> +	u32		nr_filters;		/* Number of filters */
> +	struct watch_type_filter filters[];
> +};
> +
> +struct watch_queue {
> +	struct rcu_head		rcu;
> +	struct address_space	mapping;
> +	const struct cred	*cred;		/* Creds of the owner of the queue */
> +	struct watch_filter __rcu *filter;
> +	wait_queue_head_t	waiters;
> +	struct hlist_head	watches;	/* Contributory watches */
> +	refcount_t		usage;

Usage of what, this structure?  Or something else?

> +	spinlock_t		lock;
> +	bool			defunct;	/* T when queues closed */
> +	u8			nr_pages;	/* Size of pages[] */
> +	u8			flag_next;	/* Flag to apply to next item */
> +#ifdef DEBUG_WITH_WRITE
> +	u8			debug;
> +#endif
> +	u32			size;
> +	struct watch_queue_buffer *buffer;	/* Pointer to first record */
> +
> +	/* The mappable pages.  The zeroth page holds the ring pointers. */
> +	struct page		**pages;
> +};


> +EXPORT_SYMBOL(__post_watch_notification);

_GPL for new apis?  (I have to ask...)

> +static long watch_queue_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct watch_queue *wqueue = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	long ret;
> +
> +	switch (cmd) {
> +	case IOC_WATCH_QUEUE_SET_SIZE:
> +		if (wqueue->buffer)
> +			return -EBUSY;
> +		inode_lock(inode);
> +		ret = watch_queue_set_size(wqueue, arg);
> +		inode_unlock(inode);
> +		return ret;
> +
> +	case IOC_WATCH_QUEUE_SET_FILTER:
> +		inode_lock(inode);
> +		ret = watch_queue_set_filter(
> +			inode, wqueue,
> +			(struct watch_notification_filter __user *)arg);
> +		inode_unlock(inode);
> +		return ret;
> +
> +	default:
> +		return -EOPNOTSUPP;

-ENOTTY is the correct "not a valid ioctl" error value, right?

> +	}
> +}

> +/**
> + * put_watch_queue - Dispose of a ref on a watchqueue.
> + * @wqueue: The watch queue to unref.
> + */
> +void put_watch_queue(struct watch_queue *wqueue)
> +{
> +	if (refcount_dec_and_test(&wqueue->usage))
> +		kfree_rcu(wqueue, rcu);

Why not just use a kref?

> +}
> +EXPORT_SYMBOL(put_watch_queue);


> +int add_watch_to_object(struct watch *watch, struct watch_list *wlist)
> +{
> +	struct watch_queue *wqueue = rcu_access_pointer(watch->queue);
> +	struct watch *w;
> +
> +	hlist_for_each_entry(w, &wlist->watchers, list_node) {
> +		if (watch->id == w->id)
> +			return -EBUSY;
> +	}
> +
> +	rcu_assign_pointer(watch->watch_list, wlist);
> +
> +	spin_lock_bh(&wqueue->lock);
> +	refcount_inc(&wqueue->usage);
> +	hlist_add_head(&watch->queue_node, &wqueue->watches);
> +	spin_unlock_bh(&wqueue->lock);
> +
> +	hlist_add_head(&watch->list_node, &wlist->watchers);
> +	return 0;
> +}
> +EXPORT_SYMBOL(add_watch_to_object);

Naming nit, shouldn't the "prefix" all be the same for these new
functions?

watch_queue_add_object()?  watch_queue_put()?  And so on?

> +static int __init watch_queue_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&watch_queue_dev);
> +	if (ret < 0)
> +		pr_err("Failed to register %d\n", ret);
> +	return ret;
> +}
> +fs_initcall(watch_queue_init);
> +
> +static void __exit watch_queue_exit(void)
> +{
> +	misc_deregister(&watch_queue_dev);
> +}
> +module_exit(watch_queue_exit);

module_misc_device()?


> --- /dev/null
> +++ b/include/linux/watch_queue.h
> @@ -0,0 +1,86 @@
> +/* User-mappable watch queue
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.

Again, SPDX headers please.

> --- /dev/null
> +++ b/include/uapi/linux/watch_queue.h
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */

Yeah!!!

No copyright?  :(

> +#ifndef _UAPI_LINUX_WATCH_QUEUE_H
> +#define _UAPI_LINUX_WATCH_QUEUE_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define IOC_WATCH_QUEUE_SET_SIZE	_IO('s', 0x01)	/* Set the size in pages */
> +#define IOC_WATCH_QUEUE_SET_FILTER	_IO('s', 0x02)	/* Set the filter */
> +
> +enum watch_notification_type {
> +	WATCH_TYPE_META		= 0,	/* Special record */
> +	WATCH_TYPE_MOUNT_NOTIFY	= 1,	/* Mount notification record */
> +	WATCH_TYPE_SB_NOTIFY	= 2,	/* Superblock notification */
> +	WATCH_TYPE_KEY_NOTIFY	= 3,	/* Key/keyring change notification */
> +	WATCH_TYPE_BLOCK_NOTIFY	= 4,	/* Block layer notifications */
> +#define WATCH_TYPE___NR 5
> +};
> +
> +enum watch_meta_notification_subtype {
> +	WATCH_META_SKIP_NOTIFICATION	= 0,	/* Just skip this record */
> +	WATCH_META_REMOVAL_NOTIFICATION	= 1,	/* Watched object was removed */
> +};
> +
> +/*
> + * Notification record
> + */
> +struct watch_notification {
> +	__u32			type:24;	/* enum watch_notification_type */
> +	__u32			subtype:8;	/* Type-specific subtype (filterable) */
> +	__u32			info;
> +#define WATCH_INFO_OVERRUN	0x00000001	/* Event(s) lost due to overrun */
> +#define WATCH_INFO_ENOMEM	0x00000002	/* Event(s) lost due to ENOMEM */
> +#define WATCH_INFO_RECURSIVE	0x00000004	/* Change was recursive */
> +#define WATCH_INFO_LENGTH	0x000001f8	/* Length of record / sizeof(watch_notification) */
> +#define WATCH_INFO_IN_SUBTREE	0x00000200	/* Change was not at watched root */
> +#define WATCH_INFO_TYPE_FLAGS	0x00ff0000	/* Type-specific flags */
> +#define WATCH_INFO_FLAG_0	0x00010000
> +#define WATCH_INFO_FLAG_1	0x00020000
> +#define WATCH_INFO_FLAG_2	0x00040000
> +#define WATCH_INFO_FLAG_3	0x00080000
> +#define WATCH_INFO_FLAG_4	0x00100000
> +#define WATCH_INFO_FLAG_5	0x00200000
> +#define WATCH_INFO_FLAG_6	0x00400000
> +#define WATCH_INFO_FLAG_7	0x00800000
> +#define WATCH_INFO_ID		0xff000000	/* ID of watchpoint */
> +};
> +
> +#define WATCH_LENGTH_SHIFT	3
> +
> +struct watch_queue_buffer {
> +	union {
> +		/* The first few entries are special, containing the
> +		 * ring management variables.
> +		 */
> +		struct {
> +			struct watch_notification watch; /* WATCH_TYPE_SKIP */
> +			volatile __u32	head;		/* Ring head index */
> +			volatile __u32	tail;		/* Ring tail index */

A uapi structure that has volatile in it?  Are you _SURE_ this is
correct?

That feels wrong to me...  This is not a backing-hardware register, it's
"just memory" and slapping volatile on it shouldn't be the correct
solution for telling the compiler to not to optimize away reads/flushes,
right?  You need a proper memory access type primitive for that to work
correctly everywhere I thought.

We only have 2 users of volatile in include/uapi, one for WMI structures
that are backed by firmware (seems correct), and one for DRM which I
have no idea how it works as it claims to be a lock.  Why is this new
addition the correct way to do this that no other ring-buffer that was
mmapped has needed to?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
  2019-05-28 16:26   ` Greg KH
@ 2019-05-28 17:30   ` David Howells
  2019-05-28 23:12     ` Greg KH
  2019-05-29 16:06     ` David Howells
  2019-05-28 19:14   ` Jann Horn
  2019-05-28 22:28   ` David Howells
  3 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 17:30 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

Greg KH <gregkh@linuxfoundation.org> wrote:

> > Implement a misc device that implements a general notification queue as a
> > ring buffer that can be mmap()'d from userspace.
> 
> "general" but just for filesystems, right?  :(

Whatever gave you that idea?  You can watch keyrings events, for example -
they're not exactly filesystems.  I've added the ability to watch for mount
topology changes and superblock events because those are something I've been
asked to do.  I've added something for block events because I've recently had
a problem with trying to recover data from a dodgy disk in that every time the
disk goes offline, the ddrecover goes "wheeeee!" as it just sees a lot of
EIO/ENODATA at a great rate of knots because it doesn't know the driver is now
ignoring the disk.

I don't know what else people might want to watch, but I've tried to make it
as generic as possible so as not to exclude it if possible.

> This doesn't match the structure definition in the documentation, so
> something is out of sync.

Ah, yes - I need to update that doc, thanks.

> I'm all for a "generic" event system for the kernel (heck, Solaris has
> had one for decades), but it keeps getting shot down every time it comes
> up.  What is different about this one?

Without studying all the other ones, I can't say - however, I need to add
something for keyrings and I would prefer to make something generic.

> > +#define DEBUG_WITH_WRITE /* Allow use of write() to record notifications */
> 
> debugging code left in?

I'll switch it to #undef.  I want to leave the code in there for testing
purposes.  Possibly I should make it a Kconfig option.

> > +	refcount_t		usage;
> 
> Usage of what, this structure?  Or something else?

This is the number of usages of this struct (references to if you prefer).  I
can add a comment to this effect.

> > +EXPORT_SYMBOL(__post_watch_notification);
> 
> _GPL for new apis?  (I have to ask...)

No.

> > +		return -EOPNOTSUPP;
> 
> -ENOTTY is the correct "not a valid ioctl" error value, right?

fs/ioctl.c does both, but I can switch it if it makes you happier.

> > +void put_watch_queue(struct watch_queue *wqueue)
> > +{
> > +	if (refcount_dec_and_test(&wqueue->usage))
> > +		kfree_rcu(wqueue, rcu);
> 
> Why not just use a kref?

Why use a kref?  It seems like an effort to be a C++ base class, but without
the C++ inheritance bit.  Using kref doesn't seem to gain anything.  It's just
a wrapper around refcount_t - so why not just use a refcount_t?

kref_put() could potentially add an unnecessary extra stack frame and would
seem to be best avoided, though an optimising compiler ought to be able to
inline if it can.

Are you now on the convert all refcounts to krefs path?

> > +EXPORT_SYMBOL(add_watch_to_object);
> 
> Naming nit, shouldn't the "prefix" all be the same for these new
> functions?
> 
> watch_queue_add_object()?  watch_queue_put()?  And so on?

Naming is fun.  watch_queue_add_object - that suggests something different to
what the function actually does.  I'll think about adjusting the names.

> > +module_exit(watch_queue_exit);
> 
> module_misc_device()?

	warthog>git grep module_misc_device -- Documentation/
	warthog1>

> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> 
> Yeah!!!

Blech.

> > +		struct {
> > +			struct watch_notification watch; /* WATCH_TYPE_SKIP */
> > +			volatile __u32	head;		/* Ring head index */
> > +			volatile __u32	tail;		/* Ring tail index */
> 
> A uapi structure that has volatile in it?  Are you _SURE_ this is
> correct?
> 
> That feels wrong to me...  This is not a backing-hardware register, it's
> "just memory" and slapping volatile on it shouldn't be the correct
> solution for telling the compiler to not to optimize away reads/flushes,
> right?  You need a proper memory access type primitive for that to work
> correctly everywhere I thought.
> 
> We only have 2 users of volatile in include/uapi, one for WMI structures
> that are backed by firmware (seems correct), and one for DRM which I
> have no idea how it works as it claims to be a lock.  Why is this new
> addition the correct way to do this that no other ring-buffer that was
> mmapped has needed to?

Yeah, I understand your concern with this.

The reason I put the volatiles in is that the kernel may be modifying the head
pointer on one CPU simultaneously with userspace modifying the tail pointer on
another CPU.

Note that userspace does not need to enter the kernel to find out if there's
anything in the buffer or to read stuff out of the buffer.  Userspace only
needs to enter the kernel, using poll() or similar, to wait for something to
appear in the buffer.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
  2019-05-28 16:26   ` Greg KH
  2019-05-28 17:30   ` David Howells
@ 2019-05-28 19:14   ` Jann Horn
  2019-05-28 22:28   ` David Howells
  3 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 19:14 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Tue, May 28, 2019 at 6:03 PM David Howells <dhowells@redhat.com> wrote:
> Implement a misc device that implements a general notification queue as a
> ring buffer that can be mmap()'d from userspace.
[...]
> +receive notifications from the kernel.  This is can be used in conjunction

typo: s/is can/can/

[...]
> +Overview
> +========
> +
> +This facility appears as a misc device file that is opened and then mapped and
> +polled.  Each time it is opened, it creates a new buffer specific to the
> +returned file descriptor.  Then, when the opening process sets watches, it
> +indicates that particular buffer it wants notifications from that watch to be
> +written into. Note that there are no read() and write() methods (except for

s/that particular buffer/the particular buffer/

> +debugging).  The user is expected to access the ring directly and to use poll
> +to wait for new data.
[...]
> +/**
> + * __post_watch_notification - Post an event notification
> + * @wlist: The watch list to post the event to.
> + * @n: The notification record to post.
> + * @cred: The creds of the process that triggered the notification.
> + * @id: The ID to match on the watch.
> + *
> + * Post a notification of an event into a set of watch queues and let the users
> + * know.
> + *
> + * If @n is NULL then WATCH_INFO_LENGTH will be set on the next event posted.
> + *
> + * The size of the notification should be set in n->info & WATCH_INFO_LENGTH and
> + * should be in units of sizeof(*n).
> + */
> +void __post_watch_notification(struct watch_list *wlist,
> +                              struct watch_notification *n,
> +                              const struct cred *cred,
> +                              u64 id)
> +{
> +       const struct watch_filter *wf;
> +       struct watch_queue *wqueue;
> +       struct watch *watch;
> +
> +       rcu_read_lock();
> +
> +       hlist_for_each_entry_rcu(watch, &wlist->watchers, list_node) {
> +               if (watch->id != id)
> +                       continue;
> +               n->info &= ~(WATCH_INFO_ID | WATCH_INFO_OVERRUN);
> +               n->info |= watch->info_id;
> +
> +               wqueue = rcu_dereference(watch->queue);
> +               wf = rcu_dereference(wqueue->filter);
> +               if (wf && !filter_watch_notification(wf, n))
> +                       continue;
> +
> +               post_one_notification(wqueue, n, cred);
> +       }
> +
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(__post_watch_notification);
[...]
> +static vm_fault_t watch_queue_fault(struct vm_fault *vmf)
> +{
> +       struct watch_queue *wqueue = vmf->vma->vm_file->private_data;
> +       struct page *page;
> +
> +       page = wqueue->pages[vmf->pgoff];

I don't see you setting any special properties on the VMA that would
prevent userspace from extending its size via mremap() - no
VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
access here?

> +       get_page(page);
> +       if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> +               put_page(page);
> +               return VM_FAULT_RETRY;
> +       }
> +       vmf->page = page;
> +       return VM_FAULT_LOCKED;
> +}
> +
> +static void watch_queue_map_pages(struct vm_fault *vmf,
> +                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
> +{
> +       struct watch_queue *wqueue = vmf->vma->vm_file->private_data;
> +       struct page *page;
> +
> +       rcu_read_lock();
> +
> +       do {
> +               page = wqueue->pages[start_pgoff];

Same as above.

> +               if (trylock_page(page)) {
> +                       vm_fault_t ret;
> +                       get_page(page);
> +                       ret = alloc_set_pte(vmf, NULL, page);
> +                       if (ret != 0)
> +                               put_page(page);
> +
> +                       unlock_page(page);
> +               }
> +       } while (++start_pgoff < end_pgoff);
> +
> +       rcu_read_unlock();
> +}
[...]
> +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       struct watch_queue *wqueue = file->private_data;
> +
> +       if (vma->vm_pgoff != 0 ||
> +           vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> +           !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> +               return -EINVAL;

This thing should probably have locking against concurrent
watch_queue_set_size()?

> +       vma->vm_ops = &watch_queue_vm_ops;
> +
> +       vma_interval_tree_insert(vma, &wqueue->mapping.i_mmap);
> +       return 0;
> +}
> +
> +/*
> + * Allocate the required number of pages.
> + */
> +static long watch_queue_set_size(struct watch_queue *wqueue, unsigned long nr_pages)
> +{
> +       struct watch_queue_buffer *buf;
> +       u32 len;
> +       int i;
> +
> +       if (nr_pages == 0 ||
> +           nr_pages > 16 || /* TODO: choose a better hard limit */
> +           !is_power_of_2(nr_pages))
> +               return -EINVAL;
> +
> +       wqueue->pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
> +       if (!wqueue->pages)
> +               goto err;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               wqueue->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +               if (!wqueue->pages[i])
> +                       goto err_some_pages;
> +               wqueue->pages[i]->mapping = &wqueue->mapping;
> +               SetPageUptodate(wqueue->pages[i]);
> +       }
> +
> +       buf = vmap(wqueue->pages, nr_pages, VM_MAP, PAGE_SHARED);
> +       if (!buf)
> +               goto err_some_pages;
> +
> +       wqueue->buffer = buf;
> +       wqueue->nr_pages = nr_pages;
> +       wqueue->size = ((nr_pages * PAGE_SIZE) / sizeof(struct watch_notification));
> +
> +       /* The first four slots in the buffer contain metadata about the ring,
> +        * including the head and tail indices and mask.
> +        */
> +       len = sizeof(buf->meta) / sizeof(buf->slots[0]);
> +       buf->meta.watch.info    = len << WATCH_LENGTH_SHIFT;
> +       buf->meta.watch.type    = WATCH_TYPE_META;
> +       buf->meta.watch.subtype = WATCH_META_SKIP_NOTIFICATION;
> +       buf->meta.tail          = len;
> +       buf->meta.mask          = wqueue->size - 1;
> +       smp_store_release(&buf->meta.head, len);

Why is this an smp_store_release()? The entire buffer should not be visible to
userspace before this setup is complete, right?

> +       return 0;
> +
> +err_some_pages:
> +       for (i--; i >= 0; i--) {
> +               ClearPageUptodate(wqueue->pages[i]);
> +               wqueue->pages[i]->mapping = NULL;
> +               put_page(wqueue->pages[i]);
> +       }
> +
> +       kfree(wqueue->pages);
> +       wqueue->pages = NULL;
> +err:
> +       return -ENOMEM;
> +}
[...]
> +
> +/*
> + * Set parameters.
> + */
> +static long watch_queue_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +       struct watch_queue *wqueue = file->private_data;
> +       struct inode *inode = file_inode(file);
> +       long ret;
> +
> +       switch (cmd) {
> +       case IOC_WATCH_QUEUE_SET_SIZE:
> +               if (wqueue->buffer)
> +                       return -EBUSY;

The preceding check occurs without any locks held and therefore has no
reliable effect. It should probably be moved below the
inode_lock(...).

> +               inode_lock(inode);
> +               ret = watch_queue_set_size(wqueue, arg);
> +               inode_unlock(inode);
> +               return ret;
> +
> +       case IOC_WATCH_QUEUE_SET_FILTER:
> +               inode_lock(inode);
> +               ret = watch_queue_set_filter(
> +                       inode, wqueue,
> +                       (struct watch_notification_filter __user *)arg);
> +               inode_unlock(inode);
> +               return ret;
> +
> +       default:
> +               return -EOPNOTSUPP;
> +       }
> +}
[...]
> +static void free_watch(struct rcu_head *rcu)
> +{
> +       struct watch *watch = container_of(rcu, struct watch, rcu);
> +
> +       put_watch_queue(rcu_access_pointer(watch->queue));

This should be rcu_dereference_protected(..., 1).

> +/**
> + * remove_watch_from_object - Remove a watch or all watches from an object.
> + * @wlist: The watch list to remove from
> + * @wq: The watch queue of interest (ignored if @all is true)
> + * @id: The ID of the watch to remove (ignored if @all is true)
> + * @all: True to remove all objects
> + *
> + * Remove a specific watch or all watches from an object.  A notification is
> + * sent to the watcher to tell them that this happened.
> + */
> +int remove_watch_from_object(struct watch_list *wlist, struct watch_queue *wq,
> +                            u64 id, bool all)
> +{
> +       struct watch_notification n;
> +       struct watch_queue *wqueue;
> +       struct watch *watch;
> +       int ret = -EBADSLT;
> +
> +       rcu_read_lock();
> +
> +again:
> +       spin_lock(&wlist->lock);
> +       hlist_for_each_entry(watch, &wlist->watchers, list_node) {
> +               if (all ||
> +                   (watch->id == id && rcu_access_pointer(watch->queue) == wq))
> +                       goto found;
> +       }
> +       spin_unlock(&wlist->lock);
> +       goto out;
> +
> +found:
> +       ret = 0;
> +       hlist_del_init_rcu(&watch->list_node);
> +       rcu_assign_pointer(watch->watch_list, NULL);
> +       spin_unlock(&wlist->lock);
> +
> +       n.type = WATCH_TYPE_META;
> +       n.subtype = WATCH_META_REMOVAL_NOTIFICATION;
> +       n.info = watch->info_id | sizeof(n);
> +
> +       wqueue = rcu_dereference(watch->queue);
> +       post_one_notification(wqueue, &n, wq ? wq->cred : NULL);
> +
> +       /* We don't need the watch list lock for the next bit as RCU is
> +        * protecting everything from being deallocated.

Does "everything" mean "the wqueue" or more than that?

> +        */
> +       if (wqueue) {
> +               spin_lock_bh(&wqueue->lock);
> +
> +               if (!hlist_unhashed(&watch->queue_node)) {
> +                       hlist_del_init_rcu(&watch->queue_node);
> +                       put_watch(watch);
> +               }
> +
> +               spin_unlock_bh(&wqueue->lock);
> +       }
> +
> +       if (wlist->release_watch) {
> +               rcu_read_unlock();
> +               wlist->release_watch(wlist, watch);
> +               rcu_read_lock();
> +       }
> +       put_watch(watch);
> +
> +       if (all && !hlist_empty(&wlist->watchers))
> +               goto again;
> +out:
> +       rcu_read_unlock();
> +       return ret;
> +}
> +EXPORT_SYMBOL(remove_watch_from_object);
> +
> +/*
> + * Remove all the watches that are contributory to a queue.  This will
> + * potentially race with removal of the watches by the destruction of the
> + * objects being watched or the distribution of notifications.
> + */
> +static void watch_queue_clear(struct watch_queue *wqueue)
> +{
> +       struct watch_list *wlist;
> +       struct watch *watch;
> +       bool release;
> +
> +       rcu_read_lock();
> +       spin_lock_bh(&wqueue->lock);
> +
> +       /* Prevent new additions and prevent notifications from happening */
> +       wqueue->defunct = true;
> +
> +       while (!hlist_empty(&wqueue->watches)) {
> +               watch = hlist_entry(wqueue->watches.first, struct watch, queue_node);
> +               hlist_del_init_rcu(&watch->queue_node);
> +               spin_unlock_bh(&wqueue->lock);
> +
> +               /* We can't do the next bit under the queue lock as we need to
> +                * get the list lock - which would cause a deadlock if someone
> +                * was removing from the opposite direction at the same time or
> +                * posting a notification.
> +                */
> +               wlist = rcu_dereference(watch->watch_list);
> +               if (wlist) {
> +                       spin_lock(&wlist->lock);
> +
> +                       release = !hlist_unhashed(&watch->list_node);
> +                       if (release) {
> +                               hlist_del_init_rcu(&watch->list_node);
> +                               rcu_assign_pointer(watch->watch_list, NULL);
> +                       }
> +
> +                       spin_unlock(&wlist->lock);
> +
> +                       if (release) {
> +                               if (wlist->release_watch) {
> +                                       rcu_read_unlock();
> +                                       /* This might need to call dput(), so
> +                                        * we have to drop all the locks.
> +                                        */
> +                                       wlist->release_watch(wlist, watch);

How are you holding a reference to `wlist` here? You got the reference through
rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
that stabilizes the reference.

> +                                       rcu_read_lock();
> +                               }
> +                               put_watch(watch);
> +                       }
> +               }
> +
> +               put_watch(watch);
> +               spin_lock_bh(&wqueue->lock);
> +       }
> +
> +       spin_unlock_bh(&wqueue->lock);
> +       rcu_read_unlock();
> +}
> +
> +/*
> + * Release the file.
> + */
> +static int watch_queue_release(struct inode *inode, struct file *file)
> +{
> +       struct watch_filter *wfilter;
> +       struct watch_queue *wqueue = file->private_data;
> +       int i, pgref;
> +
> +       watch_queue_clear(wqueue);
> +
> +       if (wqueue->pages && wqueue->pages[0])
> +               WARN_ON(page_ref_count(wqueue->pages[0]) != 1);

Is there a reason why there couldn't still be references to the pages
from get_user_pages()/get_user_pages_fast()?

> +       if (wqueue->buffer)
> +               vfree(wqueue->buffer);
> +       for (i = 0; i < wqueue->nr_pages; i++) {
> +               ClearPageUptodate(wqueue->pages[i]);
> +               wqueue->pages[i]->mapping = NULL;
> +               pgref = page_ref_count(wqueue->pages[i]);
> +               WARN(pgref != 1,
> +                    "FREE PAGE[%d] refcount %d\n", i, page_ref_count(wqueue->pages[i]));
> +               __free_page(wqueue->pages[i]);
> +       }
> +
> +       wfilter = rcu_access_pointer(wqueue->filter);

Again, rcu_dereference_protected(..., 1).

> +       if (wfilter)
> +               kfree_rcu(wfilter, rcu);
> +       kfree(wqueue->pages);
> +       put_cred(wqueue->cred);
> +       put_watch_queue(wqueue);
> +       return 0;
> +}
> +
> +#ifdef DEBUG_WITH_WRITE
> +static ssize_t watch_queue_write(struct file *file,
> +                                const char __user *_buf, size_t len, loff_t *pos)
> +{
> +       struct watch_notification *n;
> +       struct watch_queue *wqueue = file->private_data;
> +       ssize_t ret;
> +
> +       if (!wqueue->buffer)
> +               return -ENOBUFS;
> +
> +       if (len & ~WATCH_INFO_LENGTH || len == 0 || !_buf)
> +               return -EINVAL;
> +
> +       n = memdup_user(_buf, len);
> +       if (IS_ERR(n))
> +               return PTR_ERR(n);
> +
> +       ret = -EINVAL;
> +       if ((n->info & WATCH_INFO_LENGTH) != len)
> +               goto error;
> +       n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);

Should the non-atomic modification of n->info (and perhaps also the
following uses of ->debug) be protected by some lock?

> +       if (post_one_notification(wqueue, n, file->f_cred))
> +               wqueue->debug = 0;
> +       else
> +               wqueue->debug++;
> +       ret = len;
> +       if (wqueue->debug > 20)
> +               ret = -EIO;
> +
> +error:
> +       kfree(n);
> +       return ret;
> +}
> +#endif
[...]
> +#define IOC_WATCH_QUEUE_SET_SIZE       _IO('s', 0x01)  /* Set the size in pages */
> +#define IOC_WATCH_QUEUE_SET_FILTER     _IO('s', 0x02)  /* Set the filter */

Should these ioctl numbers be registered in
Documentation/ioctl/ioctl-number.txt?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
@ 2019-05-28 20:06   ` Jann Horn
  2019-05-28 23:04   ` David Howells
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 20:06 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Tue, May 28, 2019 at 6:05 PM David Howells <dhowells@redhat.com> wrote:
> Add a mount notification facility whereby notifications about changes in
> mount topology and configuration can be received.  Note that this only
> covers vfsmount topology changes and not superblock events.  A separate
> facility will be added for that.
[...]
> @@ -172,4 +167,18 @@ static inline void notify_mount(struct mount *changed,
>                                 u32 info_flags)
>  {
>         atomic_inc(&changed->mnt_notify_counter);
> +
> +#ifdef CONFIG_MOUNT_NOTIFICATIONS
> +       {
> +               struct mount_notification n = {
> +                       .watch.type     = WATCH_TYPE_MOUNT_NOTIFY,
> +                       .watch.subtype  = subtype,
> +                       .watch.info     = info_flags | sizeof(n),
> +                       .triggered_on   = changed->mnt_id,
> +                       .changed_mount  = aux ? aux->mnt_id : 0,
> +               };
> +
> +               post_mount_notification(changed, &n);
> +       }
> +#endif
>  }
[...]
> +void post_mount_notification(struct mount *changed,
> +                            struct mount_notification *notify)
> +{
> +       const struct cred *cred = current_cred();

This current_cred() looks bogus to me. Can't mount topology changes
come from all sorts of places? For example, umount_mnt() from
umount_tree() from dissolve_on_fput() from __fput(), which could
happen pretty much anywhere depending on where the last reference gets
dropped?

> +       struct path cursor;
> +       struct mount *mnt;
> +       unsigned seq;
> +
> +       seq = 0;
> +       rcu_read_lock();
> +restart:
> +       cursor.mnt = &changed->mnt;
> +       cursor.dentry = changed->mnt.mnt_root;
> +       mnt = real_mount(cursor.mnt);
> +       notify->watch.info &= ~WATCH_INFO_IN_SUBTREE;
> +
> +       read_seqbegin_or_lock(&rename_lock, &seq);
> +       for (;;) {
> +               if (mnt->mnt_watchers &&
> +                   !hlist_empty(&mnt->mnt_watchers->watchers)) {
> +                       if (cursor.dentry->d_flags & DCACHE_MOUNT_WATCH)
> +                               post_watch_notification(mnt->mnt_watchers,
> +                                                       &notify->watch, cred,
> +                                                       (unsigned long)cursor.dentry);
> +               } else {
> +                       cursor.dentry = mnt->mnt.mnt_root;
> +               }
> +               notify->watch.info |= WATCH_INFO_IN_SUBTREE;
> +
> +               if (cursor.dentry == cursor.mnt->mnt_root ||
> +                   IS_ROOT(cursor.dentry)) {
> +                       struct mount *parent = READ_ONCE(mnt->mnt_parent);
> +
> +                       /* Escaped? */
> +                       if (cursor.dentry != cursor.mnt->mnt_root)
> +                               break;
> +
> +                       /* Global root? */
> +                       if (mnt != parent) {
> +                               cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
> +                               mnt = parent;
> +                               cursor.mnt = &mnt->mnt;
> +                               continue;
> +                       }
> +                       break;

(nit: this would look clearer if you inverted the condition and wrote
it as "if (mnt == parent) break;", then you also wouldn't need that
"continue" or the braces)

> +               }
> +
> +               cursor.dentry = cursor.dentry->d_parent;
> +       }
> +
> +       if (need_seqretry(&rename_lock, seq)) {
> +               seq = 1;
> +               goto restart;
> +       }
> +
> +       done_seqretry(&rename_lock, seq);
> +       rcu_read_unlock();
> +}
[...]
> +SYSCALL_DEFINE5(mount_notify,
> +               int, dfd,
> +               const char __user *, filename,
> +               unsigned int, at_flags,
> +               int, watch_fd,
> +               int, watch_id)
> +{
> +       struct watch_queue *wqueue;
> +       struct watch_list *wlist = NULL;
> +       struct watch *watch;
> +       struct mount *m;
> +       struct path path;
> +       int ret;
> +
> +       if (watch_id < -1 || watch_id > 0xff)
> +               return -EINVAL;
> +
> +       ret = user_path_at(dfd, filename, at_flags, &path);

The third argument of user_path_at() contains kernel-private lookup
flags, I'm pretty sure userspace isn't supposed to be able to control
these directly.

> +       if (ret)
> +               return ret;
> +
> +       wqueue = get_watch_queue(watch_fd);
> +       if (IS_ERR(wqueue))
> +               goto err_path;
> +
> +       m = real_mount(path.mnt);
> +
> +       if (watch_id >= 0) {
> +               if (!m->mnt_watchers) {
> +                       wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
> +                       if (!wlist)
> +                               goto err_wqueue;
> +                       INIT_HLIST_HEAD(&wlist->watchers);
> +                       spin_lock_init(&wlist->lock);
> +                       wlist->release_watch = release_mount_watch;
> +               }
> +
> +               watch = kzalloc(sizeof(*watch), GFP_KERNEL);
> +               if (!watch)
> +                       goto err_wlist;
> +
> +               init_watch(watch, wqueue);
> +               watch->id               = (unsigned long)path.dentry;
> +               watch->private          = path.mnt;
> +               watch->info_id          = (u32)watch_id << 24;
> +
> +               down_write(&m->mnt.mnt_sb->s_umount);
> +               if (!m->mnt_watchers) {
> +                       m->mnt_watchers = wlist;
> +                       wlist = NULL;
> +               }
> +
> +               ret = add_watch_to_object(watch, m->mnt_watchers);
> +               if (ret == 0) {
> +                       spin_lock(&path.dentry->d_lock);
> +                       path.dentry->d_flags |= DCACHE_MOUNT_WATCH;
> +                       spin_unlock(&path.dentry->d_lock);
> +                       path_get(&path);

So... the watches on a mountpoint create references back to the
mountpoint? Is your plan that umount_tree() breaks the loop by getting
rid of the watches?

If so: Is there anything that prevents installing new watches after
umount_tree()? Because I don't see anything.

It might make sense to redesign this stuff so that watches don't hold
references on the object being watched.

> +               }
> +               up_write(&m->mnt.mnt_sb->s_umount);
> +               if (ret < 0)
> +                       kfree(watch);
> +       } else if (m->mnt_watchers) {
> +               down_write(&m->mnt.mnt_sb->s_umount);
> +               ret = remove_watch_from_object(m->mnt_watchers, wqueue,
> +                                              (unsigned long)path.dentry,
> +                                              false);
> +               up_write(&m->mnt.mnt_sb->s_umount);
> +       } else {
> +               ret = -EBADSLT;
> +       }
> +
> +err_wlist:
> +       kfree(wlist);
> +err_wqueue:
> +       put_watch_queue(wqueue);
> +err_path:
> +       path_put(&path);
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/7] vfs: Add superblock notifications
  2019-05-28 16:02 ` [PATCH 4/7] vfs: Add superblock notifications David Howells
@ 2019-05-28 20:27   ` Jann Horn
  2019-05-29 12:58   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 20:27 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Tue, May 28, 2019 at 6:05 PM David Howells <dhowells@redhat.com> wrote:
> Add a superblock event notification facility whereby notifications about
> superblock events, such as I/O errors (EIO), quota limits being hit
> (EDQUOT) and running out of space (ENOSPC) can be reported to a monitoring
> process asynchronously.  Note that this does not cover vfsmount topology
> changes.  mount_notify() is used for that.
[...]
> +#ifdef CONFIG_SB_NOTIFICATIONS
> +/*
> + * Post superblock notifications.
> + */
> +void post_sb_notification(struct super_block *s, struct superblock_notification *n)
> +{
> +       post_watch_notification(s->s_watchers, &n->watch, current_cred(),
> +                               s->s_unique_id);
> +}

You're using current_cred() here? So the idea is that if some random
process runs into a disk I/O error, the I/O error will come from that
task's credentials? In general, you're not supposed to look at task
credentials in ->read/->write handlers.

> +static void release_sb_watch(struct watch_list *wlist, struct watch *watch)
> +{
> +       struct super_block *s = watch->private;
> +
> +       put_super(s);
> +}
> +
> +/**
> + * sys_sb_notify - Watch for superblock events.
> + * @dfd: Base directory to pathwalk from or fd referring to superblock.
> + * @filename: Path to superblock to place the watch upon
> + * @at_flags: Pathwalk control flags
> + * @watch_fd: The watch queue to send notifications to.
> + * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
> + */
> +SYSCALL_DEFINE5(sb_notify,
> +               int, dfd,
> +               const char __user *, filename,
> +               unsigned int, at_flags,
> +               int, watch_fd,
> +               int, watch_id)
> +{
> +       struct watch_queue *wqueue;
> +       struct super_block *s;
> +       struct watch_list *wlist = NULL;
> +       struct watch *watch;
> +       struct path path;
> +       int ret;
> +
> +       if (watch_id < -1 || watch_id > 0xff)
> +               return -EINVAL;
> +
> +       ret = user_path_at(dfd, filename, at_flags, &path);

As in the other patch, I don't think userspace is supposed to be able
to supply user_path_at()'s third argument.

It might make sense to require that the path points to the root inode
of the superblock? That way you wouldn't be able to do this on a bind
mount that exposes part of a shared filesystem to a container.

> +       if (ret)
> +               return ret;
> +
> +       wqueue = get_watch_queue(watch_fd);
> +       if (IS_ERR(wqueue))
> +               goto err_path;
> +
> +       s = path.dentry->d_sb;
> +       if (watch_id >= 0) {
> +               if (!s->s_watchers) {
> +                       wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
> +                       if (!wlist)
> +                               goto err_wqueue;
> +                       INIT_HLIST_HEAD(&wlist->watchers);
> +                       spin_lock_init(&wlist->lock);
> +                       wlist->release_watch = release_sb_watch;
> +               }
> +
> +               watch = kzalloc(sizeof(*watch), GFP_KERNEL);
> +               if (!watch)
> +                       goto err_wlist;
> +
> +               init_watch(watch, wqueue);
> +               watch->id               = s->s_unique_id;
> +               watch->private          = s;
> +               watch->info_id          = (u32)watch_id << 24;
> +
> +               down_write(&s->s_umount);
> +               ret = -EIO;
> +               if (atomic_read(&s->s_active)) {
> +                       if (!s->s_watchers) {
> +                               s->s_watchers = wlist;
> +                               wlist = NULL;
> +                       }
> +
> +                       ret = add_watch_to_object(watch, s->s_watchers);
> +                       if (ret == 0) {
> +                               spin_lock(&sb_lock);
> +                               s->s_count++;
> +                               spin_unlock(&sb_lock);

Why do watches hold references on the superblock they're watching?

> +                       }
> +               }
> +               up_write(&s->s_umount);
> +               if (ret < 0)
> +                       kfree(watch);
> +       } else if (s->s_watchers) {

This should probably have something like a READ_ONCE() for clarity?

> +               down_write(&s->s_umount);
> +               ret = remove_watch_from_object(s->s_watchers, wqueue,
> +                                              s->s_unique_id, false);
> +               up_write(&s->s_umount);
> +       } else {
> +               ret = -EBADSLT;
> +       }
> +
> +err_wlist:
> +       kfree(wlist);
> +err_wqueue:
> +       put_watch_queue(wqueue);
> +err_path:
> +       path_put(&path);
> +       return ret;
> +}
> +#endif

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/7] block: Add block layer notifications
  2019-05-28 16:02 ` [PATCH 6/7] block: Add block layer notifications David Howells
@ 2019-05-28 20:37   ` Jann Horn
  0 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 20:37 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Tue, May 28, 2019 at 6:05 PM David Howells <dhowells@redhat.com> wrote:
> Add a block layer notification mechanism whereby notifications about
> block-layer events such as I/O errors, can be reported to a monitoring
> process asynchronously.
[...]
> +#ifdef CONFIG_BLK_NOTIFICATIONS
> +static const enum block_notification_type blk_notifications[] = {
> +       [BLK_STS_TIMEOUT]       = NOTIFY_BLOCK_ERROR_TIMEOUT,
> +       [BLK_STS_NOSPC]         = NOTIFY_BLOCK_ERROR_NO_SPACE,
> +       [BLK_STS_TRANSPORT]     = NOTIFY_BLOCK_ERROR_RECOVERABLE_TRANSPORT,
> +       [BLK_STS_TARGET]        = NOTIFY_BLOCK_ERROR_CRITICAL_TARGET,
> +       [BLK_STS_NEXUS]         = NOTIFY_BLOCK_ERROR_CRITICAL_NEXUS,
> +       [BLK_STS_MEDIUM]        = NOTIFY_BLOCK_ERROR_CRITICAL_MEDIUM,
> +       [BLK_STS_PROTECTION]    = NOTIFY_BLOCK_ERROR_PROTECTION,
> +       [BLK_STS_RESOURCE]      = NOTIFY_BLOCK_ERROR_KERNEL_RESOURCE,
> +       [BLK_STS_DEV_RESOURCE]  = NOTIFY_BLOCK_ERROR_DEVICE_RESOURCE,
> +       [BLK_STS_IOERR]         = NOTIFY_BLOCK_ERROR_IO,
> +};
> +#endif
> +
>  blk_status_t errno_to_blk_status(int errno)
>  {
>         int i;
> @@ -179,6 +194,19 @@ static void print_req_error(struct request *req, blk_status_t status)
>                                 req->rq_disk ?  req->rq_disk->disk_name : "?",
>                                 (unsigned long long)blk_rq_pos(req),
>                                 req->cmd_flags);
> +
> +#ifdef CONFIG_BLK_NOTIFICATIONS
> +       if (blk_notifications[idx]) {

If you have this branch here, that indicates that blk_notifications
might be sparse - but at the same time, blk_notifications is not
defined in a way that explicitly ensures that it has as many elements
as blk_errors. It might make sense to add an explicit length to the
definition of blk_notifications - something like "static const enum
block_notification_type blk_notifications[ARRAY_SIZE(blk_errors)]"
maybe?

> +               struct block_notification n = {
> +                       .watch.type     = WATCH_TYPE_BLOCK_NOTIFY,
> +                       .watch.subtype  = blk_notifications[idx],
> +                       .watch.info     = sizeof(n),
> +                       .dev            = req->rq_disk ? disk_devt(req->rq_disk) : 0,
> +                       .sector         = blk_rq_pos(req),
> +               };
> +               post_block_notification(&n);
> +       }
> +#endif
>  }

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
                     ` (2 preceding siblings ...)
  2019-05-28 19:14   ` Jann Horn
@ 2019-05-28 22:28   ` David Howells
  2019-05-28 23:16     ` Jann Horn
  3 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-28 22:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> I don't see you setting any special properties on the VMA that would
> prevent userspace from extending its size via mremap() - no
> VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
> access here?

Should I just set VM_DONTEXPAND in watch_queue_mmap()?  Like so:

	vma->vm_flags |= VM_DONTEXPAND;

> > +static void watch_queue_map_pages(struct vm_fault *vmf,
> > +                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
> ...
> 
> Same as above.

Same solution as above?  Or do I need ot check start/end_pgoff too?

> > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       struct watch_queue *wqueue = file->private_data;
> > +
> > +       if (vma->vm_pgoff != 0 ||
> > +           vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> > +           !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> > +               return -EINVAL;
> 
> This thing should probably have locking against concurrent
> watch_queue_set_size()?

Yeah.

	static int watch_queue_mmap(struct file *file,
				    struct vm_area_struct *vma)
	{
		struct watch_queue *wqueue = file->private_data;
		struct inode *inode = file_inode(file);
		u8 nr_pages;

		inode_lock(inode);
		nr_pages = wqueue->nr_pages;
		inode_unlock(inode);

		if (nr_pages == 0 ||
		...
			return -EINVAL;

> > +       smp_store_release(&buf->meta.head, len);
> 
> Why is this an smp_store_release()? The entire buffer should not be visible to
> userspace before this setup is complete, right?

Yes - if I put the above locking in the mmap handler.  OTOH, it's a one-off
barrier...

> > +               if (wqueue->buffer)
> > +                       return -EBUSY;
> 
> The preceding check occurs without any locks held and therefore has no
> reliable effect. It should probably be moved below the
> inode_lock(...).

Yeah, it can race.  I'll move it into watch_queue_set_size().

> > +static void free_watch(struct rcu_head *rcu)
> > +{
> > +       struct watch *watch = container_of(rcu, struct watch, rcu);
> > +
> > +       put_watch_queue(rcu_access_pointer(watch->queue));
> 
> This should be rcu_dereference_protected(..., 1).

That shouldn't be necessary.  rcu_access_pointer()'s description says:

 * It is also permissible to use rcu_access_pointer() when read-side
 * access to the pointer was removed at least one grace period ago, as
 * is the case in the context of the RCU callback that is freeing up
 * the data, ...

It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd
struct should be fine with rcu_access_pointer().

> > +       /* We don't need the watch list lock for the next bit as RCU is
> > +        * protecting everything from being deallocated.
> 
> Does "everything" mean "the wqueue" or more than that?

Actually, just 'wqueue' and its buffer.  'watch' is held by us once we've
dequeued it as we now own the ref 'wlist' had on it.  'wlist' and 'wq' must be
pinned by the caller.

> > +                       if (release) {
> > +                               if (wlist->release_watch) {
> > +                                       rcu_read_unlock();
> > +                                       /* This might need to call dput(), so
> > +                                        * we have to drop all the locks.
> > +                                        */
> > +                                       wlist->release_watch(wlist, watch);
> 
> How are you holding a reference to `wlist` here? You got the reference through
> rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
> that stabilizes the reference.

The watch record must hold a ref on the watched object if the watch_list has a
->release_watch() method.  In the code snippet above, the watch record now
belongs to us because we unlinked it under the wlist->lock some lines prior.

However, you raise a good point, and I think the thing to do is to cache
->release_watch from it and not pass wlist into (*release_watch)().  We don't
need to concern ourselves with cleaning up *wlist as it will be cleaned up
when the target object is removed.

Keyrings don't have a ->release_watch method and neither does the block-layer
notification stuff.

> > +       if (wqueue->pages && wqueue->pages[0])
> > +               WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
> 
> Is there a reason why there couldn't still be references to the pages
> from get_user_pages()/get_user_pages_fast()?

I'm not sure.  I'm not sure what to do if there are.  What do you suggest?

> > +       n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
> 
> Should the non-atomic modification of n->info

n's an unpublished copy of some userspace data that's local to the function
instance.  There shouldn't be any way to race with it at this point.

> (and perhaps also the
> following uses of ->debug) be protected by some lock?
> 
> > +       if (post_one_notification(wqueue, n, file->f_cred))
> > +               wqueue->debug = 0;
> > +       else
> > +               wqueue->debug++;
> > +       ret = len;
> > +       if (wqueue->debug > 20)
> > +               ret = -EIO;

It's for debugging purposes, so the non-atomiticity doesn't matter.  I'll
#undef the symbol to disable the code.

> > +#define IOC_WATCH_QUEUE_SET_SIZE       _IO('s', 0x01)  /* Set the size in pages */
> > +#define IOC_WATCH_QUEUE_SET_FILTER     _IO('s', 0x02)  /* Set the filter */
> 
> Should these ioctl numbers be registered in
> Documentation/ioctl/ioctl-number.txt?

Quite possibly.  I'm not sure where's best to actually allocate it.  I picked
's' out of the air.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
  2019-05-28 20:06   ` Jann Horn
@ 2019-05-28 23:04   ` David Howells
  2019-05-28 23:23     ` Jann Horn
  2019-05-29 11:16     ` David Howells
  2019-05-28 23:08   ` David Howells
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 23:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> It might make sense to redesign this stuff so that watches don't hold
> references on the object being watched.

I explicitly made it hold a reference so that if you place a watch on an
automounted mount it stops it from expiring.

Further, if I create a watch on something, *should* it be unmountable, just as
if I had a file open there or had chdir'd into there?

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
  2019-05-28 20:06   ` Jann Horn
  2019-05-28 23:04   ` David Howells
@ 2019-05-28 23:08   ` David Howells
  2019-05-29 10:55   ` David Howells
  2019-05-29 11:00   ` David Howells
  4 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-28 23:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

David Howells <dhowells@redhat.com> wrote:

> > It might make sense to redesign this stuff so that watches don't hold
> > references on the object being watched.
>
> I explicitly made it hold a reference so that if you place a watch on an
> automounted mount it stops it from expiring.
> 
> Further, if I create a watch on something, *should* it be unmountable, just as
> if I had a file open there or had chdir'd into there?

It gets trickier than that as I need a ref on the dentry on which the watch is
rooted to prevent it from getting culled.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 17:30   ` David Howells
@ 2019-05-28 23:12     ` Greg KH
  2019-05-29 16:06     ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Greg KH @ 2019-05-28 23:12 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

On Tue, May 28, 2019 at 06:30:20PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > > Implement a misc device that implements a general notification queue as a
> > > ring buffer that can be mmap()'d from userspace.
> > 
> > "general" but just for filesystems, right?  :(
> 
> Whatever gave you that idea?  You can watch keyrings events, for example -
> they're not exactly filesystems.  I've added the ability to watch for mount
> topology changes and superblock events because those are something I've been
> asked to do.  I've added something for block events because I've recently had
> a problem with trying to recover data from a dodgy disk in that every time the
> disk goes offline, the ddrecover goes "wheeeee!" as it just sees a lot of
> EIO/ENODATA at a great rate of knots because it doesn't know the driver is now
> ignoring the disk.
> 
> I don't know what else people might want to watch, but I've tried to make it
> as generic as possible so as not to exclude it if possible.

Ok, let me try to dig up some older proposals to see if this fits into
the same model to work with them as well.

> > > +	refcount_t		usage;
> > 
> > Usage of what, this structure?  Or something else?
> 
> This is the number of usages of this struct (references to if you prefer).  I
> can add a comment to this effect.

I think you answer this later on with the kref comment :)

> > > +		return -EOPNOTSUPP;
> > 
> > -ENOTTY is the correct "not a valid ioctl" error value, right?
> 
> fs/ioctl.c does both, but I can switch it if it makes you happier.

It does.

> > > +void put_watch_queue(struct watch_queue *wqueue)
> > > +{
> > > +	if (refcount_dec_and_test(&wqueue->usage))
> > > +		kfree_rcu(wqueue, rcu);
> > 
> > Why not just use a kref?
> 
> Why use a kref?  It seems like an effort to be a C++ base class, but without
> the C++ inheritance bit.  Using kref doesn't seem to gain anything.  It's just
> a wrapper around refcount_t - so why not just use a refcount_t?
> 
> kref_put() could potentially add an unnecessary extra stack frame and would
> seem to be best avoided, though an optimising compiler ought to be able to
> inline if it can.

If kref_put() is on your fast path, you have worse problems (kfree isn't
fast, right?)

Anyway, it's an inline function, how can it add an extra stack frame?
Don't try to optimize something that isn't needed yet.

> Are you now on the convert all refcounts to krefs path?

"now"?  Remember, I wrote kref all those years ago, everyone should use
it.  It saves us having to audit the same pattern over and over again.
And, even nicer, it uses a refcount now, and as you are trying to
reference count an object, it is exactly what this was written for.

So yes, I do think it should be used here, unless it is deemed to not
fit the pattern/usage model.

> > > +EXPORT_SYMBOL(add_watch_to_object);
> > 
> > Naming nit, shouldn't the "prefix" all be the same for these new
> > functions?
> > 
> > watch_queue_add_object()?  watch_queue_put()?  And so on?
> 
> Naming is fun.  watch_queue_add_object - that suggests something different to
> what the function actually does.  I'll think about adjusting the names.

Ok, just had to say something.  It's your call, and yes, naming is hard.

> > > +module_exit(watch_queue_exit);
> > 
> > module_misc_device()?
> 
> 	warthog>git grep module_misc_device -- Documentation/
> 	warthog1>

Do I have to document all helper macros?  Anyway, it saves you
boilerplate code, but if built in, it's at the module init level, not
the fs init level, like you are asking for here.  So that might not
work, it's your call.

> > > +		struct {
> > > +			struct watch_notification watch; /* WATCH_TYPE_SKIP */
> > > +			volatile __u32	head;		/* Ring head index */
> > > +			volatile __u32	tail;		/* Ring tail index */
> > 
> > A uapi structure that has volatile in it?  Are you _SURE_ this is
> > correct?
> > 
> > That feels wrong to me...  This is not a backing-hardware register, it's
> > "just memory" and slapping volatile on it shouldn't be the correct
> > solution for telling the compiler to not to optimize away reads/flushes,
> > right?  You need a proper memory access type primitive for that to work
> > correctly everywhere I thought.
> > 
> > We only have 2 users of volatile in include/uapi, one for WMI structures
> > that are backed by firmware (seems correct), and one for DRM which I
> > have no idea how it works as it claims to be a lock.  Why is this new
> > addition the correct way to do this that no other ring-buffer that was
> > mmapped has needed to?
> 
> Yeah, I understand your concern with this.
> 
> The reason I put the volatiles in is that the kernel may be modifying the head
> pointer on one CPU simultaneously with userspace modifying the tail pointer on
> another CPU.
> 
> Note that userspace does not need to enter the kernel to find out if there's
> anything in the buffer or to read stuff out of the buffer.  Userspace only
> needs to enter the kernel, using poll() or similar, to wait for something to
> appear in the buffer.

And how does the tracing and perf ring buffers do this without needing
volatile?  Why not use the same type of interface they provide, as it's
always good to share code that has already had all of the nasty corner
cases worked out.

Anyway, I don't want you to think I don't like this code/idea overall, I
do.  I just want to see it be something that everyone can use, and use
easily, as it has been something lots of people have been asking for for
a long time.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 22:28   ` David Howells
@ 2019-05-28 23:16     ` Jann Horn
  0 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 23:16 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Wed, May 29, 2019 at 12:28 AM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > I don't see you setting any special properties on the VMA that would
> > prevent userspace from extending its size via mremap() - no
> > VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
> > access here?
>
> Should I just set VM_DONTEXPAND in watch_queue_mmap()?  Like so:
>
>         vma->vm_flags |= VM_DONTEXPAND;

Yeah, that should work.

> > > +static void watch_queue_map_pages(struct vm_fault *vmf,
> > > +                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
> > ...
> >
> > Same as above.
>
> Same solution as above?  Or do I need ot check start/end_pgoff too?

Same solution.

> > > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> > > +{
> > > +       struct watch_queue *wqueue = file->private_data;
> > > +
> > > +       if (vma->vm_pgoff != 0 ||
> > > +           vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> > > +           !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> > > +               return -EINVAL;
> >
> > This thing should probably have locking against concurrent
> > watch_queue_set_size()?
>
> Yeah.
>
>         static int watch_queue_mmap(struct file *file,
>                                     struct vm_area_struct *vma)
>         {
>                 struct watch_queue *wqueue = file->private_data;
>                 struct inode *inode = file_inode(file);
>                 u8 nr_pages;
>
>                 inode_lock(inode);
>                 nr_pages = wqueue->nr_pages;
>                 inode_unlock(inode);
>
>                 if (nr_pages == 0 ||
>                 ...
>                         return -EINVAL;

Looks reasonable.

> > > +       smp_store_release(&buf->meta.head, len);
> >
> > Why is this an smp_store_release()? The entire buffer should not be visible to
> > userspace before this setup is complete, right?
>
> Yes - if I put the above locking in the mmap handler.  OTOH, it's a one-off
> barrier...
>
> > > +               if (wqueue->buffer)
> > > +                       return -EBUSY;
> >
> > The preceding check occurs without any locks held and therefore has no
> > reliable effect. It should probably be moved below the
> > inode_lock(...).
>
> Yeah, it can race.  I'll move it into watch_queue_set_size().

Sounds good.

> > > +static void free_watch(struct rcu_head *rcu)
> > > +{
> > > +       struct watch *watch = container_of(rcu, struct watch, rcu);
> > > +
> > > +       put_watch_queue(rcu_access_pointer(watch->queue));
> >
> > This should be rcu_dereference_protected(..., 1).
>
> That shouldn't be necessary.  rcu_access_pointer()'s description says:
>
>  * It is also permissible to use rcu_access_pointer() when read-side
>  * access to the pointer was removed at least one grace period ago, as
>  * is the case in the context of the RCU callback that is freeing up
>  * the data, ...
>
> It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd
> struct should be fine with rcu_access_pointer().

Aah, whoops, you're right, I missed that paragraph in the
documentation of rcu_access_pointer().

> > > +       /* We don't need the watch list lock for the next bit as RCU is
> > > +        * protecting everything from being deallocated.
> >
> > Does "everything" mean "the wqueue" or more than that?
>
> Actually, just 'wqueue' and its buffer.  'watch' is held by us once we've
> dequeued it as we now own the ref 'wlist' had on it.  'wlist' and 'wq' must be
> pinned by the caller.
>
> > > +                       if (release) {
> > > +                               if (wlist->release_watch) {
> > > +                                       rcu_read_unlock();
> > > +                                       /* This might need to call dput(), so
> > > +                                        * we have to drop all the locks.
> > > +                                        */
> > > +                                       wlist->release_watch(wlist, watch);
> >
> > How are you holding a reference to `wlist` here? You got the reference through
> > rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
> > that stabilizes the reference.
>
> The watch record must hold a ref on the watched object if the watch_list has a
> ->release_watch() method.  In the code snippet above, the watch record now
> belongs to us because we unlinked it under the wlist->lock some lines prior.

Ah, of course.

> However, you raise a good point, and I think the thing to do is to cache
> ->release_watch from it and not pass wlist into (*release_watch)().  We don't
> need to concern ourselves with cleaning up *wlist as it will be cleaned up
> when the target object is removed.
>
> Keyrings don't have a ->release_watch method and neither does the block-layer
> notification stuff.
>
> > > +       if (wqueue->pages && wqueue->pages[0])
> > > +               WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
> >
> > Is there a reason why there couldn't still be references to the pages
> > from get_user_pages()/get_user_pages_fast()?
>
> I'm not sure.  I'm not sure what to do if there are.  What do you suggest?

I would use put_page() instead of manually freeing it; I think that
should be enough? I'm not entirely sure though.

> > > +       n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
> >
> > Should the non-atomic modification of n->info
>
> n's an unpublished copy of some userspace data that's local to the function
> instance.  There shouldn't be any way to race with it at this point.

Ah, derp. Yes.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 23:04   ` David Howells
@ 2019-05-28 23:23     ` Jann Horn
  2019-05-29 11:16     ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-28 23:23 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Wed, May 29, 2019 at 1:04 AM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > It might make sense to redesign this stuff so that watches don't hold
> > references on the object being watched.
>
> I explicitly made it hold a reference so that if you place a watch on an
> automounted mount it stops it from expiring.
>
> Further, if I create a watch on something, *should* it be unmountable, just as
> if I had a file open there or had chdir'd into there?

I don't really know. I guess it depends on how it's being used? If
someone decides to e.g. make a file browser that installs watches for
a bunch of mountpoints for some fancy sidebar showing the device
mounts on the system, or something like that, that probably shouldn't
inhibit unmounting... I don't know if that's a realistic use case.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (6 preceding siblings ...)
  2019-05-28 16:02 ` [PATCH 7/7] Add sample notification program David Howells
@ 2019-05-28 23:58 ` Greg KH
  2019-05-29  6:33 ` Amir Goldstein
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: Greg KH @ 2019-05-28 23:58 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

On Tue, May 28, 2019 at 05:01:47PM +0100, David Howells wrote:
> Things I want to avoid:
> 
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
> 
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
> 
>  (3) Letting users see events they shouldn't be able to see.

How are you handling namespaces then?  Are they determined by the
namespace of the process that opened the original device handle, or the
namespace that made the new syscall for the events to "start flowing"?

Am I missing the logic that determines this in the patches, or is that
not implemented yet?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (7 preceding siblings ...)
  2019-05-28 23:58 ` [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications Greg KH
@ 2019-05-29  6:33 ` Amir Goldstein
  2019-05-29 14:25   ` Jan Kara
  2019-05-29  6:45 ` David Howells
  2019-05-29  9:09 ` David Howells
  10 siblings, 1 reply; 65+ messages in thread
From: Amir Goldstein @ 2019-05-29  6:33 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, linux-fsdevel, linux-api, linux-block,
	keyrings, LSM List, linux-kernel, Jan Kara

On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
>
>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>      mount reconfiguration.
>
>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>      errors (not complete yet).
>
>  (3) Block layer events, such as I/O errors.
>
>  (4) Key/keyring events, such as creating, linking and removal of keys.
>
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
>
>
> Design decisions:
>
>  (1) A misc chardev is used to create and open a ring buffer:
>
>         fd = open("/dev/watch_queue", O_RDWR);
>
>      which is then configured and mmap'd into userspace:
>
>         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED, fd, 0);
>
>      The fd cannot be read or written (though there is a facility to use
>      write to inject records for debugging) and userspace just pulls data
>      directly out of the buffer.
>
>  (2) The ring index pointers are stored inside the ring and are thus
>      accessible to userspace.  Userspace should only update the tail
>      pointer and never the head pointer or risk breaking the buffer.  The
>      kernel checks that the pointers appear valid before trying to use
>      them.  A 'skip' record is maintained around the pointers.
>
>  (3) poll() can be used to wait for data to appear in the buffer.
>
>  (4) Records in the buffer are binary, typed and have a length so that they
>      can be of varying size.
>
>      This means that multiple heterogeneous sources can share a common
>      buffer.  Tags may be specified when a watchpoint is created to help
>      distinguish the sources.
>
>  (5) The queue is reusable as there are 16 million types available, of
>      which I've used 4, so there is scope for others to be used.
>
>  (6) Records are filterable as types have up to 256 subtypes that can be
>      individually filtered.  Other filtration is also available.
>
>  (7) Each time the buffer is opened, a new buffer is created - this means
>      that there's no interference between watchers.
>
>  (8) When recording a notification, the kernel will not sleep, but will
>      rather mark a queue as overrun if there's insufficient space, thereby
>      avoiding userspace causing the kernel to hang.
>
>  (9) The 'watchpoint' should be specific where possible, meaning that you
>      specify the object that you want to watch.
>
> (10) The buffer is created and then watchpoints are attached to it, using
>      one of:
>
>         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
>      where in all three cases, fd indicates the queue and the number after
>      is a tag between 0 and 255.
>
> (11) The watch must be removed if either the watch buffer is destroyed or
>      the watched object is destroyed.
>
>
> Things I want to avoid:
>
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
>
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
>
>  (3) Letting users see events they shouldn't be able to see.
>
>
> Further things that could be considered:
>
>  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>      "children" of that keyring, such that the watch is removed from the
>      child if it is unlinked from the keyring.
>
>  (2) Adding global superblock event queue.
>
>  (3) Propagating watches to child superblock over automounts.
>

David,

I am interested to know how you envision filesystem notifications would
look with this interface.

fanotify can certainly benefit from providing a ring buffer interface to read
events.

From what I have seen, a common practice of users is to monitor mounts
(somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
It'd be good if those users can use a single watch mechanism/API for
watching the mount namespace and filesystem events within mounts.

A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
It provides users with two complete different mechanisms to watch error
and filesystem events. That is generally not a good thing to have.

I am not asking that you implement fs_notify() before merging sb_notify()
and I understand that you have a use case for sb_notify().
I am asking that you show me the path towards a unified API (how a
typical program would look like), so that we know before merging your
new API that it could be extended to accommodate fsnotify events
where the final result will look wholesome to users.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (8 preceding siblings ...)
  2019-05-29  6:33 ` Amir Goldstein
@ 2019-05-29  6:45 ` David Howells
  2019-05-29  7:40   ` Amir Goldstein
  2019-05-29  9:09 ` David Howells
  10 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-29  6:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: dhowells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel, Jan Kara

Amir Goldstein <amir73il@gmail.com> wrote:

> I am interested to know how you envision filesystem notifications would
> look with this interface.

What sort of events are you thinking of by "filesystem notifications"?  You
mean things like file changes?

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29  6:45 ` David Howells
@ 2019-05-29  7:40   ` Amir Goldstein
  0 siblings, 0 replies; 65+ messages in thread
From: Amir Goldstein @ 2019-05-29  7:40 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, linux-fsdevel, linux-api, linux-block,
	keyrings, LSM List, linux-kernel, Jan Kara

On Wed, May 29, 2019 at 9:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
>
> What sort of events are you thinking of by "filesystem notifications"?  You
> mean things like file changes?

I mean all the events provided by
http://man7.org/linux/man-pages/man7/fanotify.7.html

Which was recently (v4.20) extended to support watching a super block
and more recently (v5.1) extended to support watching directory entry
modifications.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
                   ` (9 preceding siblings ...)
  2019-05-29  6:45 ` David Howells
@ 2019-05-29  9:09 ` David Howells
  2019-05-29 15:41   ` Casey Schaufler
  10 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-29  9:09 UTC (permalink / raw)
  To: Greg KH, casey
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

Greg KH <gregkh@linuxfoundation.org> wrote:

> >  (3) Letting users see events they shouldn't be able to see.
> 
> How are you handling namespaces then?  Are they determined by the
> namespace of the process that opened the original device handle, or the
> namespace that made the new syscall for the events to "start flowing"?

So far I haven't had to deal directly with namespaces.

mount_notify() requires you to have access to the mountpoint you want to watch
- and the entire tree rooted there is in one namespace, so your event sources
are restricted to that namespace.  Further, mount objects don't themselves
have any other namespaces, not even a user_ns.

sb_notify() requires you to have access to the superblock you want to watch.
superblocks aren't directly namespaced as a class, though individual
superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
thinking some of these should be marked unwatchable (all pseudo superblocks,
kernfs-class, proc, for example).

Superblocks, however, do each have a user_ns - but you were allowed to access
the superblock by pathwalk, so you must have some access to the user_ns - I
think.

KEYCTL_NOTIFY requires you to have View access on the key you're watching.
Currently, keys have no real namespace restrictions, though I have patches to
include a namespace tag in the lookup criteria.

block_notify() doesn't require any direct access since you're watching a
global queue and there is no blockdev namespacing.  LSMs are given the option
to filter events, though.  The thought here is that if you can access dmesg,
you should be able to watch for blockdev events.


Actually, thinking further on this, restricting access to events is trickier
than I thought and than perhaps Casey was suggesting.

Say you're watching a mount object and someone in a different user_ns
namespace or with a different security label mounts on it.  What governs
whether you are allowed to see the event?

You're watching the object for changes - and it *has* changed.  Further, you
might be able to see the result of this change by other means (/proc/mounts,
for instance).

Should you be denied the event based on the security model?

On the other hand, if you're watching a tree of mount objects, it could be
argued that you should be denied access to events on any mount object you
can't reach by pathwalk.

On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
should get an event for it.


> How are you handling namespaces then?

So to go back to the original question.  At the moment they haven't impinged
directly and I haven't had to deal with them directly.  There are indirect
namespace restrictions that I get for free just due to pathwalk, for instance.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
                     ` (2 preceding siblings ...)
  2019-05-28 23:08   ` David Howells
@ 2019-05-29 10:55   ` David Howells
  2019-05-29 11:00   ` David Howells
  4 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-29 10:55 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> > +                       /* Global root? */
> > +                       if (mnt != parent) {
> > +                               cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
> > +                               mnt = parent;
> > +                               cursor.mnt = &mnt->mnt;
> > +                               continue;
> > +                       }
> > +                       break;
> 
> (nit: this would look clearer if you inverted the condition and wrote
> it as "if (mnt == parent) break;", then you also wouldn't need that
> "continue" or the braces)

It does look better with the logic inverted, but you *do* still need the
continue.  After the if-statement, there is:

	cursor.dentry = cursor.dentry->d_parent;

which we need to skip.  It might make sense to move that into an
else-statement from an aesthetic point of view.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
                     ` (3 preceding siblings ...)
  2019-05-29 10:55   ` David Howells
@ 2019-05-29 11:00   ` David Howells
  2019-05-29 15:53     ` Casey Schaufler
  4 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-29 11:00 UTC (permalink / raw)
  To: Jann Horn, casey
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> > +void post_mount_notification(struct mount *changed,
> > +                            struct mount_notification *notify)
> > +{
> > +       const struct cred *cred = current_cred();
> 
> This current_cred() looks bogus to me. Can't mount topology changes
> come from all sorts of places? For example, umount_mnt() from
> umount_tree() from dissolve_on_fput() from __fput(), which could
> happen pretty much anywhere depending on where the last reference gets
> dropped?

IIRC, that's what Casey argued is the right thing to do from a security PoV.
Casey?

Maybe I should pass in NULL creds in the case that an event is being generated
because an object is being destroyed due to the last usage[*] being removed.

 [*] Usage, not ref - Superblocks are a bit weird in their accounting.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-28 23:04   ` David Howells
  2019-05-28 23:23     ` Jann Horn
@ 2019-05-29 11:16     ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-29 11:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> I don't really know. I guess it depends on how it's being used? If
> someone decides to e.g. make a file browser that installs watches for
> a bunch of mountpoints for some fancy sidebar showing the device
> mounts on the system, or something like that, that probably shouldn't
> inhibit unmounting... I don't know if that's a realistic use case.

In such a use case, I would envision the browser putting a watch on "/".  A
watch sees all events in the subtree rooted at that point and you must apply a
filter that filters them out if you're not interested (filter on
WATCH_INFO_IN_SUBTREE using info_filter and info_mask).

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/7] vfs: Add superblock notifications
  2019-05-28 16:02 ` [PATCH 4/7] vfs: Add superblock notifications David Howells
  2019-05-28 20:27   ` Jann Horn
@ 2019-05-29 12:58   ` David Howells
  2019-05-29 14:16     ` Jann Horn
  1 sibling, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-29 12:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list

Jann Horn <jannh@google.com> wrote:

> It might make sense to require that the path points to the root inode
> of the superblock? That way you wouldn't be able to do this on a bind
> mount that exposes part of a shared filesystem to a container.

Why prevent that?  It doesn't prevent the container denizen from watching a
bind mount that exposes the root of a shared filesystem into a container.

It probably makes sense to permit the LSM to rule on whether a watch may be
emplaced, however.

> > +                       ret = add_watch_to_object(watch, s->s_watchers);
> > +                       if (ret == 0) {
> > +                               spin_lock(&sb_lock);
> > +                               s->s_count++;
> > +                               spin_unlock(&sb_lock);
> 
> Why do watches hold references on the superblock they're watching?

Fair point.  It was necessary at one point, but I don't think it is now.  I'll
see if I can remove it.  Note that it doesn't stop a superblock from being
unmounted and destroyed.

> > +                       }
> > +               }
> > +               up_write(&s->s_umount);
> > +               if (ret < 0)
> > +                       kfree(watch);
> > +       } else if (s->s_watchers) {
> 
> This should probably have something like a READ_ONCE() for clarity?

Note that I think I'll rearrange this to:

	} else {
		ret = -EBADSLT;
		if (s->s_watchers) {
			down_write(&s->s_umount);
			ret = remove_watch_from_object(s->s_watchers, wqueue,
						       s->s_unique_id, false);
			up_write(&s->s_umount);
		}
	}

I'm not sure READ_ONCE() is necessary, since s_watchers can only be
instantiated once and the watch list then persists until the superblock is
deactivated.  Furthermore, by the time deactivate_locked_super() is called, we
can't be calling sb_notify() on it as it's become inaccessible.

So if we see s->s_watchers as non-NULL, we should not see anything different
inside the lock.  In fact, I should be able to rewrite the above to:

	} else {
		ret = -EBADSLT;
		wlist = s->s_watchers;
		if (wlist) {
			down_write(&s->s_umount);
			ret = remove_watch_from_object(wlist, wqueue,
						       s->s_unique_id, false);
			up_write(&s->s_umount);
		}
	}

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/7] vfs: Add superblock notifications
  2019-05-29 12:58   ` David Howells
@ 2019-05-29 14:16     ` Jann Horn
  0 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-29 14:16 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list

On Wed, May 29, 2019 at 2:58 PM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > It might make sense to require that the path points to the root inode
> > of the superblock? That way you wouldn't be able to do this on a bind
> > mount that exposes part of a shared filesystem to a container.
>
> Why prevent that?  It doesn't prevent the container denizen from watching a
> bind mount that exposes the root of a shared filesystem into a container.

Well, yes, but if you expose the root of the shared filesystem to the
container, the container is probably meant to have a higher level of
access than if only a bind mount is exposed? But I don't know.

> It probably makes sense to permit the LSM to rule on whether a watch may be
> emplaced, however.

We should have some sort of reasonable policy outside of LSM code
though - the kernel should still be secure even if no LSMs are built
into it.

> > > +                       }
> > > +               }
> > > +               up_write(&s->s_umount);
> > > +               if (ret < 0)
> > > +                       kfree(watch);
> > > +       } else if (s->s_watchers) {
> >
> > This should probably have something like a READ_ONCE() for clarity?
>
> Note that I think I'll rearrange this to:
>
>         } else {
>                 ret = -EBADSLT;
>                 if (s->s_watchers) {
>                         down_write(&s->s_umount);
>                         ret = remove_watch_from_object(s->s_watchers, wqueue,
>                                                        s->s_unique_id, false);
>                         up_write(&s->s_umount);
>                 }
>         }
>
> I'm not sure READ_ONCE() is necessary, since s_watchers can only be
> instantiated once and the watch list then persists until the superblock is
> deactivated.  Furthermore, by the time deactivate_locked_super() is called, we
> can't be calling sb_notify() on it as it's become inaccessible.
>
> So if we see s->s_watchers as non-NULL, we should not see anything different
> inside the lock.  In fact, I should be able to rewrite the above to:
>
>         } else {
>                 ret = -EBADSLT;
>                 wlist = s->s_watchers;
>                 if (wlist) {
>                         down_write(&s->s_umount);
>                         ret = remove_watch_from_object(wlist, wqueue,
>                                                        s->s_unique_id, false);
>                         up_write(&s->s_umount);
>                 }
>         }

I'm extremely twitchy when it comes to code like this because AFAIK
gcc at least used to sometimes turn code that read a value from memory
and then used it multiple times into something with multiple memory
reads, leading to critical security vulnerabilities; see e.g. slide 36
of <https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf>.
I am not aware of any spec that requires the compiler to only perform
one read from the memory location in code like this.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29  6:33 ` Amir Goldstein
@ 2019-05-29 14:25   ` Jan Kara
  2019-05-29 15:10     ` Greg KH
                       ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Jan Kara @ 2019-05-29 14:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: David Howells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel, Jan Kara

On Wed 29-05-19 09:33:35, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> >
> >
> > Hi Al,
> >
> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> >
> >  (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >      mount reconfiguration.
> >
> >  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >      errors (not complete yet).
> >
> >  (3) Block layer events, such as I/O errors.
> >
> >  (4) Key/keyring events, such as creating, linking and removal of keys.
> >
> > One of the reasons for this is so that we can remove the issue of processes
> > having to repeatedly and regularly scan /proc/mounts, which has proven to
> > be a system performance problem.  To further aid this, the fsinfo() syscall
> > on which this patch series depends, provides a way to access superblock and
> > mount information in binary form without the need to parse /proc/mounts.
> >
> >
> > Design decisions:
> >
> >  (1) A misc chardev is used to create and open a ring buffer:
> >
> >         fd = open("/dev/watch_queue", O_RDWR);
> >
> >      which is then configured and mmap'd into userspace:
> >
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> >         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> >                    MAP_SHARED, fd, 0);
> >
> >      The fd cannot be read or written (though there is a facility to use
> >      write to inject records for debugging) and userspace just pulls data
> >      directly out of the buffer.
> >
> >  (2) The ring index pointers are stored inside the ring and are thus
> >      accessible to userspace.  Userspace should only update the tail
> >      pointer and never the head pointer or risk breaking the buffer.  The
> >      kernel checks that the pointers appear valid before trying to use
> >      them.  A 'skip' record is maintained around the pointers.
> >
> >  (3) poll() can be used to wait for data to appear in the buffer.
> >
> >  (4) Records in the buffer are binary, typed and have a length so that they
> >      can be of varying size.
> >
> >      This means that multiple heterogeneous sources can share a common
> >      buffer.  Tags may be specified when a watchpoint is created to help
> >      distinguish the sources.
> >
> >  (5) The queue is reusable as there are 16 million types available, of
> >      which I've used 4, so there is scope for others to be used.
> >
> >  (6) Records are filterable as types have up to 256 subtypes that can be
> >      individually filtered.  Other filtration is also available.
> >
> >  (7) Each time the buffer is opened, a new buffer is created - this means
> >      that there's no interference between watchers.
> >
> >  (8) When recording a notification, the kernel will not sleep, but will
> >      rather mark a queue as overrun if there's insufficient space, thereby
> >      avoiding userspace causing the kernel to hang.
> >
> >  (9) The 'watchpoint' should be specific where possible, meaning that you
> >      specify the object that you want to watch.
> >
> > (10) The buffer is created and then watchpoints are attached to it, using
> >      one of:
> >
> >         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> >         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> >         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> >
> >      where in all three cases, fd indicates the queue and the number after
> >      is a tag between 0 and 255.
> >
> > (11) The watch must be removed if either the watch buffer is destroyed or
> >      the watched object is destroyed.
> >
> >
> > Things I want to avoid:
> >
> >  (1) Introducing features that make the core VFS dependent on the network
> >      stack or networking namespaces (ie. usage of netlink).
> >
> >  (2) Dumping all this stuff into dmesg and having a daemon that sits there
> >      parsing the output and distributing it as this then puts the
> >      responsibility for security into userspace and makes handling
> >      namespaces tricky.  Further, dmesg might not exist or might be
> >      inaccessible inside a container.
> >
> >  (3) Letting users see events they shouldn't be able to see.
> >
> >
> > Further things that could be considered:
> >
> >  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
> >      "children" of that keyring, such that the watch is removed from the
> >      child if it is unlinked from the keyring.
> >
> >  (2) Adding global superblock event queue.
> >
> >  (3) Propagating watches to child superblock over automounts.
> >
> 
> David,
> 
> I am interested to know how you envision filesystem notifications would
> look with this interface.
> 
> fanotify can certainly benefit from providing a ring buffer interface to read
> events.
> 
> From what I have seen, a common practice of users is to monitor mounts
> (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> It'd be good if those users can use a single watch mechanism/API for
> watching the mount namespace and filesystem events within mounts.
> 
> A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> It provides users with two complete different mechanisms to watch error
> and filesystem events. That is generally not a good thing to have.
> 
> I am not asking that you implement fs_notify() before merging sb_notify()
> and I understand that you have a use case for sb_notify().
> I am asking that you show me the path towards a unified API (how a
> typical program would look like), so that we know before merging your
> new API that it could be extended to accommodate fsnotify events
> where the final result will look wholesome to users.

Are you sure we want to combine notification about file changes etc. with
administrator-type notifications about the filesystem? To me these two
sound like rather different (although sometimes related) things.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29 14:25   ` Jan Kara
@ 2019-05-29 15:10     ` Greg KH
  2019-05-29 15:53     ` Amir Goldstein
  2019-06-04 12:33     ` David Howells
  2 siblings, 0 replies; 65+ messages in thread
From: Greg KH @ 2019-05-29 15:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, David Howells, Al Viro, Ian Kent, linux-fsdevel,
	linux-api, linux-block, keyrings, LSM List, linux-kernel

On Wed, May 29, 2019 at 04:25:04PM +0200, Jan Kara wrote:
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
> 
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.

This patchset is looking to create a "generic" kernel notification
system, so I think the question is valid.  It's up to the requestor to
ask for the specific type of notification.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29  9:09 ` David Howells
@ 2019-05-29 15:41   ` Casey Schaufler
  0 siblings, 0 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 15:41 UTC (permalink / raw)
  To: David Howells, Greg KH
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel, casey

On 5/29/2019 2:09 AM, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>>>  (3) Letting users see events they shouldn't be able to see.
>> How are you handling namespaces then?  Are they determined by the
>> namespace of the process that opened the original device handle, or the
>> namespace that made the new syscall for the events to "start flowing"?
> So far I haven't had to deal directly with namespaces.
>
> mount_notify() requires you to have access to the mountpoint you want to watch
> - and the entire tree rooted there is in one namespace, so your event sources
> are restricted to that namespace.  Further, mount objects don't themselves
> have any other namespaces, not even a user_ns.
>
> sb_notify() requires you to have access to the superblock you want to watch.
> superblocks aren't directly namespaced as a class, though individual
> superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
> thinking some of these should be marked unwatchable (all pseudo superblocks,
> kernfs-class, proc, for example).
>
> Superblocks, however, do each have a user_ns - but you were allowed to access
> the superblock by pathwalk, so you must have some access to the user_ns - I
> think.
>
> KEYCTL_NOTIFY requires you to have View access on the key you're watching.
> Currently, keys have no real namespace restrictions, though I have patches to
> include a namespace tag in the lookup criteria.
>
> block_notify() doesn't require any direct access since you're watching a
> global queue and there is no blockdev namespacing.  LSMs are given the option
> to filter events, though.  The thought here is that if you can access dmesg,
> you should be able to watch for blockdev events.
>
>
> Actually, thinking further on this, restricting access to events is trickier
> than I thought and than perhaps Casey was suggesting.
>
> Say you're watching a mount object and someone in a different user_ns
> namespace or with a different security label mounts on it.  What governs
> whether you are allowed to see the event?

Conceptually it should be simple, but we have a variety of different
policies in the core OS, never mind what goes on inside the LSMs.
If you want to treat a notification like a signal you would only deliver
it if the process that performed the action that triggered the event
has the same UID as the process receiving the notification. Should you
decide to treat it like an IP packet only the LSMs would filter delivery.
If there are mode bits on the thing being watched shouldn't you respect
them?

> You're watching the object for changes - and it *has* changed.  Further, you
> might be able to see the result of this change by other means (/proc/mounts,
> for instance).
>
> Should you be denied the event based on the security model?

From a subject/object model view there are two objects and one
subject involved. The subject (active entity) is the process that
changes the first object, triggering an event. The watching process
(that will receive the notification) is the second object, because
its state will change (be written to) when the notification is
delivered. For the watching process to receive the notification
the changing process needs write access to the watching process.
The indirection of the notification mechanism isn't relevant.
If the changing process couldn't directly notify the watching process
it shouldn't be able to do it indirectly, either.

> On the other hand, if you're watching a tree of mount objects, it could be
> argued that you should be denied access to events on any mount object you
> can't reach by pathwalk.
>
> On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
> should get an event for it.

Right. We've done a pretty good job of muddling the security
landscape by adding spiffy features to make life easier for
particular use cases. /proc is chuck full of examples. Objects
that can be viewed in many different ways make for confusing
security models. Try explaining /proc/234/fd/2 to a security
theory student.

>> How are you handling namespaces then?
> So to go back to the original question.  At the moment they haven't impinged
> directly and I haven't had to deal with them directly.  There are indirect
> namespace restrictions that I get for free just due to pathwalk, for instance.
>
> David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29 14:25   ` Jan Kara
  2019-05-29 15:10     ` Greg KH
@ 2019-05-29 15:53     ` Amir Goldstein
  2019-05-30 11:00       ` Jan Kara
  2019-06-04 12:33     ` David Howells
  2 siblings, 1 reply; 65+ messages in thread
From: Amir Goldstein @ 2019-05-29 15:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Howells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel

> > David,
> >
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
> >
> > fanotify can certainly benefit from providing a ring buffer interface to read
> > events.
> >
> > From what I have seen, a common practice of users is to monitor mounts
> > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > It'd be good if those users can use a single watch mechanism/API for
> > watching the mount namespace and filesystem events within mounts.
> >
> > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > It provides users with two complete different mechanisms to watch error
> > and filesystem events. That is generally not a good thing to have.
> >
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
>
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.
>

Well I am sure that ring buffer for fanotify events would be useful, so
seeing that David is proposing a generic notification mechanism, I wanted
to know how that mechanism could best share infrastructure with fsnotify.

But apart from that I foresee the questions from users about why the
mount notification API and filesystem events API do not have better
integration.

The way I see it, the notification queue can serve several classes
of notifications and fsnotify could be one of those classes
(at least FAN_CLASS_NOTIF fits nicely to the model).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 11:00   ` David Howells
@ 2019-05-29 15:53     ` Casey Schaufler
  2019-05-29 16:12       ` Jann Horn
  2019-05-29 17:13       ` Andy Lutomirski
  0 siblings, 2 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 15:53 UTC (permalink / raw)
  To: David Howells, Jann Horn
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list


[-- Attachment #1.1: Type: text/plain, Size: 1386 bytes --]

On 5/29/2019 4:00 AM, David Howells wrote:
> Jann Horn <jannh@google.com> wrote:
>
>>> +void post_mount_notification(struct mount *changed,
>>> +                            struct mount_notification *notify)
>>> +{
>>> +       const struct cred *cred = current_cred();
>> This current_cred() looks bogus to me. Can't mount topology changes
>> come from all sorts of places? For example, umount_mnt() from
>> umount_tree() from dissolve_on_fput() from __fput(), which could
>> happen pretty much anywhere depending on where the last reference gets
>> dropped?
> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> Casey?

You need to identify the credential of the subject that triggered
the event. If it isn't current_cred(), the cred needs to be passed
in to post_mount_notification(), or derived by some other means.

> Maybe I should pass in NULL creds in the case that an event is being generated
> because an object is being destroyed due to the last usage[*] being removed.

You should pass the cred of the process that removed the
last usage. If the last usage was removed by something like
the power being turned off on a disk drive a system cred
should be used. Someone or something caused the event. It can
be important who it was.

>  [*] Usage, not ref - Superblocks are a bit weird in their accounting.
>
> David


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-28 17:30   ` David Howells
  2019-05-28 23:12     ` Greg KH
@ 2019-05-29 16:06     ` David Howells
  2019-05-29 17:46       ` Jann Horn
                         ` (6 more replies)
  1 sibling, 7 replies; 65+ messages in thread
From: David Howells @ 2019-05-29 16:06 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

Greg KH <gregkh@linuxfoundation.org> wrote:

> > kref_put() could potentially add an unnecessary extra stack frame and would
> > seem to be best avoided, though an optimising compiler ought to be able to
> > inline if it can.
> 
> If kref_put() is on your fast path, you have worse problems (kfree isn't
> fast, right?)
> 
> Anyway, it's an inline function, how can it add an extra stack frame?

The call to the function pointer.  Hopefully the compiler will optimise that
away for an inlineable function.

> > Are you now on the convert all refcounts to krefs path?
> 
> "now"?  Remember, I wrote kref all those years ago,

Yes - and I thought it wasn't a good idea at the time.  But this is the first
time you've mentioned it to me, let alone pushed to change to it, that I
recall.

> everyone should use
> it.  It saves us having to audit the same pattern over and over again.
> And, even nicer, it uses a refcount now, and as you are trying to
> reference count an object, it is exactly what this was written for.
> 
> So yes, I do think it should be used here, unless it is deemed to not
> fit the pattern/usage model.

kref_put() enforces a very specific destructor signature.  I know of places
where that doesn't work because the destructor takes more than one argument
(granted that this is not the case here).  So why does kref_put() exist at
all?  Why not kref_dec_and_test()?

Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
seem redundant.

Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
because it seems I'm not allowed refcount_inc/dec_return() and I want to get
at the point refcount for tracing purposes.

> > > > +module_exit(watch_queue_exit);
> > > 
> > > module_misc_device()?
> > 
> > 	warthog>git grep module_misc_device -- Documentation/
> > 	warthog1>
> 
> Do I have to document all helper macros?

If you add an API, documenting it is your privilege ;-)  It's an important
test of the API - if you can't describe it, it's probably wrong.

Now I will grant that you didn't add that function...

> Anyway, it saves you boilerplate code, but if built in, it's at the module
> init level, not the fs init level, like you are asking for here.  So that
> might not work, it's your call.

Actually, I probably shouldn't have a module exit function.  It can't be a
module as it's called by core code.  I'll switch to builtin_misc_device().

> And how does the tracing and perf ring buffers do this without needing
> volatile?  Why not use the same type of interface they provide, as it's
> always good to share code that has already had all of the nasty corner
> cases worked out.

I've no idea how trace does it - or even where - or even if.  As far as I can
see, grepping for mmap in kernel/trace/*, there's no mmap support.

Reading Documentation/trace/ring-buffer-design.txt the trace subsystem has
some sort of transient page fifo which is a lot more complicated than what I
want and doesn't look like it'll be mmap'able.

Looking at the perf ring buffer, there appears to be a missing barrier in
perf_aux_output_end():

	rb->user_page->aux_head = rb->aux_head;

should be:

	smp_store_release(&rb->user_page->aux_head, rb->aux_head);

It should also be using smp_load_acquire().  See
Documentation/core-api/circular-buffers.rst

And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 15:53     ` Casey Schaufler
@ 2019-05-29 16:12       ` Jann Horn
  2019-05-29 17:04         ` Casey Schaufler
  2019-06-03 16:30         ` David Howells
  2019-05-29 17:13       ` Andy Lutomirski
  1 sibling, 2 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-29 16:12 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: David Howells, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski

On Wed, May 29, 2019 at 5:53 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 4:00 AM, David Howells wrote:
> > Jann Horn <jannh@google.com> wrote:
> >
> >>> +void post_mount_notification(struct mount *changed,
> >>> +                            struct mount_notification *notify)
> >>> +{
> >>> +       const struct cred *cred = current_cred();
> >> This current_cred() looks bogus to me. Can't mount topology changes
> >> come from all sorts of places? For example, umount_mnt() from
> >> umount_tree() from dissolve_on_fput() from __fput(), which could
> >> happen pretty much anywhere depending on where the last reference gets
> >> dropped?
> > IIRC, that's what Casey argued is the right thing to do from a security PoV.
> > Casey?
>
> You need to identify the credential of the subject that triggered
> the event. If it isn't current_cred(), the cred needs to be passed
> in to post_mount_notification(), or derived by some other means.
>
> > Maybe I should pass in NULL creds in the case that an event is being generated
> > because an object is being destroyed due to the last usage[*] being removed.
>
> You should pass the cred of the process that removed the
> last usage. If the last usage was removed by something like
> the power being turned off on a disk drive a system cred
> should be used. Someone or something caused the event. It can
> be important who it was.

The kernel's normal security model means that you should be able to
e.g. accept FDs that random processes send you and perform
read()/write() calls on them without acting as a subject in any
security checks; let alone close(). If you send a file descriptor over
a unix domain socket and the unix domain socket is garbage collected,
for example, I think the close() will just come from some random,
completely unrelated task that happens to trigger the garbage
collector?

Also, I think if someone does I/O via io_uring, I think the caller's
credentials for read/write operations will probably just be normal
kernel creds?

Here the checks probably aren't all that important, but in other
places, when people try to use an LSM as the primary line of defense,
checks that don't align with the kernel's normal security model might
lead to a bunch of problems.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 16:12       ` Jann Horn
@ 2019-05-29 17:04         ` Casey Schaufler
  2019-06-03 16:30         ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 17:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Howells, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski, casey

On 5/29/2019 9:12 AM, Jann Horn wrote:
> On Wed, May 29, 2019 at 5:53 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 4:00 AM, David Howells wrote:
>>> Jann Horn <jannh@google.com> wrote:
>>>
>>>>> +void post_mount_notification(struct mount *changed,
>>>>> +                            struct mount_notification *notify)
>>>>> +{
>>>>> +       const struct cred *cred = current_cred();
>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>> come from all sorts of places? For example, umount_mnt() from
>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>> happen pretty much anywhere depending on where the last reference gets
>>>> dropped?
>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>> Casey?
>> You need to identify the credential of the subject that triggered
>> the event. If it isn't current_cred(), the cred needs to be passed
>> in to post_mount_notification(), or derived by some other means.
>>
>>> Maybe I should pass in NULL creds in the case that an event is being generated
>>> because an object is being destroyed due to the last usage[*] being removed.
>> You should pass the cred of the process that removed the
>> last usage. If the last usage was removed by something like
>> the power being turned off on a disk drive a system cred
>> should be used. Someone or something caused the event. It can
>> be important who it was.
> The kernel's normal security model means that you should be able to
> e.g. accept FDs that random processes send you and perform
> read()/write() calls on them without acting as a subject in any
> security checks; let alone close().

Passed file descriptors are an anomaly in the security model
that (in this developer's opinion) should have never been
included. More than one of the "B" level UNIX systems disabled
them outright. 

>  If you send a file descriptor over
> a unix domain socket and the unix domain socket is garbage collected,
> for example, I think the close() will just come from some random,
> completely unrelated task that happens to trigger the garbage
> collector?

I never said this was going to be easy or pleasant.
Who destroyed the UDS? It didn't just spontaneously become
garbage. Well, not on modern Linux filesystems, anyway. 

> Also, I think if someone does I/O via io_uring, I think the caller's
> credentials for read/write operations will probably just be normal
> kernel creds?
>
> Here the checks probably aren't all that important, but in other
> places, when people try to use an LSM as the primary line of defense,
> checks that don't align with the kernel's normal security model might
> lead to a bunch of problems.

The kernel does not have a "normal security model". It has a
collection of disparate and almost but not quite contradictory
models for the various objects and mechanisms it implements.
It already has a bunch of problems, we're just used to them.

I can only send a signal to a process with the same UID. Why doesn't
a process have mode bits so that I could get signals from my group?

Why do IPC object have creator bits, while files don't?

Why can I send a file descriptor over a UDS, but not a message queue?

Why can't I set the mode bits on a symlink?

What can go wrong if I don't map groups into a user namespace?

LSMs (SELinux and Smack, which are classic mandatory access control
systems in particular) are more consistent, but still have to deal with
some of these differences. A symlink gets a Smack label, for example.

The point being that it's very easy to add new mechanisms that do
wonderful things but that introduce unforeseen ways to bypass one
or more of the existing protections.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 15:53     ` Casey Schaufler
  2019-05-29 16:12       ` Jann Horn
@ 2019-05-29 17:13       ` Andy Lutomirski
  2019-05-29 17:46         ` Casey Schaufler
  1 sibling, 1 reply; 65+ messages in thread
From: Andy Lutomirski @ 2019-05-29 17:13 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list



> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 5/29/2019 4:00 AM, David Howells wrote:
>> Jann Horn <jannh@google.com> wrote:
>> 
>>>> +void post_mount_notification(struct mount *changed,
>>>> +                            struct mount_notification *notify)
>>>> +{
>>>> +       const struct cred *cred = current_cred();
>>> This current_cred() looks bogus to me. Can't mount topology changes
>>> come from all sorts of places? For example, umount_mnt() from
>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>> happen pretty much anywhere depending on where the last reference gets
>>> dropped?
>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>> Casey?
> 
> You need to identify the credential of the subject that triggered
> the event. If it isn't current_cred(), the cred needs to be passed
> in to post_mount_notification(), or derived by some other means.

Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.

(And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 17:13       ` Andy Lutomirski
@ 2019-05-29 17:46         ` Casey Schaufler
  2019-05-29 18:11           ` Jann Horn
  2019-05-29 23:12           ` Andy Lutomirski
  0 siblings, 2 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 17:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey

On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>
>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>> Jann Horn <jannh@google.com> wrote:
>>>
>>>>> +void post_mount_notification(struct mount *changed,
>>>>> +                            struct mount_notification *notify)
>>>>> +{
>>>>> +       const struct cred *cred = current_cred();
>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>> come from all sorts of places? For example, umount_mnt() from
>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>> happen pretty much anywhere depending on where the last reference gets
>>>> dropped?
>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>> Casey?
>> You need to identify the credential of the subject that triggered
>> the event. If it isn't current_cred(), the cred needs to be passed
>> in to post_mount_notification(), or derived by some other means.
> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.

There are two filesystems, "dot" and "dash". I am not allowed
to communicate with Fred on the system, and all precautions have
been taken to ensure I cannot. Fred asks for notifications on
all mount activity. I perform actions that result in notifications
on "dot" and "dash". Fred receives notifications and interprets
them using Morse code. This is not OK. If Wilma, who *is* allowed
to communicate with Fred, does the same actions, he should be
allowed to get the messages via Morse.

The event is information. The information is generated as a
result of my or Wilma's action. Fred is passive in this access.
Fred is not "reading" the event. The event is being written to
Fred. My process is the subject, and Fred's the object.

Other security modelers may disagree. The models they produce
are going to be *very* complicated and will introduce agents and
intermediate objects to justify Fred's reception of an event as
a read operation.

> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)

The receiver is the process that gets the event. There may
be more than one receiver, and the receivers may have different
credentials. Each needs to be checked separately.

Isn't this starting to sound like the discussions on kdbus?
I'm not sure if that deserves a :) or a :( but probably one of the two.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
@ 2019-05-29 17:46       ` Jann Horn
  2019-05-29 21:02       ` David Howells
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Jann Horn @ 2019-05-29 17:46 UTC (permalink / raw)
  To: David Howells
  Cc: Greg KH, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list, Kees Cook,
	Kernel Hardening

On Wed, May 29, 2019 at 6:07 PM David Howells <dhowells@redhat.com> wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> > everyone should use
> > it.  It saves us having to audit the same pattern over and over again.
> > And, even nicer, it uses a refcount now, and as you are trying to
> > reference count an object, it is exactly what this was written for.
> >
> > So yes, I do think it should be used here, unless it is deemed to not
> > fit the pattern/usage model.
>
> kref_put() enforces a very specific destructor signature.  I know of places
> where that doesn't work because the destructor takes more than one argument
> (granted that this is not the case here).  So why does kref_put() exist at
> all?  Why not kref_dec_and_test()?
>
> Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
> seem redundant.
>
> Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
> because it seems I'm not allowed refcount_inc/dec_return() and I want to get
> at the point refcount for tracing purposes.

Yeeech, that's horrible, please don't do that.

Does this mean that refcount_read() isn't sufficient for what you want
to do with tracing (because for some reason you actually need to know
the values atomically at the time of increment/decrement)?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 17:46         ` Casey Schaufler
@ 2019-05-29 18:11           ` Jann Horn
  2019-05-29 19:28             ` Casey Schaufler
  2019-05-29 23:12           ` Andy Lutomirski
  1 sibling, 1 reply; 65+ messages in thread
From: Jann Horn @ 2019-05-29 18:11 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list

On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
> >> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> On 5/29/2019 4:00 AM, David Howells wrote:
> >>> Jann Horn <jannh@google.com> wrote:
> >>>
> >>>>> +void post_mount_notification(struct mount *changed,
> >>>>> +                            struct mount_notification *notify)
> >>>>> +{
> >>>>> +       const struct cred *cred = current_cred();
> >>>> This current_cred() looks bogus to me. Can't mount topology changes
> >>>> come from all sorts of places? For example, umount_mnt() from
> >>>> umount_tree() from dissolve_on_fput() from __fput(), which could
> >>>> happen pretty much anywhere depending on where the last reference gets
> >>>> dropped?
> >>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> >>> Casey?
> >> You need to identify the credential of the subject that triggered
> >> the event. If it isn't current_cred(), the cred needs to be passed
> >> in to post_mount_notification(), or derived by some other means.
> > Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>
> There are two filesystems, "dot" and "dash". I am not allowed
> to communicate with Fred on the system, and all precautions have
> been taken to ensure I cannot. Fred asks for notifications on
> all mount activity. I perform actions that result in notifications
> on "dot" and "dash". Fred receives notifications and interprets
> them using Morse code. This is not OK. If Wilma, who *is* allowed
> to communicate with Fred, does the same actions, he should be
> allowed to get the messages via Morse.

In other words, a classic covert channel. You can't really prevent two
cooperating processes from communicating through a covert channel on a
modern computer. You can transmit information through the scheduler,
through hyperthread resource sharing, through CPU data caches, through
disk contention, through page cache state, through RAM contention, and
probably dozens of other ways that I can't think of right now. There
have been plenty of papers that demonstrated things like an SSH
connection between two virtual machines without network access running
on the same physical host (<https://gruss.cc/files/hello.pdf>),
communication between a VM and a browser running on the host system,
and so on.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 18:11           ` Jann Horn
@ 2019-05-29 19:28             ` Casey Schaufler
  2019-05-29 19:47               ` Jann Horn
  0 siblings, 1 reply; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 19:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey

On 5/29/2019 11:11 AM, Jann Horn wrote:
> On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>
>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>> +                            struct mount_notification *notify)
>>>>>>> +{
>>>>>>> +       const struct cred *cred = current_cred();
>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>> dropped?
>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>> Casey?
>>>> You need to identify the credential of the subject that triggered
>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>> in to post_mount_notification(), or derived by some other means.
>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>> There are two filesystems, "dot" and "dash". I am not allowed
>> to communicate with Fred on the system, and all precautions have
>> been taken to ensure I cannot. Fred asks for notifications on
>> all mount activity. I perform actions that result in notifications
>> on "dot" and "dash". Fred receives notifications and interprets
>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>> to communicate with Fred, does the same actions, he should be
>> allowed to get the messages via Morse.
> In other words, a classic covert channel. You can't really prevent two
> cooperating processes from communicating through a covert channel on a
> modern computer.

That doesn't give you permission to design them in.
Plus, the LSMs that implement mandatory access controls
are going to want to intervene. No unclassified user
should see notifications caused by Top Secret users.

>  You can transmit information through the scheduler,
> through hyperthread resource sharing, through CPU data caches, through
> disk contention, through page cache state, through RAM contention, and
> probably dozens of other ways that I can't think of right now.

Yeah, and there's been a lot of activity to reduce those,
which are hard to exploit, as opposed to this, which would
be trivial and obvious.

> There
> have been plenty of papers that demonstrated things like an SSH
> connection between two virtual machines without network access running
> on the same physical host (<https://gruss.cc/files/hello.pdf>),
> communication between a VM and a browser running on the host system,
> and so on.

So you're saying we shouldn't have mode bits on files because
spectre/meltdown makes them pointless?



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 19:28             ` Casey Schaufler
@ 2019-05-29 19:47               ` Jann Horn
  2019-05-29 20:50                 ` Casey Schaufler
  0 siblings, 1 reply; 65+ messages in thread
From: Jann Horn @ 2019-05-29 19:47 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list

On Wed, May 29, 2019 at 9:28 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 11:11 AM, Jann Horn wrote:
> > On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
> >>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>> On 5/29/2019 4:00 AM, David Howells wrote:
> >>>>> Jann Horn <jannh@google.com> wrote:
> >>>>>
> >>>>>>> +void post_mount_notification(struct mount *changed,
> >>>>>>> +                            struct mount_notification *notify)
> >>>>>>> +{
> >>>>>>> +       const struct cred *cred = current_cred();
> >>>>>> This current_cred() looks bogus to me. Can't mount topology changes
> >>>>>> come from all sorts of places? For example, umount_mnt() from
> >>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
> >>>>>> happen pretty much anywhere depending on where the last reference gets
> >>>>>> dropped?
> >>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> >>>>> Casey?
> >>>> You need to identify the credential of the subject that triggered
> >>>> the event. If it isn't current_cred(), the cred needs to be passed
> >>>> in to post_mount_notification(), or derived by some other means.
> >>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
> >> There are two filesystems, "dot" and "dash". I am not allowed
> >> to communicate with Fred on the system, and all precautions have
> >> been taken to ensure I cannot. Fred asks for notifications on
> >> all mount activity. I perform actions that result in notifications
> >> on "dot" and "dash". Fred receives notifications and interprets
> >> them using Morse code. This is not OK. If Wilma, who *is* allowed
> >> to communicate with Fred, does the same actions, he should be
> >> allowed to get the messages via Morse.
> > In other words, a classic covert channel. You can't really prevent two
> > cooperating processes from communicating through a covert channel on a
> > modern computer.
>
> That doesn't give you permission to design them in.
> Plus, the LSMs that implement mandatory access controls
> are going to want to intervene. No unclassified user
> should see notifications caused by Top Secret users.

But that's probably because they're worried about *side* channels, not
covert channels?

Talking about this in the context of (small) side channels: The
notification types introduced in this patch are mostly things that a
user would be able to observe anyway if they polled /proc/self/mounts,
right? It might make sense to align access controls based on that - if
you don't want it to be possible to observe events happening on some
mount points through this API, you should probably lock down
/proc/*/mounts equivalently, by introducing an LSM hook for "is @cred
allowed to see @mnt" or something like that - and if you want to
compare two cred structures, you could record the cred structure that
is responsible for the creation of the mount point, or something like
that.

For some of the other patches, I guess things get more tricky because
the notification exposes new information that wasn't really available
before.

> >  You can transmit information through the scheduler,
> > through hyperthread resource sharing, through CPU data caches, through
> > disk contention, through page cache state, through RAM contention, and
> > probably dozens of other ways that I can't think of right now.
>
> Yeah, and there's been a lot of activity to reduce those,
> which are hard to exploit, as opposed to this, which would
> be trivial and obvious.
>
> > There
> > have been plenty of papers that demonstrated things like an SSH
> > connection between two virtual machines without network access running
> > on the same physical host (<https://gruss.cc/files/hello.pdf>),
> > communication between a VM and a browser running on the host system,
> > and so on.
>
> So you're saying we shouldn't have mode bits on files because
> spectre/meltdown makes them pointless?

spectre/meltdown are vulnerabilities that are being mitigated.
Microarchitectural covert channels are an accepted fact and I haven't
heard of anyone seriously considering trying to get rid of them all.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 19:47               ` Jann Horn
@ 2019-05-29 20:50                 ` Casey Schaufler
  0 siblings, 0 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 20:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey

On 5/29/2019 12:47 PM, Jann Horn wrote:
> On Wed, May 29, 2019 at 9:28 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 11:11 AM, Jann Horn wrote:
>>> On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>>>
>>>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>>>> +                            struct mount_notification *notify)
>>>>>>>>> +{
>>>>>>>>> +       const struct cred *cred = current_cred();
>>>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>>>> dropped?
>>>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>>>> Casey?
>>>>>> You need to identify the credential of the subject that triggered
>>>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>>>> in to post_mount_notification(), or derived by some other means.
>>>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>>>> There are two filesystems, "dot" and "dash". I am not allowed
>>>> to communicate with Fred on the system, and all precautions have
>>>> been taken to ensure I cannot. Fred asks for notifications on
>>>> all mount activity. I perform actions that result in notifications
>>>> on "dot" and "dash". Fred receives notifications and interprets
>>>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>>>> to communicate with Fred, does the same actions, he should be
>>>> allowed to get the messages via Morse.
>>> In other words, a classic covert channel. You can't really prevent two
>>> cooperating processes from communicating through a covert channel on a
>>> modern computer.
>> That doesn't give you permission to design them in.
>> Plus, the LSMs that implement mandatory access controls
>> are going to want to intervene. No unclassified user
>> should see notifications caused by Top Secret users.
> But that's probably because they're worried about *side* channels, not
> covert channels?

The security evaluators from the 1990's considered any channel
with greater than 1 bit/second bandwidth a show-stopper. That was
true for covert and side channels. Further, if you knew that a
mechanism had a channel, as this one does, and you didn't fix it,
you didn't get your certificate. If you know about a problem
during the design/implementation phase it's really inexcusable not
to fix it before "completing" the code.

> Talking about this in the context of (small) side channels: The
> notification types introduced in this patch are mostly things that a
> user would be able to observe anyway if they polled /proc/self/mounts,
> right?

It's supposed to be a general mechanism. Of course it would
be simpler if is was restricted to things you can get at via
/proc/self.

>  It might make sense to align access controls based on that - if
> you don't want it to be possible to observe events happening on some
> mount points through this API, you should probably lock down
> /proc/*/mounts equivalently, by introducing an LSM hook for "is @cred
> allowed to see @mnt" or something like that - and if you want to
> compare two cred structures, you could record the cred structure that
> is responsible for the creation of the mount point, or something like
> that.

I'm not going to argue against that.

> For some of the other patches, I guess things get more tricky because
> the notification exposes new information that wasn't really available
> before.

We have to look not just at the information being available,
but the mechanism used. Being able to look at information about
a process in /proc doesn't mean I should be able to look at it
using ptrace(). Access control isn't done on data, it's done on
objects. That I can get information by looking in one object provides
no assurance that I can get it through a different object containing
the same information. This happens in /dev all over the place. A
file with hard links may be accessible by one path but not another.

>
>>>  You can transmit information through the scheduler,
>>> through hyperthread resource sharing, through CPU data caches, through
>>> disk contention, through page cache state, through RAM contention, and
>>> probably dozens of other ways that I can't think of right now.
>> Yeah, and there's been a lot of activity to reduce those,
>> which are hard to exploit, as opposed to this, which would
>> be trivial and obvious.
>>
>>> There
>>> have been plenty of papers that demonstrated things like an SSH
>>> connection between two virtual machines without network access running
>>> on the same physical host (<https://gruss.cc/files/hello.pdf>),
>>> communication between a VM and a browser running on the host system,
>>> and so on.
>> So you're saying we shouldn't have mode bits on files because
>> spectre/meltdown makes them pointless?
> spectre/meltdown are vulnerabilities that are being mitigated.
> Microarchitectural covert channels are an accepted fact and I haven't
> heard of anyone seriously considering trying to get rid of them all.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
  2019-05-29 17:46       ` Jann Horn
@ 2019-05-29 21:02       ` David Howells
  2019-05-31 11:14         ` Peter Zijlstra
  2019-05-31 12:02         ` David Howells
  2019-05-29 23:09       ` Greg KH
                         ` (4 subsequent siblings)
  6 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-29 21:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening

Jann Horn <jannh@google.com> wrote:

> Does this mean that refcount_read() isn't sufficient for what you want
> to do with tracing (because for some reason you actually need to know
> the values atomically at the time of increment/decrement)?

Correct.  There's a gap and if an interrupt or something occurs, it's
sufficiently big for the refcount trace to go weird.

I've seen it in afs/rxrpc where the incoming network packets that are part of
the rxrpc call flow disrupt the refcounts noted in trace lines.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
  2019-05-29 17:46       ` Jann Horn
  2019-05-29 21:02       ` David Howells
@ 2019-05-29 23:09       ` Greg KH
  2019-05-29 23:11       ` Greg KH
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Greg KH @ 2019-05-29 23:09 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > > kref_put() could potentially add an unnecessary extra stack frame and would
> > > seem to be best avoided, though an optimising compiler ought to be able to
> > > inline if it can.
> > 
> > If kref_put() is on your fast path, you have worse problems (kfree isn't
> > fast, right?)
> > 
> > Anyway, it's an inline function, how can it add an extra stack frame?
> 
> The call to the function pointer.  Hopefully the compiler will optimise that
> away for an inlineable function.

The function pointer only gets called for the last "put", and then kfree
will be called so you should not have to worry about speed/stack frames
at that point in time.

> > > Are you now on the convert all refcounts to krefs path?
> > 
> > "now"?  Remember, I wrote kref all those years ago,
> 
> Yes - and I thought it wasn't a good idea at the time.  But this is the first
> time you've mentioned it to me, let alone pushed to change to it, that I
> recall.

I bring up using a kref any time I see a usage that could use it as it
makes it easier for people to understand and "know" you are doing your
reference counting for your object "correctly".  It's an abstraction
that is used to make it easier for us developers to understand.
Otherwise you have to hand-roll the same logic here.  Yes, refcounts
have made it easier to do it in your own (which was their goal), but you
still don't have to do it "on your own".

Anyway, I'll not push the issue here, if you want to stick to a
refcount_t, that's enough for now.  We can worry about changing this
later after you have debugged all the corner conditions :)

> > everyone should use
> > it.  It saves us having to audit the same pattern over and over again.
> > And, even nicer, it uses a refcount now, and as you are trying to
> > reference count an object, it is exactly what this was written for.
> > 
> > So yes, I do think it should be used here, unless it is deemed to not
> > fit the pattern/usage model.
> 
> kref_put() enforces a very specific destructor signature.  I know of places
> where that doesn't work because the destructor takes more than one argument
> (granted that this is not the case here).  So why does kref_put() exist at
> all?  Why not kref_dec_and_test()?

The destructor only takes one object pointer as you are finally freeing
that object.  What more do you need/want to "know" at that point in
time?

What would kref_dec_and_test() be needed for?

> Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
> seem redundant.

kref uses refcount_t and provides a different functionality on top of
it.  Not all uses of a refcount in the kernel is for object lifecycle
reference counting, as you know :)

> Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
> because it seems I'm not allowed refcount_inc/dec_return() and I want to get
> at the point refcount for tracing purposes.

That's not good, we should address that independently as you are loosing
functionality/protection when doing that.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
                         ` (2 preceding siblings ...)
  2019-05-29 23:09       ` Greg KH
@ 2019-05-29 23:11       ` Greg KH
  2019-05-30  9:50         ` Andrea Parri
  2019-05-31  8:47       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  6 siblings, 1 reply; 65+ messages in thread
From: Greg KH @ 2019-05-29 23:11 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel

On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> > And how does the tracing and perf ring buffers do this without needing
> > volatile?  Why not use the same type of interface they provide, as it's
> > always good to share code that has already had all of the nasty corner
> > cases worked out.
> 
> I've no idea how trace does it - or even where - or even if.  As far as I can
> see, grepping for mmap in kernel/trace/*, there's no mmap support.
> 
> Reading Documentation/trace/ring-buffer-design.txt the trace subsystem has
> some sort of transient page fifo which is a lot more complicated than what I
> want and doesn't look like it'll be mmap'able.
> 
> Looking at the perf ring buffer, there appears to be a missing barrier in
> perf_aux_output_end():
> 
> 	rb->user_page->aux_head = rb->aux_head;
> 
> should be:
> 
> 	smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> 
> It should also be using smp_load_acquire().  See
> Documentation/core-api/circular-buffers.rst
> 
> And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249

So, if that's all that needs to be fixed, can you use the same
buffer/code if that patch is merged?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 17:46         ` Casey Schaufler
  2019-05-29 18:11           ` Jann Horn
@ 2019-05-29 23:12           ` Andy Lutomirski
  2019-05-29 23:56             ` Casey Schaufler
  1 sibling, 1 reply; 65+ messages in thread
From: Andy Lutomirski @ 2019-05-29 23:12 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list



> On May 29, 2019, at 10:46 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>> 
>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> 
>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>> Jann Horn <jannh@google.com> wrote:
>>>> 
>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>> +                            struct mount_notification *notify)
>>>>>> +{
>>>>>> +       const struct cred *cred = current_cred();
>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>> dropped?
>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>> Casey?
>>> You need to identify the credential of the subject that triggered
>>> the event. If it isn't current_cred(), the cred needs to be passed
>>> in to post_mount_notification(), or derived by some other means.
>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
> 
> There are two filesystems, "dot" and "dash". I am not allowed
> to communicate with Fred on the system, and all precautions have
> been taken to ensure I cannot. Fred asks for notifications on
> all mount activity. I perform actions that result in notifications
> on "dot" and "dash". Fred receives notifications and interprets
> them using Morse code. This is not OK. If Wilma, who *is* allowed
> to communicate with Fred, does the same actions, he should be
> allowed to get the messages via Morse.

Under this scenario, Fred should not be allowed to enable these watches. If you give yourself and Fred unconstrained access to the same FS, then can communicate.

> 
> Other security modelers may disagree. The models they produce
> are going to be *very* complicated and will introduce agents and
> intermediate objects to justify Fred's reception of an event as
> a read operation.

I disagree. They’ll model the watch as something to prevent if they want to restrict communication.

> 
>> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)
> 
> The receiver is the process that gets the event. There may
> be more than one receiver, and the receivers may have different
> credentials. Each needs to be checked separately.

I think it’s a bit crazy to have the same event queue with two readers who read different things.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 23:12           ` Andy Lutomirski
@ 2019-05-29 23:56             ` Casey Schaufler
  0 siblings, 0 replies; 65+ messages in thread
From: Casey Schaufler @ 2019-05-29 23:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey

On 5/29/2019 4:12 PM, Andy Lutomirski wrote:
>
>> On May 29, 2019, at 10:46 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>
>>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>
>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>
>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>> +                            struct mount_notification *notify)
>>>>>>> +{
>>>>>>> +       const struct cred *cred = current_cred();
>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>> dropped?
>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>> Casey?
>>>> You need to identify the credential of the subject that triggered
>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>> in to post_mount_notification(), or derived by some other means.
>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>> There are two filesystems, "dot" and "dash". I am not allowed
>> to communicate with Fred on the system, and all precautions have
>> been taken to ensure I cannot. Fred asks for notifications on
>> all mount activity. I perform actions that result in notifications
>> on "dot" and "dash". Fred receives notifications and interprets
>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>> to communicate with Fred, does the same actions, he should be
>> allowed to get the messages via Morse.
> Under this scenario, Fred should not be allowed to enable these watches. If you give yourself and Fred unconstrained access to the same FS, then can communicate.

How are you going to determine at the time Fred tries to enable the watches
that I am going to do something that will trigger them? I'm not saying it isn't
possible, I'm curious how you would propose doing it. If you deny Fred the ability
to set watches because it is possible for me to trigger them, he can't use them
to get information from Wilma, either.

>
>> Other security modelers may disagree. The models they produce
>> are going to be *very* complicated and will introduce agents and
>> intermediate objects to justify Fred's reception of an event as
>> a read operation.
> I disagree. They’ll model the watch as something to prevent if they want to restrict communication.

Sorry, but that isn't sufficiently detailed to be meaningful.

>>> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)
>> The receiver is the process that gets the event. There may
>> be more than one receiver, and the receivers may have different
>> credentials. Each needs to be checked separately.
> I think it’s a bit crazy to have the same event queue with two readers who read different things.

Look at killpg(3).

The process that creates the event has to be involved in the
access decision. Otherwise you have an uncontrolled data channel.
When the receiver reads the event queue it knows nothing about the
sender, and hence cannot make the decision unless the credential of
the sender is kept with the event message, and used when the
receiver tries to access it. I don't think that wold work well with
the mechanism as designed.
 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 23:11       ` Greg KH
@ 2019-05-30  9:50         ` Andrea Parri
  2019-05-31  8:35           ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Andrea Parri @ 2019-05-30  9:50 UTC (permalink / raw)
  To: Greg KH
  Cc: David Howells, viro, raven, linux-fsdevel, linux-api,
	linux-block, keyrings, linux-security-module, linux-kernel,
	Peter Zijlstra, Will Deacon, Paul E. McKenney, Mark Rutland

> > Looking at the perf ring buffer, there appears to be a missing barrier in
> > perf_aux_output_end():
> > 
> > 	rb->user_page->aux_head = rb->aux_head;
> > 
> > should be:
> > 
> > 	smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> > 
> > It should also be using smp_load_acquire().  See
> > Documentation/core-api/circular-buffers.rst
> > 
> > And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249
> 
> So, if that's all that needs to be fixed, can you use the same
> buffer/code if that patch is merged?

That's about one year old...: let me add the usual suspects in Cc:  ;-)
since I'm not sure what the plan was (or if I'm missing something) ...

Speaking of ring buffer implementations (and relatively "old" patches),
here's another quite interesting:

  https://lkml.kernel.org/r/20181211034032.32338-1-yuleixzhang@tencent.com

Thanks,
  Andrea

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29 15:53     ` Amir Goldstein
@ 2019-05-30 11:00       ` Jan Kara
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Kara @ 2019-05-30 11:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, David Howells, Al Viro, Ian Kent, linux-fsdevel,
	linux-api, linux-block, keyrings, LSM List, linux-kernel

On Wed 29-05-19 18:53:21, Amir Goldstein wrote:
> > > David,
> > >
> > > I am interested to know how you envision filesystem notifications would
> > > look with this interface.
> > >
> > > fanotify can certainly benefit from providing a ring buffer interface to read
> > > events.
> > >
> > > From what I have seen, a common practice of users is to monitor mounts
> > > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > > It'd be good if those users can use a single watch mechanism/API for
> > > watching the mount namespace and filesystem events within mounts.
> > >
> > > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > > It provides users with two complete different mechanisms to watch error
> > > and filesystem events. That is generally not a good thing to have.
> > >
> > > I am not asking that you implement fs_notify() before merging sb_notify()
> > > and I understand that you have a use case for sb_notify().
> > > I am asking that you show me the path towards a unified API (how a
> > > typical program would look like), so that we know before merging your
> > > new API that it could be extended to accommodate fsnotify events
> > > where the final result will look wholesome to users.
> >
> > Are you sure we want to combine notification about file changes etc. with
> > administrator-type notifications about the filesystem? To me these two
> > sound like rather different (although sometimes related) things.
> >
> 
> Well I am sure that ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
> 
> But apart from that I foresee the questions from users about why the
> mount notification API and filesystem events API do not have better
> integration.
> 
> The way I see it, the notification queue can serve several classes
> of notifications and fsnotify could be one of those classes
> (at least FAN_CLASS_NOTIF fits nicely to the model).

I agree that for some type of fsnotify uses a ring buffer would make sense.
But for others - such as permission events or unlimited queues - you cannot
really use the ring buffer and I don't like the idea of having different
ways of passing fsnotify events to userspace based on notification group
type...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-30  9:50         ` Andrea Parri
@ 2019-05-31  8:35           ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-05-31  8:35 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Greg KH, David Howells, viro, raven, linux-fsdevel, linux-api,
	linux-block, keyrings, linux-security-module, linux-kernel,
	Will Deacon, Paul E. McKenney, Mark Rutland

On Thu, May 30, 2019 at 11:50:39AM +0200, Andrea Parri wrote:
> > > Looking at the perf ring buffer, there appears to be a missing barrier in
> > > perf_aux_output_end():
> > > 
> > > 	rb->user_page->aux_head = rb->aux_head;
> > > 
> > > should be:
> > > 
> > > 	smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> > > 
> > > It should also be using smp_load_acquire().  See
> > > Documentation/core-api/circular-buffers.rst
> > > 
> > > And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249
> > 
> > So, if that's all that needs to be fixed, can you use the same
> > buffer/code if that patch is merged?
> 
> That's about one year old...: let me add the usual suspects in Cc:  ;-)
> since I'm not sure what the plan was (or if I'm missing something) ...

The AUX crud is 'special' and smp_store_release() doesn't really help in
many cases. Notable, AUX is typically used in combination with a
hardware writer. The driver is in charge of odering here, the generic
code doesn't know what the appropriate barrier (if any) is and would
have to resort to the most expensive/heavy one available.

Also see the comment right above this function:

 "It is the
  pmu driver's responsibility to observe ordering rules of the hardware,
  so that all the data is externally visible before this is called."



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
                         ` (3 preceding siblings ...)
  2019-05-29 23:11       ` Greg KH
@ 2019-05-31  8:47       ` Peter Zijlstra
  2019-05-31 12:42       ` David Howells
  2019-05-31 14:55       ` David Howells
  6 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-05-31  8:47 UTC (permalink / raw)
  To: David Howells
  Cc: Greg KH, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:

> Looking at the perf ring buffer, there appears to be a missing barrier in
> perf_aux_output_end():
> 
> 	rb->user_page->aux_head = rb->aux_head;
> 
> should be:
> 
> 	smp_store_release(&rb->user_page->aux_head, rb->aux_head);

I've answered that in another email; the aux bit is 'magic'.

> It should also be using smp_load_acquire().  See
> Documentation/core-api/circular-buffers.rst

We use the control dependency instead, as described in the comment of
perf_output_put_handle():

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.
	 *
	 * In our case (A) is a control dependency that separates the load of
	 * the ->data_tail and the stores of $data. In case ->data_tail
	 * indicates there is no room in the buffer to store $data we do not.
	 *
	 * D needs to be a full barrier since it separates the data READ
	 * from the tail WRITE.
	 *
	 * For B a WMB is sufficient since it separates two WRITEs, and for C
	 * an RMB is sufficient since it separates two READs.

Userspace can choose to use smp_load_acquire() over the first smp_rmb()
if that is efficient for the architecture (for w ahole bunch of archs
load-acquire would end up using mb() while rmb() is adequate and
cheaper).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 21:02       ` David Howells
@ 2019-05-31 11:14         ` Peter Zijlstra
  2019-05-31 12:02         ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-05-31 11:14 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening

On Wed, May 29, 2019 at 10:02:43PM +0100, David Howells wrote:
> Jann Horn <jannh@google.com> wrote:
> 
> > Does this mean that refcount_read() isn't sufficient for what you want
> > to do with tracing (because for some reason you actually need to know
> > the values atomically at the time of increment/decrement)?
> 
> Correct.  There's a gap and if an interrupt or something occurs, it's
> sufficiently big for the refcount trace to go weird.
> 
> I've seen it in afs/rxrpc where the incoming network packets that are part of
> the rxrpc call flow disrupt the refcounts noted in trace lines.

Can you re-iterate the exact problem? I konw we talked about this in the
past, but I seem to have misplaced those memories :/

FWIW I agree that kref is useless fluff, but I've long ago given up on
that fight.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 21:02       ` David Howells
  2019-05-31 11:14         ` Peter Zijlstra
@ 2019-05-31 12:02         ` David Howells
  2019-05-31 13:26           ` Peter Zijlstra
  2019-05-31 14:20           ` David Howells
  1 sibling, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-31 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, Kees Cook, Kernel Hardening

Peter Zijlstra <peterz@infradead.org> wrote:

> Can you re-iterate the exact problem? I konw we talked about this in the
> past, but I seem to have misplaced those memories :/

Take this for example:

	void afs_put_call(struct afs_call *call)
	{
		struct afs_net *net = call->net;
		int n = atomic_dec_return(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n + 1, o,
			       __builtin_return_address(0));

		ASSERTCMP(n, >=, 0);
		if (n == 0) {
			...
		}
	}

I am printing the usage count in the afs_call tracepoint so that I can use it
to debug refcount bugs.  If I do it like this:

	void afs_put_call(struct afs_call *call)
	{
		int n = refcount_read(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n, o,
			       __builtin_return_address(0));

		if (refcount_dec_and_test(&call->usage)) {
			...
		}
	}

then there's a temporal gap between the usage count being read and the actual
atomic decrement in which another CPU can alter the count.  This can be
exacerbated by an interrupt occurring, a softirq occurring or someone enabling
the tracepoint.

I can't do the tracepoint after the decrement if refcount_dec_and_test()
returns false unless I save all the values from the object that I might need
as the object could be destroyed any time from that point on.  In this
particular case, that's just call->debug_id, but it could be other things in
other cases.

Note that I also can't touch the afs_net object in that situation either, and
the outstanding calls count that I record will potentially be out of date -
but there's not a lot I can do about that.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
                         ` (4 preceding siblings ...)
  2019-05-31  8:47       ` Peter Zijlstra
@ 2019-05-31 12:42       ` David Howells
  2019-05-31 14:55       ` David Howells
  6 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-31 12:42 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

Greg KH <gregkh@linuxfoundation.org> wrote:

> > kref_put() enforces a very specific destructor signature.  I know of places
> > where that doesn't work because the destructor takes more than one argument
> > (granted that this is not the case here).  So why does kref_put() exist at
> > all?  Why not kref_dec_and_test()?
>
> The destructor only takes one object pointer as you are finally freeing
> that object.  What more do you need/want to "know" at that point in
> time?

Imagine that I have an object that's on a list rooted in a namespace and that
I have a lot of these objects.  Imagine further that any time I want to put a
ref on one of these objects, it's in a context that has the namespace pinned.
I therefore don't need to store a pointer to the namespace in every object
because I can pass that in to the put function

Indeed, I can still access the namespace even after the decrement didn't
reduce the usage count to 0 - say for doing statistics.

> What would kref_dec_and_test() be needed for?

Why do you need kref_put() to take a destructor function pointer?  Why cannot
that be replaced with, say:

	static inline bool __kref_put(struct kref *k)
	{
		return refcount_dec_and_test(&k->refcount);
	}

and then one could do:

	void put_foo(struct foo_net *ns, struct foo *f)
	{
		if (__kref_put(&f->refcount)) {
			// destroy foo
		}
	}

that way the destruction code does not have to be offloaded into its own
function and you still have your pattern to look for.

For tracing purposes, I could live with something like:

	static inline
	bool __kref_put_return(struct kref *k, unsigned int *_usage)
	{
		return refcount_dec_and_test_return(&k->refcount, _usage);
	}

and then I could do:

	void put_foo(struct foo_net *ns, struct foo *f)
	{
		unsigned int u;
		bool is_zero = __kref_put_return(&f->refcount, &u);

		trace_foo_refcount(f, u);
		if (is_zero) {
			// destroy foo
		}
	}

then it could be made such that you can disable the ability of
refcount_dec_and_test_return() to pass back a useful refcount value if you
want a bit of extra speed.

Or even if refcount_dec_return() is guaranteed to return 0 if the count hits
the floor and non-zero otherwise and there's a config switch to impose a
stronger guarantee that it will return a value that's appropriately
transformed to look as if I was using atomic_dec_return().

Similarly for refcount_inc_return() - it could just return gibberish unless
the same config switch is enabled.

Question for AMD/Intel guys: I'm curious if LOCK DECL faster than LOCK XADD -1
on x86_64?

> > Why doesn't refcount_t get merged into kref, or vice versa?  Having both
> > would seem redundant.
>
> kref uses refcount_t and provides a different functionality on top of
> it.  Not all uses of a refcount in the kernel is for object lifecycle
> reference counting, as you know :)

I do?  I can't think of one offhand.  Not that I'm saying you're wrong on
that - there's an awful lot of kernel.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-31 12:02         ` David Howells
@ 2019-05-31 13:26           ` Peter Zijlstra
  2019-05-31 14:20           ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-05-31 13:26 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening

On Fri, May 31, 2019 at 01:02:15PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Can you re-iterate the exact problem? I konw we talked about this in the
> > past, but I seem to have misplaced those memories :/
> 
> Take this for example:
> 
> 	void afs_put_call(struct afs_call *call)
> 	{
> 		struct afs_net *net = call->net;
> 		int n = atomic_dec_return(&call->usage);
> 		int o = atomic_read(&net->nr_outstanding_calls);
> 
> 		trace_afs_call(call, afs_call_trace_put, n + 1, o,
> 			       __builtin_return_address(0));
> 
> 		ASSERTCMP(n, >=, 0);
> 		if (n == 0) {
> 			...
> 		}
> 	}
> 
> I am printing the usage count in the afs_call tracepoint so that I can use it
> to debug refcount bugs.  If I do it like this:
> 
> 	void afs_put_call(struct afs_call *call)
> 	{
> 		int n = refcount_read(&call->usage);
> 		int o = atomic_read(&net->nr_outstanding_calls);
> 
> 		trace_afs_call(call, afs_call_trace_put, n, o,
> 			       __builtin_return_address(0));
> 
> 		if (refcount_dec_and_test(&call->usage)) {
> 			...
> 		}
> 	}
> 
> then there's a temporal gap between the usage count being read and the actual
> atomic decrement in which another CPU can alter the count.  This can be
> exacerbated by an interrupt occurring, a softirq occurring or someone enabling
> the tracepoint.
> 
> I can't do the tracepoint after the decrement if refcount_dec_and_test()
> returns false unless I save all the values from the object that I might need
> as the object could be destroyed any time from that point on.

Is it not the responsibility of the task that affects the 1->0
transition to actually free the memory?

That is, I'm expecting the '...' in both cases above the include the
actual freeing of the object. If this is not the case, then @usage is
not a reference count.

(and it has already been established that refcount_t doesn't work for
usage count scenarios)

Aside from that, is the problem that refcount_dec_and_test() returns a
boolean (true - last put, false - not last) instead of the refcount
value? This does indeed make it hard to print the exact count value for
the event.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-31 12:02         ` David Howells
  2019-05-31 13:26           ` Peter Zijlstra
@ 2019-05-31 14:20           ` David Howells
  2019-05-31 16:44             ` Peter Zijlstra
  2019-05-31 17:12             ` David Howells
  1 sibling, 2 replies; 65+ messages in thread
From: David Howells @ 2019-05-31 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, Kees Cook, Kernel Hardening

Peter Zijlstra <peterz@infradead.org> wrote:

> Is it not the responsibility of the task that affects the 1->0
> transition to actually free the memory?
> 
> That is, I'm expecting the '...' in both cases above the include the
> actual freeing of the object. If this is not the case, then @usage is
> not a reference count.

Yes.  The '...' does the freeing.  It seemed unnecessary to include the code
ellipsised there since it's not the point of the discussion, but if you want
the full function:

	void afs_put_call(struct afs_call *call)
	{
		struct afs_net *net = call->net;
		int n = atomic_dec_return(&call->usage);
		int o = atomic_read(&net->nr_outstanding_calls);

		trace_afs_call(call, afs_call_trace_put, n + 1, o,
			       __builtin_return_address(0));

		ASSERTCMP(n, >=, 0);
		if (n == 0) {
			ASSERT(!work_pending(&call->async_work));
			ASSERT(call->type->name != NULL);

			if (call->rxcall) {
				rxrpc_kernel_end_call(net->socket, call->rxcall);
				call->rxcall = NULL;
			}
			if (call->type->destructor)
				call->type->destructor(call);

			afs_put_server(call->net, call->server);
			afs_put_cb_interest(call->net, call->cbi);
			afs_put_addrlist(call->alist);
			kfree(call->request);

			trace_afs_call(call, afs_call_trace_free, 0, o,
				       __builtin_return_address(0));
			kfree(call);

			o = atomic_dec_return(&net->nr_outstanding_calls);
			if (o == 0)
				wake_up_var(&net->nr_outstanding_calls);
		}
	}

You can see the kfree(call) in there.
Peter Zijlstra <peterz@infradead.org> wrote:

> (and it has already been established that refcount_t doesn't work for
> usage count scenarios)

?

Does that mean struct kref doesn't either?

> Aside from that, is the problem that refcount_dec_and_test() returns a
> boolean (true - last put, false - not last) instead of the refcount
> value? This does indeed make it hard to print the exact count value for
> the event.

That is the problem, yes - well, one of them: refcount_inc() doesn't either.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-29 16:06     ` David Howells
                         ` (5 preceding siblings ...)
  2019-05-31 12:42       ` David Howells
@ 2019-05-31 14:55       ` David Howells
  6 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-05-31 14:55 UTC (permalink / raw)
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel

Greg KH <gregkh@linuxfoundation.org> wrote:

> So, if that's all that needs to be fixed, can you use the same
> buffer/code if that patch is merged?

I really don't know.  The perf code is complex, partially in hardware drivers
and is tricky to understand - though a chunk of that is the "aux" buffer part;
PeterZ used words like "special" and "magic" and the comments in the code talk
about the hardware writing into the buffer.

__perf_output_begin() does not appear to be SMP safe.  It uses local_cmpxchg()
and local_add() which on x86 lack the LOCK prefix.

stracing the perf command on my test machine, it calls perf_event_open(2) four
times and mmap's each fd it gets back.  I'm guessing that each one maps a
separate buffer for each CPU.

So to use watch_queue based on perf's buffering, you would have to have a
(2^N)+1 pages-sized buffer for each CPU.  So that would be a minimum of 64K of
unswappable memory for my desktop machine, say).  Multiply that by each
process that wants to listen for events...

What I'm aiming for is something that has a single buffer used by all CPUs for
each instance of /dev/watch_queue opened and I'd also like to avoid having to
allocate the metadata page and the aux buffer to save space.  This is locked
memory and cannot be swapped.

Also, perf has to leave a gap in the ring because it uses CIRC_SPACE(), though
that's a minor detail that I guess can't be fixed now.

I'm also slightly concerned that __perf_output_begin() doesn't check if
rb->user->tail has got ahead of rb->user->head or that it's lagging too far
behind.  I doubt it's a serious problem for the kernel since it won't write
outside of the buffer, but userspace might screw up.  I think the worst that
will happen is that userspace will get confused.

One thing I would like is to waive the 2^N size requirement.  I understand
*why* we do that, but I wonder how expensive DIV instructions are for
relatively small divisors.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-31 14:20           ` David Howells
@ 2019-05-31 16:44             ` Peter Zijlstra
  2019-05-31 17:12             ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-05-31 16:44 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening

On Fri, May 31, 2019 at 03:20:12PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:

> > (and it has already been established that refcount_t doesn't work for
> > usage count scenarios)
> 
> ?
> 
> Does that mean struct kref doesn't either?

Indeed, since kref is just a pointless wrapper around refcount_t it does
not either.

The main distinction between a reference count and a usage count is that
0 means different things. For a refcount 0 means dead. For a usage count
0 is merely unused but valid.

Incrementing a 0 refcount is a serious bug -- use-after-free (and hence
refcount_t will refuse this and splat), for a usage count this is no
problem.

Now, it is sort-of possible to merge the two, by basically stating
something like: usage = refcount - 1. But that can get tricky and people
have not really liked the result much for the few times I tried.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-31 14:20           ` David Howells
  2019-05-31 16:44             ` Peter Zijlstra
@ 2019-05-31 17:12             ` David Howells
  2019-06-17 16:24               ` Peter Zijlstra
  1 sibling, 1 reply; 65+ messages in thread
From: David Howells @ 2019-05-31 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, Kees Cook, Kernel Hardening

Peter Zijlstra <peterz@infradead.org> wrote:

> > > (and it has already been established that refcount_t doesn't work for
> > > usage count scenarios)
> > 
> > ?
> > 
> > Does that mean struct kref doesn't either?
> 
> Indeed, since kref is just a pointless wrapper around refcount_t it does
> not either.
> 
> The main distinction between a reference count and a usage count is that
> 0 means different things. For a refcount 0 means dead. For a usage count
> 0 is merely unused but valid.

Ah - I consider the terms interchangeable.

Take Documentation/filesystems/vfs.txt for instance:

  dget: open a new handle for an existing dentry (this just increments
	the usage count)

  dput: close a handle for a dentry (decrements the usage count). ...

  ...

  d_lookup: look up a dentry given its parent and path name component
	It looks up the child of that given name from the dcache
	hash table. If it is found, the reference count is incremented
	and the dentry is returned. The caller must use dput()
	to free the dentry when it finishes using it.

Here we interchange the terms.

Or https://www.kernel.org/doc/gorman/html/understand/understand013.html
which seems to interchange the terms in reference to struct page.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
  2019-05-29 16:12       ` Jann Horn
  2019-05-29 17:04         ` Casey Schaufler
@ 2019-06-03 16:30         ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: David Howells @ 2019-06-03 16:30 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: dhowells, Jann Horn, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski

Casey Schaufler <casey@schaufler-ca.com> wrote:

> >> should be used. Someone or something caused the event. It can
> >> be important who it was.
> > The kernel's normal security model means that you should be able to
> > e.g. accept FDs that random processes send you and perform
> > read()/write() calls on them without acting as a subject in any
> > security checks; let alone close().
> 
> Passed file descriptors are an anomaly in the security model
> that (in this developer's opinion) should have never been
> included. More than one of the "B" level UNIX systems disabled
> them outright. 

Considering further on this, I think the only way to implement what you're
suggesting is to add a field to struct file to record the last fputter's creds
as the procedure of fputting is offloaded to a workqueue.

Note that's last fputter, not the last closer, as we don't track the number of
open fds linked to a file struct.

In the case of AF_UNIX sockets that contain in-the-process-of-being-passed fds
at the time of closure, this is further complicated by the socket fput being
achieved in the work item - thereby adding layers of indirection.

It might be possible to replace f_cred rather than adding a new field, but
that might get used somewhere after that point.

Note also that fsnotify_close() doesn't appear to use the last fputter's path
since it's not available if called from deferred fput.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
  2019-05-29 14:25   ` Jan Kara
  2019-05-29 15:10     ` Greg KH
  2019-05-29 15:53     ` Amir Goldstein
@ 2019-06-04 12:33     ` David Howells
  2 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2019-06-04 12:33 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: dhowells, Jan Kara, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel

Amir Goldstein <amir73il@gmail.com> wrote:

> Well I am sure that ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
>
> But apart from that I foresee the questions from users about why the
> mount notification API and filesystem events API do not have better
> integration.
>
> The way I see it, the notification queue can serve several classes
> of notifications and fsnotify could be one of those classes
> (at least FAN_CLASS_NOTIF fits nicely to the model).

It could be done; the main thing that concerns me is that the buffer is of
limited capacity.

However, I could take this:

	struct fanotify_event_metadata {
		__u32 event_len;
		__u8 vers;
		__u8 reserved;
		__u16 metadata_len;
		__aligned_u64 mask;
		__s32 fd;
		__s32 pid;
	};

and map it to:

	struct fanotify_notification {
		struct watch_notification watch; /* WATCH_TYPE_FANOTIFY */
		__aligned_u64	mask;
		__u16		metadata_len;
		__u8		vers;
		__u8		reserved;
		__u32		reserved2;
		__s32		fd;
		__s32		pid;
	};

and some of the watch::info bit could be used:

	n->watch.info & WATCH_INFO_OVERRUN	watch queue overran
	n->watch.info & WATCH_INFO_LENGTH	event_len
	n->watch.info & WATCH_INFO_RECURSIVE	FAN_EVENT_ON_CHILD
	n->watch.info & WATCH_INFO_FLAG_0	FAN_*_PERM
	n->watch.info & WATCH_INFO_FLAG_1	FAN_Q_OVERFLOW
	n->watch.info & WATCH_INFO_FLAG_2	FAN_ON_DIR
	n->subtype				ffs(n->mask)

Ideally, I'd dispense with metadata_len, vers, reserved* and set the version
when setting the watch.

	fanotify_watch(int watchfd, unsigned int flags, u64 *mask,
		       int dirfd, const char *pathname, unsigned int at_flags);

We might also want to extend the watch_filter to allow you to, say, filter on
the first __u64 after the watch member so that you could filter on specific
events:

	struct watch_notification_type_filter {
		__u32	type;
		__u32	info_filter;
		__u32	info_mask;
		__u32	subtype_filter[8];
		__u64	payload_mask[1];
		__u64	payload_set[1];
	};

So, in this case, it would require:

	n->mask & wf->payload_mask[0] == wf->payload_set[0]

to be true to record the message.

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
  2019-05-31 17:12             ` David Howells
@ 2019-06-17 16:24               ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2019-06-17 16:24 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening

On Fri, May 31, 2019 at 06:12:42PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > (and it has already been established that refcount_t doesn't work for
> > > > usage count scenarios)
> > > 
> > > ?
> > > 
> > > Does that mean struct kref doesn't either?
> > 
> > Indeed, since kref is just a pointless wrapper around refcount_t it does
> > not either.
> > 
> > The main distinction between a reference count and a usage count is that
> > 0 means different things. For a refcount 0 means dead. For a usage count
> > 0 is merely unused but valid.
> 
> Ah - I consider the terms interchangeable.
> 
> Take Documentation/filesystems/vfs.txt for instance:
> 
>   dget: open a new handle for an existing dentry (this just increments
> 	the usage count)
> 
>   dput: close a handle for a dentry (decrements the usage count). ...
> 
>   ...
> 
>   d_lookup: look up a dentry given its parent and path name component
> 	It looks up the child of that given name from the dcache
> 	hash table. If it is found, the reference count is incremented
> 	and the dentry is returned. The caller must use dput()
> 	to free the dentry when it finishes using it.
> 
> Here we interchange the terms.
> 
> Or https://www.kernel.org/doc/gorman/html/understand/understand013.html
> which seems to interchange the terms in reference to struct page.

Right, but we have two distinct set of semantics, I figured it makes
sense to have two different names for them. Do you have an alternative
naming scheme we could use?

Or should we better document our distinction between reference and usage
count?

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2019-06-17 16:25 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-28 16:01 [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications David Howells
2019-05-28 16:01 ` [PATCH 1/7] General notification queue with user mmap()'able ring buffer David Howells
2019-05-28 16:26   ` Greg KH
2019-05-28 17:30   ` David Howells
2019-05-28 23:12     ` Greg KH
2019-05-29 16:06     ` David Howells
2019-05-29 17:46       ` Jann Horn
2019-05-29 21:02       ` David Howells
2019-05-31 11:14         ` Peter Zijlstra
2019-05-31 12:02         ` David Howells
2019-05-31 13:26           ` Peter Zijlstra
2019-05-31 14:20           ` David Howells
2019-05-31 16:44             ` Peter Zijlstra
2019-05-31 17:12             ` David Howells
2019-06-17 16:24               ` Peter Zijlstra
2019-05-29 23:09       ` Greg KH
2019-05-29 23:11       ` Greg KH
2019-05-30  9:50         ` Andrea Parri
2019-05-31  8:35           ` Peter Zijlstra
2019-05-31  8:47       ` Peter Zijlstra
2019-05-31 12:42       ` David Howells
2019-05-31 14:55       ` David Howells
2019-05-28 19:14   ` Jann Horn
2019-05-28 22:28   ` David Howells
2019-05-28 23:16     ` Jann Horn
2019-05-28 16:02 ` [PATCH 2/7] keys: Add a notification facility David Howells
2019-05-28 16:02 ` [PATCH 3/7] vfs: Add a mount-notification facility David Howells
2019-05-28 20:06   ` Jann Horn
2019-05-28 23:04   ` David Howells
2019-05-28 23:23     ` Jann Horn
2019-05-29 11:16     ` David Howells
2019-05-28 23:08   ` David Howells
2019-05-29 10:55   ` David Howells
2019-05-29 11:00   ` David Howells
2019-05-29 15:53     ` Casey Schaufler
2019-05-29 16:12       ` Jann Horn
2019-05-29 17:04         ` Casey Schaufler
2019-06-03 16:30         ` David Howells
2019-05-29 17:13       ` Andy Lutomirski
2019-05-29 17:46         ` Casey Schaufler
2019-05-29 18:11           ` Jann Horn
2019-05-29 19:28             ` Casey Schaufler
2019-05-29 19:47               ` Jann Horn
2019-05-29 20:50                 ` Casey Schaufler
2019-05-29 23:12           ` Andy Lutomirski
2019-05-29 23:56             ` Casey Schaufler
2019-05-28 16:02 ` [PATCH 4/7] vfs: Add superblock notifications David Howells
2019-05-28 20:27   ` Jann Horn
2019-05-29 12:58   ` David Howells
2019-05-29 14:16     ` Jann Horn
2019-05-28 16:02 ` [PATCH 5/7] fsinfo: Export superblock notification counter David Howells
2019-05-28 16:02 ` [PATCH 6/7] block: Add block layer notifications David Howells
2019-05-28 20:37   ` Jann Horn
2019-05-28 16:02 ` [PATCH 7/7] Add sample notification program David Howells
2019-05-28 23:58 ` [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications Greg KH
2019-05-29  6:33 ` Amir Goldstein
2019-05-29 14:25   ` Jan Kara
2019-05-29 15:10     ` Greg KH
2019-05-29 15:53     ` Amir Goldstein
2019-05-30 11:00       ` Jan Kara
2019-06-04 12:33     ` David Howells
2019-05-29  6:45 ` David Howells
2019-05-29  7:40   ` Amir Goldstein
2019-05-29  9:09 ` David Howells
2019-05-29 15:41   ` Casey Schaufler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).