linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
@ 2021-10-25 20:46 Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE Ioannis Angelakopoulos
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

Hello,

I am a PhD student currently interning at Red Hat and working on the
virtiofs file system. I am trying to add support for the Inotify
notification subsystem in virtiofs, and I am seeking your feedback and
suggestions on the right direction to take.

Currently, virtiofs does not support the Inotify API, and there are
applications that rely on Inotify support in virtiofs (e.g., Kata
containers).

However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
are supported only by local kernel file systems.

--Proposed solution

With this RFC patch series we add inotify support to the FUSE kernel
module so that the remote virtiofs file system (based on FUSE) used by
a QEMU guest VM can make use of this feature.

Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
server and to receive remote inotify events. To achieve this we modify
the fsnotify subsystem so that it calls FUSE-specific hooks when a
remote watch is added/modified/deleted, and FUSE in turn calls into
fsnotify when a remote event is received, to deliver the event to user
space.

In our case the FUSE server is virtiofsd.

We also considered an out-of-band approach for implementing the remote
notifications (e.g., FAM, Gamin); however, that approach would break
applications that are already compatible with inotify, which would then
require updates.

These kernel patches depend on the patch series posted by Vivek Goyal:
https://lore.kernel.org/linux-fsdevel/20210930143850.1188628-1-vgoyal@redhat.com/

My PoC Linux kernel patches are here:
https://github.com/iangelak/linux/commits/inotify_v1

My PoC virtiofsd corresponding patches are here:
https://github.com/iangelak/qemu/commits/inotify_v1

--Advantages

1) Our approach is compatible with existing applications that rely on
Inotify, thus improving portability.

2) Everything is implemented in one place (virtiofs and virtiofsd) and
there is no need to run additional processes (daemons) specifically to
handle the remote notifications.

--Weaknesses

1) Both a local (QEMU guest) and a remote (host/virtiofsd) watch on
the target inode have to be active at the same time. The local watch
guarantees that events will be delivered to the guest user space, while
the remote watch captures events occurring on the host (which are then
sent to the guest).

As a result, when an event occurs on an inode within the directory
exported by virtiofs, two events are generated at the same time: a
local event (generated by the guest kernel) and a remote event
(generated by the host), so the guest receives duplicate events.

To address this issue we implemented two modes: one where local events
function as expected (when virtiofsd does not support the remote
inotify) and one where local events are suppressed and only the remote
events originating from the host side are let through (when virtiofsd
supports the remote inotify).

2) The lifetime of the local watch in the guest kernel is very
important. Specifically, the guest may miss remote events if it removes
its local watch on the target or deletes the inode (which also makes
the guest kernel remove the watch). In these cases the guest kernel
removes the local watch before the remote events arrive from the host
(virtiofsd), and so it drops all remote events for the target inode,
since the corresponding local watch no longer exists. On top of that,
virtiofsd keeps an open proc file descriptor for each inode that is not
immediately closed on an inode deletion request by the guest. As a
result, no IN_DELETE_SELF event is generated by virtiofsd and sent to
the guest kernel in this case.

3) Because virtiofsd performs additional operations while servicing a
request from the guest, inotify events beyond the ones the guest
expects might be generated and sent to the guest. However, this is not
strictly a limitation; it depends on the implementation of the remote
file system server (in this case virtiofsd).

4) The current implementation supports only Inotify, chosen for its
simplicity, and not Fanotify. Fanotify's complexity requires support
from virtiofsd that is not currently available. One such example is
fsnotify's access permission decision capabilities, which could
conflict with virtiofsd's current access permission implementation.

Ioannis Angelakopoulos (7):
  FUSE: Add the fsnotify opcode and in/out structs to FUSE
  FUSE: Add the remote inotify support capability to FUSE
  FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode
    operation
  FUSE: Add the fuse_fsnotify_send_request to FUSE
  Fsnotify: Add a wrapper around the fsnotify function
  FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation
  virtiofs: Add support for handling the remote fsnotify notifications

 fs/fuse/dev.c                    | 37 ++++++++++++++++++
 fs/fuse/dir.c                    | 56 ++++++++++++++++++++++++++++
 fs/fuse/fuse_i.h                 | 10 +++++
 fs/fuse/inode.c                  |  6 +++
 fs/fuse/virtio_fs.c              | 64 +++++++++++++++++++++++++++++++-
 fs/notify/fsnotify.c             | 35 ++++++++++++++++-
 fs/notify/inotify/inotify_user.c | 11 ++++++
 fs/notify/mark.c                 | 10 +++++
 include/linux/fs.h               |  5 +++
 include/linux/fsnotify_backend.h | 14 ++++++-
 include/uapi/linux/fuse.h        | 23 +++++++++++-
 11 files changed, 265 insertions(+), 6 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-26 14:56   ` Amir Goldstein
  2021-10-25 20:46 ` [RFC PATCH 2/7] FUSE: Add the remote inotify support capability " Ioannis Angelakopoulos
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

Since fsnotify is the backend for the inotify subsystem, all the
backend code we add is related to fsnotify.

To support an fsnotify request in FUSE and specifically virtiofs we add a
new opcode for the FSNOTIFY (51) operation request in the "fuse.h" header.

Also add the "fuse_notify_fsnotify_in" and "fuse_notify_fsnotify_out"
structs that are responsible for passing the fsnotify/inotify related data
to and from the FUSE server.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 include/uapi/linux/fuse.h | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 46838551ea84..418b7fc72417 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -186,6 +186,9 @@
  *  - add FUSE_SYNCFS
  *  7.35
  *  - add FUSE_NOTIFY_LOCK
+ *  7.36
+ *  - add FUSE_HAVE_FSNOTIFY
+ *  - add fuse_notify_fsnotify_(in,out)
  */
 
 #ifndef _LINUX_FUSE_H
@@ -221,7 +224,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 35
+#define FUSE_KERNEL_MINOR_VERSION 36
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -338,6 +341,7 @@ struct fuse_file_lock {
  *			write/truncate sgid is killed only if file has group
  *			execute permission. (Same as Linux VFS behavior).
  * FUSE_SETXATTR_EXT:	Server supports extended struct fuse_setxattr_in
+ * FUSE_HAVE_FSNOTIFY:	remote fsnotify/inotify event subsystem support
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -369,6 +373,7 @@ struct fuse_file_lock {
 #define FUSE_SUBMOUNTS		(1 << 27)
 #define FUSE_HANDLE_KILLPRIV_V2	(1 << 28)
 #define FUSE_SETXATTR_EXT	(1 << 29)
+#define FUSE_HAVE_FSNOTIFY	(1 << 30)
 
 /**
  * CUSE INIT request/reply flags
@@ -515,6 +520,7 @@ enum fuse_opcode {
 	FUSE_SETUPMAPPING	= 48,
 	FUSE_REMOVEMAPPING	= 49,
 	FUSE_SYNCFS		= 50,
+	FUSE_FSNOTIFY		= 51,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
@@ -532,6 +538,7 @@ enum fuse_notify_code {
 	FUSE_NOTIFY_RETRIEVE = 5,
 	FUSE_NOTIFY_DELETE = 6,
 	FUSE_NOTIFY_LOCK = 7,
+	FUSE_NOTIFY_FSNOTIFY = 8,
 	FUSE_NOTIFY_CODE_MAX,
 };
 
@@ -571,6 +578,20 @@ struct fuse_getattr_in {
 	uint64_t	fh;
 };
 
+struct fuse_notify_fsnotify_out {
+	uint64_t inode;
+	uint64_t mask;
+	uint32_t namelen;
+	uint32_t cookie;
+};
+
+struct fuse_notify_fsnotify_in {
+	uint64_t mask;
+	uint64_t group;
+	uint32_t action;
+	uint32_t padding;
+};
+
 #define FUSE_COMPAT_ATTR_OUT_SIZE 96
 
 struct fuse_attr_out {
-- 
2.33.0



* [RFC PATCH 2/7] FUSE: Add the remote inotify support capability to FUSE
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation Ioannis Angelakopoulos
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

Add the remote inotify support capability to the FUSE init flags. The
capability is advertised by the kernel only when the
"CONFIG_INOTIFY_USER" config option is enabled in the guest kernel.

The guest kernel enables the remote inotify feature only if virtiofsd
requests it. When enabled, the kernel suppresses the local inotify
events related to inodes within the directory exported through
virtiofs. The suppression of local events prevents the guest from
receiving duplicate events: one from the guest kernel and one from
virtiofsd.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/fuse/fuse_i.h | 7 +++++++
 fs/fuse/inode.c  | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f55f9f94b1a4..c3cebfb936d2 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -649,6 +649,13 @@ struct fuse_conn {
 	 */
 	unsigned handle_killpriv_v2:1;
 
+	/* Is the remote inotify capability supported by the filesystem?
+	 * If yes then all the local inotify events related to inodes
+	 * in the FUSE filesystem will be suppressed and only the remote
+	 * events will be let through.
+	 */
+	unsigned no_fsnotify:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 12d49a1914e8..039a040ddc91 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1143,6 +1143,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			}
 			if (arg->flags & FUSE_SETXATTR_EXT)
 				fc->setxattr_ext = 1;
+			if (!(arg->flags & FUSE_HAVE_FSNOTIFY)) {
+				fc->no_fsnotify = 1;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1190,6 +1193,9 @@ void fuse_send_init(struct fuse_mount *fm)
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		ia->in.flags |= FUSE_MAP_ALIGNMENT;
+#endif
+#ifdef CONFIG_INOTIFY_USER
+	ia->in.flags |= FUSE_HAVE_FSNOTIFY;
 #endif
 	if (fm->fc->auto_submounts)
 		ia->in.flags |= FUSE_SUBMOUNTS;
-- 
2.33.0



* [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 2/7] FUSE: Add the remote inotify support capability " Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-26 15:06   ` Amir Goldstein
  2021-10-25 20:46 ` [RFC PATCH 4/7] FUSE: Add the fuse_fsnotify_send_request to FUSE Ioannis Angelakopoulos
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

Every time a local watch is placed on, modified on, or removed from an
inode, the same operation has to take place on the FUSE server.

Thus add the inode operation "fuse_fsnotify_update_mark", which is
specific to FUSE inodes. This operation is called from the
"inotify_add_watch" system call in the inotify subsystem.

Specifically, the operation is called when a process tries to add, modify
or remove a watch from a FUSE inode and the remote fsnotify support is
enabled both in the guest kernel and the FUSE server (virtiofsd).

Essentially, when the kernel adds/modifies a watch locally, it also
sends an fsnotify request to the FUSE server to do the same. We keep
the local watch placement since it is essential to the functionality of
the fsnotify notification subsystem. However, local events generated by
the guest kernel are suppressed if they affect FUSE inodes and the
remote fsnotify support is enabled.

Also modify the "fsnotify_detach_mark" function in fs/notify/mark.c to add
support for the remote deletion of watches for FUSE inodes. In contrast to
the add/modify operation we do not modify the inotify subsystem, but the
fsnotify subsystem. That is because there are two ways of deleting a watch
from an inode. The first is by manually calling the "inotify_rm_watch"
system call and the second is automatically by the kernel when the process
that has created an inotify instance exits. In that case the kernel is
responsible for deleting all the watches corresponding to said inotify
instance.

Thus we send the fsnotify request for the deletion of the remote watch at
the lowest level within "fsnotify_detach_mark" to catch both watch removal
cases.

The "fuse_fsnotify_update_mark" function in turn calls the
"fuse_fsnotify_send_request" function, to send an fsnotify request to the
FUSE server related to an inode watch.


Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/fuse/dir.c                    | 29 +++++++++++++++++++++++++++++
 fs/notify/inotify/inotify_user.c | 11 +++++++++++
 fs/notify/mark.c                 | 10 ++++++++++
 include/linux/fs.h               |  2 ++
 4 files changed, 52 insertions(+)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d9b977c0f38d..f666aafc8d3f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -17,6 +17,8 @@
 #include <linux/xattr.h>
 #include <linux/iversion.h>
 #include <linux/posix_acl.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/inotify.h>
 
 static void fuse_advise_use_readdirplus(struct inode *dir)
 {
@@ -1805,6 +1807,30 @@ static int fuse_getattr(struct user_namespace *mnt_userns,
 	return fuse_update_get_attr(inode, NULL, stat, request_mask, flags);
 }
 
+static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
+				     uint64_t group, uint32_t mask)
+{
+	/*
+	 * We have to remove the bits added to the mask before being attached
+	 * or detached to the inode, since these bits are going to be
+	 * added by the "remote" host kernel. If these bits were still enabled
+	 * in the mask that was sent to the "remote" kernel then the watch would
+	 * be rejected as an unsupported value. These bits are added by the
+	 * fsnotify subsystem thus we use the corresponding fsnotify bits here.
+	 */
+	mask = mask & ~(FS_IN_IGNORED | FS_UNMOUNT | FS_IN_ONESHOT |
+			FS_EXCL_UNLINK | FS_EVENT_ON_CHILD);
+
+	if (!(mask & IN_ALL_EVENTS))
+		return -EINVAL;
+
+	/*
+	 * Action 0: Remove a watch
+	 * Action 1: Add/Modify watch
+	 */
+	return fuse_fsnotify_send_request(inode, mask, action, group);
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1824,6 +1850,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
 	.set_acl	= fuse_set_acl,
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
+	.fsnotify_update = fuse_fsnotify_update_mark,
 };
 
 static const struct file_operations fuse_dir_operations = {
@@ -1846,6 +1873,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.set_acl	= fuse_set_acl,
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
+	.fsnotify_update = fuse_fsnotify_update_mark,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
@@ -1853,6 +1881,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
 	.get_link	= fuse_get_link,
 	.getattr	= fuse_getattr,
 	.listxattr	= fuse_listxattr,
+	.fsnotify_update = fuse_fsnotify_update_mark,
 };
 
 void fuse_init_common(struct inode *inode)
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 62051247f6d2..3a0fee09a7c3 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -46,6 +46,8 @@
 #define INOTIFY_WATCH_COST	(sizeof(struct inotify_inode_mark) + \
 				 2 * sizeof(struct inode))
 
+#define FSNOTIFY_ADD_MODIFY_MARK	1
+
 /* configurable via /proc/sys/fs/inotify/ */
 static int inotify_max_queued_events __read_mostly;
 
@@ -764,6 +766,15 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
 
 	/* create/update an inode mark */
 	ret = inotify_update_watch(group, inode, mask);
+	/*
+	 * If the inode belongs to a remote filesystem/server that supports
+	 * remote inotify events then send the mark to the remote server
+	 */
+	if (ret >= 0 && inode->i_op->fsnotify_update) {
+		inode->i_op->fsnotify_update(inode,
+					     FSNOTIFY_ADD_MODIFY_MARK,
+					     (uint64_t)group, mask);
+	}
 	path_put(&path);
 fput_and_out:
 	fdput(f);
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index fa1d99101f89..f0d37276afcb 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -77,6 +77,7 @@
 #include "fsnotify.h"
 
 #define FSNOTIFY_REAPER_DELAY	(1)	/* 1 jiffy */
+#define FSNOTIFY_DELETE_MARK 0   /* Delete a mark in remote fsnotify */
 
 struct srcu_struct fsnotify_mark_srcu;
 struct kmem_cache *fsnotify_mark_connector_cachep;
@@ -399,6 +400,7 @@ void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info)
 void fsnotify_detach_mark(struct fsnotify_mark *mark)
 {
 	struct fsnotify_group *group = mark->group;
+	struct inode *inode = NULL;
 
 	WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
 	WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
@@ -411,6 +413,14 @@ void fsnotify_detach_mark(struct fsnotify_mark *mark)
 		spin_unlock(&mark->lock);
 		return;
 	}
+
+	/* Only if the object is an inode send a request to FUSE server */
+	inode = fsnotify_conn_inode(mark->connector);
+	if (inode && inode->i_op->fsnotify_update) {
+		inode->i_op->fsnotify_update(inode, FSNOTIFY_DELETE_MARK,
+					     (uint64_t)group, mark->mask);
+	}
+
 	mark->flags &= ~FSNOTIFY_MARK_FLAG_ATTACHED;
 	list_del_init(&mark->g_list);
 	spin_unlock(&mark->lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..86bcc44e3ab8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2149,6 +2149,8 @@ struct inode_operations {
 	int (*fileattr_set)(struct user_namespace *mnt_userns,
 			    struct dentry *dentry, struct fileattr *fa);
 	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+	int (*fsnotify_update)(struct inode *inode, uint32_t action,
+			       uint64_t group, uint32_t mask);
 } ____cacheline_aligned;
 
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
-- 
2.33.0



* [RFC PATCH 4/7] FUSE: Add the fuse_fsnotify_send_request to FUSE
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
                   ` (2 preceding siblings ...)
  2021-10-25 20:46 ` [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function Ioannis Angelakopoulos
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

The function "fuse_fsnotify_send_request" is responsible for sending an
fsnotify FUSE request to the FUSE server (virtiofsd).

The request contains all the information stored in the
"fuse_notify_fsnotify_in" struct: the event mask for the inotify watch,
the action to be performed on the watch (create, modify or delete), and
the group identifier, which is essentially the 64-bit (fsnotify_group)
pointer corresponding to one inotify instance created by a user space
process.

Since each group identifier is unique, the FUSE server can create one
inotify instance for each instance created by a user space process.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/fuse/dev.c    | 37 +++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_i.h |  3 +++
 2 files changed, 40 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index dde341a6388a..89eea8abac4b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1795,6 +1795,43 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
 	}
 }
 
+/* Send a request for a watch placement to the FUSE server */
+int fuse_fsnotify_send_request(struct inode *inode, uint32_t mask,
+			       uint32_t action, uint64_t group)
+{
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct fuse_notify_fsnotify_in inarg;
+	int err;
+	FUSE_ARGS(args);
+
+	/* The server does not support remote fsnotify events */
+	if (fm->fc->no_fsnotify)
+		return 0;
+
+	/*
+	 * Send the mask the action (remove, add, modify) and the
+	 * unique identifier that is the fsnotify group that is
+	 * interested in the watch to the FUSE server
+	 */
+	memset(&inarg, 0, sizeof(struct fuse_notify_fsnotify_in));
+	inarg.mask = mask;
+	inarg.action = action;
+	inarg.group = (uint64_t)group;
+
+	args.opcode = FUSE_FSNOTIFY;
+	args.nodeid = get_node_id(inode);
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(struct fuse_notify_fsnotify_in);
+	args.in_args[0].value = &inarg;
+	args.out_numargs = 0;
+	args.force = true;
+
+	err = fuse_simple_request(fm, &args);
+
+	return err;
+}
+EXPORT_SYMBOL(fuse_fsnotify_send_request);
+
 /* Look up request on processing list by unique ID */
 static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
 {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c3cebfb936d2..5c83c4535608 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1243,6 +1243,9 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type, bool rcu);
 int fuse_set_acl(struct user_namespace *mnt_userns, struct inode *inode,
 		 struct posix_acl *acl, int type);
 
+int fuse_fsnotify_send_request(struct inode *inode, uint32_t mask,
+			       uint32_t action, uint64_t group);
+
 /* readdir.c */
 int fuse_readdir(struct file *file, struct dir_context *ctx);
 
-- 
2.33.0



* [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
                   ` (3 preceding siblings ...)
  2021-10-25 20:46 ` [RFC PATCH 4/7] FUSE: Add the fuse_fsnotify_send_request to FUSE Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-26 14:37   ` Amir Goldstein
  2021-10-25 20:46 ` [RFC PATCH 6/7] FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation Ioannis Angelakopoulos
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

Generally, inotify events are generated locally by calling the
"fsnotify" function in fs/notify/fsnotify.c and its various helper
functions. However, we now expect events to arrive from the FUSE server
as well, so without any intervention a user space application would
receive two events: one generated locally and one arriving from the
server.

Hence, to avoid duplicate events we need to "suppress" the local events
generated by the guest kernel for FUSE inodes. To achieve this we add a
wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
checks whether the remote inotify is enabled and, based on that check,
either "suppresses" a local event or lets it through.

The wrapper will be called in the place of the original "fsnotify" call
that is responsible for the event notification (now renamed as
"__fsnotify").

When the remote inotify is not enabled, all local events will be let
through as expected. This process is completely transparent to user space.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/notify/fsnotify.c             | 35 ++++++++++++++++++++++++++++++--
 include/linux/fsnotify_backend.h | 14 ++++++++++++-
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 963e6ce75b96..848a824c29c4 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
 }
 
 /*
- * fsnotify - This is the main call to fsnotify.
+ * __fsnotify - This is the main call to fsnotify.
  *
  * The VFS calls into hook specific functions in linux/fsnotify.h.
  * Those functions then in turn call here.  Here will call out to all of the
@@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
  *		if both are non-NULL event may be reported to both.
  * @cookie:	inotify rename cookie
  */
-int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
+int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
 	     const struct qstr *file_name, struct inode *inode, u32 cookie)
 {
 	const struct path *path = fsnotify_data_path(data, data_type);
@@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
 
 	return ret;
 }
+
+/*
+ * Wrapper around fsnotify. The main functionality is to filter local events in
+ * case the inode belongs to a filesystem that supports remote events
+ */
+int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
+	     const struct qstr *file_name, struct inode *inode, u32 cookie)
+{
+
+	if (inode != NULL || dir != NULL) {
+		/*
+		 * Check if the fsnotify_event operation is available which
+		 * will let the remote inotify events go through and suppress
+		 * the local events
+		 */
+		if (inode && inode->i_op->fsnotify_event) {
+			return inode->i_op->fsnotify_event(mask, data,
+							   data_type, dir,
+							   file_name, inode,
+							   cookie);
+		}
+		if (dir && dir->i_op->fsnotify_event) {
+			return dir->i_op->fsnotify_event(mask, data,
+							 data_type, dir,
+							 file_name, inode,
+							 cookie);
+		}
+	}
+
+	return __fsnotify(mask, data, data_type, dir, file_name, inode, cookie);
+}
 EXPORT_SYMBOL_GPL(fsnotify);
 
 static __init int fsnotify_init(void)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 1ce66748a2d2..8cfbd685f1c0 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -408,10 +408,15 @@ struct fsnotify_mark {
 
 /* called from the vfs helpers */
 
-/* main fsnotify call to send events */
+/* fsnotify call wrapper */
 extern int fsnotify(__u32 mask, const void *data, int data_type,
 		    struct inode *dir, const struct qstr *name,
 		    struct inode *inode, u32 cookie);
+
+/* main fsnotify call to send events */
+extern int __fsnotify(__u32 mask, const void *data, int data_type,
+		    struct inode *dir, const struct qstr *name,
+		    struct inode *inode, u32 cookie);
 extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 			   int data_type);
 extern void __fsnotify_inode_delete(struct inode *inode);
@@ -595,6 +600,13 @@ static inline int fsnotify(__u32 mask, const void *data, int data_type,
 	return 0;
 }
 
+static inline int __fsnotify(__u32 mask, const void *data, int data_type,
+			   struct inode *dir, const struct qstr *name,
+			   struct inode *inode, u32 cookie)
+{
+	return 0;
+}
+
 static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask,
 				  const void *data, int data_type)
 {
-- 
2.33.0



* [RFC PATCH 6/7] FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
                   ` (4 preceding siblings ...)
  2021-10-25 20:46 ` [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-25 20:46 ` [RFC PATCH 7/7] virtiofs: Add support for handling the remote fsnotify notifications Ioannis Angelakopoulos
  2021-10-26 15:23 ` [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Amir Goldstein
  7 siblings, 0 replies; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

To avoid duplicate events we need to "suppress" the local events generated
by the guest kernel for FUSE inodes. To achieve this we introduce a new
inode operation "fuse_fsnotify_event". Any VFS operation on a FUSE inode
that calls the fsnotify subsystem will call this function.

Specifically, the new inode operation "fuse_fsnotify_event" is called
by the "fsnotify" wrapper function in fsnotify.c if the inode is a FUSE
inode. In turn, "fuse_fsnotify_event" checks whether the remote inotify
is enabled; if it is, the event is dropped, since local inotify events
should be "suppressed". If the remote inotify is not enabled, the event
goes through as expected.

In the case where the remote inotify is enabled, FUSE will directly call
"__fsnotify" to send the remote events to user space and not the "fsnotify"
wrapper.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/fuse/dir.c      | 27 +++++++++++++++++++++++++++
 include/linux/fs.h |  3 +++
 2 files changed, 30 insertions(+)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f666aafc8d3f..d36f85bd4dda 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1831,6 +1831,30 @@ static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
 	return fuse_fsnotify_send_request(inode, mask, action, group);
 }
 
+static int fuse_fsnotify_event(__u32 mask, const void *data, int data_type,
+			       struct inode *dir, const struct qstr *file_name,
+			       struct inode *inode, u32 cookie)
+{
+	struct fuse_mount *fm = NULL;
+
+	if (inode != NULL)
+		fm = get_fuse_mount(inode);
+	else
+		fm = get_fuse_mount(dir);
+
+	/* Remote inotify supported. Do nothing */
+	if (!(fm->fc->no_fsnotify)) {
+		return 0;
+	/*
+	 * Remote inotify not supported. Call the __fsnotify function
+	 * directly
+	 */
+	} else {
+		return __fsnotify(mask, data, data_type, dir, file_name,
+				  inode, cookie);
+	}
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1851,6 +1875,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event = fuse_fsnotify_event,
 };
 
 static const struct file_operations fuse_dir_operations = {
@@ -1874,6 +1899,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event = fuse_fsnotify_event,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
@@ -1882,6 +1908,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
 	.getattr	= fuse_getattr,
 	.listxattr	= fuse_listxattr,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event = fuse_fsnotify_event,
 };
 
 void fuse_init_common(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86bcc44e3ab8..ed6b62e2131a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2151,6 +2151,9 @@ struct inode_operations {
 	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
 	int (*fsnotify_update)(struct inode *inode, uint32_t action,
 			       uint64_t group, uint32_t mask);
+	int (*fsnotify_event)(__u32 mask, const void *data, int data_type,
+			      struct inode *dir, const struct qstr *file_name,
+			      struct inode *inode, u32 cookie);
 } ____cacheline_aligned;
 
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH 7/7] virtiofs: Add support for handling the remote fsnotify notifications
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
                   ` (5 preceding siblings ...)
  2021-10-25 20:46 ` [RFC PATCH 6/7] FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation Ioannis Angelakopoulos
@ 2021-10-25 20:46 ` Ioannis Angelakopoulos
  2021-10-26 15:23 ` [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Amir Goldstein
  7 siblings, 0 replies; 47+ messages in thread
From: Ioannis Angelakopoulos @ 2021-10-25 20:46 UTC (permalink / raw)
  To: linux-fsdevel, virtio-fs, linux-kernel, jack, amir73il, viro,
	miklos, vgoyal
  Cc: Ioannis Angelakopoulos

FUSE, and specifically virtiofs, should be able to handle the asynchronous
event notifications originating from the FUSE server. To this end we add a
FUSE_NOTIFY_FSNOTIFY case to the switch statement in "virtio_fs_handle_notify"
in fs/fuse/virtio_fs.c to handle these notifications.

The event notification contains the information that a user space
application would receive when monitoring an inode for events: the mask of
the inode watch, the name of the file the remote event was generated for
and, finally, the inotify cookie.

A new event is then generated in the guest, corresponding to the event
notification received from the FUSE server. Specifically, FUSE in the guest
kernel calls the "__fsnotify" function in fs/notify/fsnotify.c to send
the event to user space.

Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
---
 fs/fuse/virtio_fs.c | 64 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d3dba9e3a07e..4c48c2812caa 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include <linux/fs_parser.h>
 #include <linux/highmem.h>
 #include <linux/uio.h>
+#include <linux/fsnotify_backend.h>
 #include "fuse_i.h"
 
 /* Used to help calculate the FUSE connection's max_pages limit for a request's
@@ -655,14 +656,69 @@ static void notify_node_reuse(struct virtio_fs_vq *notify_fsvq,
 	spin_unlock(&notify_fsvq->lock);
 }
 
+static int fsnotify_remote_event(struct inode *inode, uint32_t mask,
+				 struct qstr *filename, uint32_t cookie)
+{
+	return __fsnotify(mask, NULL, 0, NULL,
+			  (const struct qstr *)filename, inode, cookie);
+}
+
+/*
+ * Function to generate a new event when a fsnotify notification comes from the
+ * fuse server
+ */
+static int generate_fsnotify_event(struct fuse_conn *fc,
+			struct fuse_notify_fsnotify_out *fsnotify_out)
+{
+	struct inode *inode;
+	uint32_t mask, cookie;
+	struct fuse_mount *fm;
+	int ret = -1;
+	struct qstr name;
+
+	down_read(&fc->killsb);
+	inode = fuse_ilookup(fc, fsnotify_out->inode, &fm);
+	/*
+	 * The inode that corresponds to the event does not exist in this case
+	 * so do not generate any new event and just return an error
+	 */
+	if (!inode)
+		goto out;
+
+	mask = fsnotify_out->mask;
+	cookie = fsnotify_out->cookie;
+
+	/*
+	 * If the notification contained the name of the file/dir the event
+	 * occurred for, it will be placed after the fsnotify_out struct in the
+	 * notification message
+	 */
+	if (fsnotify_out->namelen > 0) {
+		name.len = fsnotify_out->namelen;
+		name.name = (char *)fsnotify_out + sizeof(struct fuse_notify_fsnotify_out);
+		ret = fsnotify_remote_event(inode, mask, &name, cookie);
+	} else {
+		ret = fsnotify_remote_event(inode, mask, NULL, cookie);
+	}
+
+out:
+	up_read(&fc->killsb);
+	if (ret < 0)
+		return -EINVAL;
+
+	return ret;
+}
+
 static int virtio_fs_handle_notify(struct virtio_fs *vfs,
-				   struct virtio_fs_notify_node *notifyn)
+				   struct virtio_fs_notify_node *notifyn,
+				   struct fuse_conn *fc)
 {
 	int ret = 0, no_reuse = 0;
 	struct virtio_fs_notify *notify = &notifyn->notify;
 	struct virtio_fs_vq *notify_fsvq = &vfs->vqs[VQ_NOTIFY_IDX];
 	struct fuse_out_header *oh = &notify->out_hdr;
 	struct fuse_notify_lock_out *lo;
+	struct fuse_notify_fsnotify_out *fsnotify_out;
 
 	/*
 	 * For notifications, oh.unique is 0 and oh->error contains code
@@ -673,6 +729,10 @@ static int virtio_fs_handle_notify(struct virtio_fs *vfs,
 		lo = (struct fuse_notify_lock_out *) &notify->outarg;
 		no_reuse = notify_complete_waiting_req(vfs, lo);
 		break;
+	case FUSE_NOTIFY_FSNOTIFY:
+		fsnotify_out = (struct fuse_notify_fsnotify_out *) &notify->outarg;
+		generate_fsnotify_event(fc, fsnotify_out);
+		break;
 	default:
 		pr_err("virtio-fs: Unexpected notification %d\n", oh->error);
 	}
@@ -711,7 +771,7 @@ static void virtio_fs_notify_done_work(struct work_struct *work)
 		WARN_ON(oh->unique);
 		list_del_init(&notifyn->list);
 		/* Handle notification */
-		virtio_fs_handle_notify(vfs, notifyn);
+		virtio_fs_handle_notify(vfs, notifyn, fsvq->fud->fc);
 		spin_lock(&fsvq->lock);
 		dec_in_flight_req(fsvq);
 		spin_unlock(&fsvq->lock);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function
  2021-10-25 20:46 ` [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function Ioannis Angelakopoulos
@ 2021-10-26 14:37   ` Amir Goldstein
  2021-10-26 15:38     ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 14:37 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<iangelak@redhat.com> wrote:
>
> Generally, inotify events are generated locally by calling the "fsnotify"
> function in fs/notify/fsnotify.c and various helper functions. However, now
> we expect events to arrive from the FUSE server. Thus, without any
> intervention a user space application will receive two events. One event is
> generated locally and one arrives from the server.
>
> Hence, to avoid duplicate events we need to "suppress" the local events
> generated by the guest kernel for FUSE inodes. To achieve this we add a
> wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
> checks if the remote inotify is enabled and based on the check either it
> "suppresses" or lets through a local event.
>
> The wrapper will be called in the place of the original "fsnotify" call
> that is responsible for the event notification (now renamed as
> "__fsnotify").
>
> When the remote inotify is not enabled, all local events will be let
> through as expected. This process is completely transparent to user space.
>
> Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> ---
>  fs/notify/fsnotify.c             | 35 ++++++++++++++++++++++++++++++--
>  include/linux/fsnotify_backend.h | 14 ++++++++++++-
>  2 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 963e6ce75b96..848a824c29c4 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>  }
>
>  /*
> - * fsnotify - This is the main call to fsnotify.
> + * __fsnotify - This is the main call to fsnotify.
>   *
>   * The VFS calls into hook specific functions in linux/fsnotify.h.
>   * Those functions then in turn call here.  Here will call out to all of the
> @@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>   *             if both are non-NULL event may be reported to both.
>   * @cookie:    inotify rename cookie
>   */
> -int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> +int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>              const struct qstr *file_name, struct inode *inode, u32 cookie)
>  {
>         const struct path *path = fsnotify_data_path(data, data_type);
> @@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>
>         return ret;
>  }
> +
> +/*
> + * Wrapper around fsnotify. The main functionality is to filter local events in
> + * case the inode belongs to a filesystem that supports remote events
> + */
> +int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> +            const struct qstr *file_name, struct inode *inode, u32 cookie)
> +{
> +
> +       if (inode != NULL || dir != NULL) {
> +               /*
> +                * Check if the fsnotify_event operation is available which
> +                * will let the remote inotify events go through and suppress
> +                * the local events
> +                */
> +               if (inode && inode->i_op->fsnotify_event) {
> +                       return inode->i_op->fsnotify_event(mask, data,
> +                                                          data_type, dir,
> +                                                          file_name, inode,
> +                                                          cookie);
> +               }
> +               if (dir && dir->i_op->fsnotify_event) {
> +                       return dir->i_op->fsnotify_event(mask, data,
> +                                                        data_type, dir,
> +                                                        file_name, inode,
> +                                                        cookie);
> +               }
> +       }
> +

That's not the way to accomplish what you want to do.

Assuming that we agree to let a filesystem silence VFS fsnotify hooks,
it should be done using an inode flag, similar to the file flag FMODE_NONOTIFY.

But the reason you want to do that seems a bit odd.
Duplicate events are going to be merged with fanotify, and I think that
you ruled out fanotify for the wrong reasons
(you should have ruled out only permission events).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-25 20:46 ` [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE Ioannis Angelakopoulos
@ 2021-10-26 14:56   ` Amir Goldstein
  2021-10-26 18:28     ` Amir Goldstein
                       ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 14:56 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<iangelak@redhat.com> wrote:
>
> Since fsnotify is the backend for the inotify subsystem, all the backend
> code we add is related to fsnotify.
>
> To support an fsnotify request in FUSE and specifically virtiofs we add a
> new opcode for the FSNOTIFY (51) operation request in the "fuse.h" header.
>
> Also add the "fuse_notify_fsnotify_in" and "fuse_notify_fsnotify_out"
> structs that are responsible for passing the fsnotify/inotify related data
> to and from the FUSE server.
>
> Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> ---
>  include/uapi/linux/fuse.h | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 46838551ea84..418b7fc72417 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -186,6 +186,9 @@
>   *  - add FUSE_SYNCFS
>   *  7.35
>   *  - add FUSE_NOTIFY_LOCK
> + *  7.36
> + *  - add FUSE_HAVE_FSNOTIFY
> + *  - add fuse_notify_fsnotify_(in,out)
>   */
>
>  #ifndef _LINUX_FUSE_H
> @@ -221,7 +224,7 @@
>  #define FUSE_KERNEL_VERSION 7
>
>  /** Minor version number of this interface */
> -#define FUSE_KERNEL_MINOR_VERSION 35
> +#define FUSE_KERNEL_MINOR_VERSION 36
>
>  /** The node ID of the root inode */
>  #define FUSE_ROOT_ID 1
> @@ -338,6 +341,7 @@ struct fuse_file_lock {
>   *                     write/truncate sgid is killed only if file has group
>   *                     execute permission. (Same as Linux VFS behavior).
>   * FUSE_SETXATTR_EXT:  Server supports extended struct fuse_setxattr_in
> + * FUSE_HAVE_FSNOTIFY: remote fsnotify/inotify event subsystem support
>   */
>  #define FUSE_ASYNC_READ                (1 << 0)
>  #define FUSE_POSIX_LOCKS       (1 << 1)
> @@ -369,6 +373,7 @@ struct fuse_file_lock {
>  #define FUSE_SUBMOUNTS         (1 << 27)
>  #define FUSE_HANDLE_KILLPRIV_V2        (1 << 28)
>  #define FUSE_SETXATTR_EXT      (1 << 29)
> +#define FUSE_HAVE_FSNOTIFY     (1 << 30)
>
>  /**
>   * CUSE INIT request/reply flags
> @@ -515,6 +520,7 @@ enum fuse_opcode {
>         FUSE_SETUPMAPPING       = 48,
>         FUSE_REMOVEMAPPING      = 49,
>         FUSE_SYNCFS             = 50,
> +       FUSE_FSNOTIFY           = 51,
>
>         /* CUSE specific operations */
>         CUSE_INIT               = 4096,
> @@ -532,6 +538,7 @@ enum fuse_notify_code {
>         FUSE_NOTIFY_RETRIEVE = 5,
>         FUSE_NOTIFY_DELETE = 6,
>         FUSE_NOTIFY_LOCK = 7,
> +       FUSE_NOTIFY_FSNOTIFY = 8,
>         FUSE_NOTIFY_CODE_MAX,
>  };
>
> @@ -571,6 +578,20 @@ struct fuse_getattr_in {
>         uint64_t        fh;
>  };
>
> +struct fuse_notify_fsnotify_out {
> +       uint64_t inode;

A 64-bit inode number is not a good unique identifier of the object.
You need to either include the generation in the object identifier
or, much better, use the object's NFS file handle, the same way
that fanotify stores object identifiers.

> +       uint64_t mask;
> +       uint32_t namelen;
> +       uint32_t cookie;

I object to persisting with the two-events-joined-by-cookie design.
Any new design should include a single event for rename
with information about src and dst.

I know this is inconvenient, but we are NOT going to create a "remote inotify"
interface; we need to create a "remote fsnotify" interface, and if the server
wants to use inotify, it will need to join the disjoined MOVE_FROM/TO events
into a single "remote event" that FUSE will use to call fsnotify_move().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation
  2021-10-25 20:46 ` [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation Ioannis Angelakopoulos
@ 2021-10-26 15:06   ` Amir Goldstein
  2021-11-01 17:49     ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 15:06 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<iangelak@redhat.com> wrote:
>
> Every time a local watch is placed/modified/removed on/from an inode the
> same operation has to take place in the FUSE server.
>
> Thus add the inode operation "fuse_fsnotify_update_mark", which is
> specific to FUSE inodes. This operation is called from the
> "inotify_add_watch" system call in the inotify subsystem.
>
> Specifically, the operation is called when a process tries to add, modify
> or remove a watch from a FUSE inode and the remote fsnotify support is
> enabled both in the guest kernel and the FUSE server (virtiofsd).
>
> Essentially, when the kernel adds/modifies a watch locally, also send a
> fsnotify request to the FUSE server to do the same. We keep the local watch
> placement since it is essential for the functionality of the fsnotify
> notification subsystem. However, the local events generated by the guest
> kernel will be suppressed if they affect FUSE inodes and the remote
> fsnotify support is enabled.
>
> Also modify the "fsnotify_detach_mark" function in fs/notify/mark.c to add
> support for the remote deletion of watches for FUSE inodes. In contrast to
> the add/modify operation we do not modify the inotify subsystem, but the
> fsnotify subsystem. That is because there are two ways of deleting a watch
> from an inode. The first is by manually calling the "inotify_rm_watch"
> system call and the second is automatically by the kernel when the process
> that has created an inotify instance exits. In that case the kernel is
> responsible for deleting all the watches corresponding to said inotify
> instance.
>
> Thus we send the fsnotify request for the deletion of the remote watch at
> the lowest level within "fsnotify_detach_mark" to catch both watch removal
> cases.
>
> The "fuse_fsnotify_update_mark" function in turn calls the
> "fuse_fsnotify_send_request" function, to send an fsnotify request to the
> FUSE server related to an inode watch.
>
>
> Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> ---
>  fs/fuse/dir.c                    | 29 +++++++++++++++++++++++++++++
>  fs/notify/inotify/inotify_user.c | 11 +++++++++++
>  fs/notify/mark.c                 | 10 ++++++++++
>  include/linux/fs.h               |  2 ++
>  4 files changed, 52 insertions(+)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index d9b977c0f38d..f666aafc8d3f 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -17,6 +17,8 @@
>  #include <linux/xattr.h>
>  #include <linux/iversion.h>
>  #include <linux/posix_acl.h>
> +#include <linux/fsnotify_backend.h>
> +#include <linux/inotify.h>
>
>  static void fuse_advise_use_readdirplus(struct inode *dir)
>  {
> @@ -1805,6 +1807,30 @@ static int fuse_getattr(struct user_namespace *mnt_userns,
>         return fuse_update_get_attr(inode, NULL, stat, request_mask, flags);
>  }
>
> +static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
> +                                    uint64_t group, uint32_t mask)
> +{
> +       /*
> +        * We have to remove the bits added to the mask before being attached
> +        * or detached to the inode, since these bits are going to be
> +        * added by the "remote" host kernel. If these bits were still enabled
> +        * in the mask that was sent to the "remote" kernel then the watch would
> +        * be rejected as an unsupported value. These bits are added by the
> +        * fsnotify subsystem thus we use the corresponding fsnotify bits here.
> +        */
> +       mask = mask & ~(FS_IN_IGNORED | FS_UNMOUNT | FS_IN_ONESHOT |
> +                       FS_EXCL_UNLINK | FS_EVENT_ON_CHILD);
> +
> +       if (!(mask & IN_ALL_EVENTS))
> +               return -EINVAL;
> +
> +       /*
> +        * Action 0: Remove a watch
> +        * Action 1: Add/Modify watch
> +        */
> +       return fuse_fsnotify_send_request(inode, mask, action, group);
> +}
> +
>  static const struct inode_operations fuse_dir_inode_operations = {
>         .lookup         = fuse_lookup,
>         .mkdir          = fuse_mkdir,
> @@ -1824,6 +1850,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
>         .set_acl        = fuse_set_acl,
>         .fileattr_get   = fuse_fileattr_get,
>         .fileattr_set   = fuse_fileattr_set,
> +       .fsnotify_update = fuse_fsnotify_update_mark,
>  };
>
>  static const struct file_operations fuse_dir_operations = {
> @@ -1846,6 +1873,7 @@ static const struct inode_operations fuse_common_inode_operations = {
>         .set_acl        = fuse_set_acl,
>         .fileattr_get   = fuse_fileattr_get,
>         .fileattr_set   = fuse_fileattr_set,
> +       .fsnotify_update = fuse_fsnotify_update_mark,
>  };
>
>  static const struct inode_operations fuse_symlink_inode_operations = {
> @@ -1853,6 +1881,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
>         .get_link       = fuse_get_link,
>         .getattr        = fuse_getattr,
>         .listxattr      = fuse_listxattr,
> +       .fsnotify_update = fuse_fsnotify_update_mark,
>  };
>
>  void fuse_init_common(struct inode *inode)
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index 62051247f6d2..3a0fee09a7c3 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -46,6 +46,8 @@
>  #define INOTIFY_WATCH_COST     (sizeof(struct inotify_inode_mark) + \
>                                  2 * sizeof(struct inode))
>
> +#define FSNOTIFY_ADD_MODIFY_MARK       1
> +
>  /* configurable via /proc/sys/fs/inotify/ */
>  static int inotify_max_queued_events __read_mostly;
>
> @@ -764,6 +766,15 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
>
>         /* create/update an inode mark */
>         ret = inotify_update_watch(group, inode, mask);
> +       /*
> +        * If the inode belongs to a remote filesystem/server that supports
> +        * remote inotify events then send the mark to the remote server
> +        */
> +       if (ret >= 0 && inode->i_op->fsnotify_update) {
> +               inode->i_op->fsnotify_update(inode,
> +                                            FSNOTIFY_ADD_MODIFY_MARK,
> +                                            (uint64_t)group, mask);
> +       }
>         path_put(&path);
>  fput_and_out:
>         fdput(f);
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index fa1d99101f89..f0d37276afcb 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -77,6 +77,7 @@
>  #include "fsnotify.h"
>
>  #define FSNOTIFY_REAPER_DELAY  (1)     /* 1 jiffy */
> +#define FSNOTIFY_DELETE_MARK 0   /* Delete a mark in remote fsnotify */

This define is part of the VFS API; it should be in an include file alongside
FSNOTIFY_ADD_MODIFY_MARK (if we keep them in the API).

>
>  struct srcu_struct fsnotify_mark_srcu;
>  struct kmem_cache *fsnotify_mark_connector_cachep;
> @@ -399,6 +400,7 @@ void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info)
>  void fsnotify_detach_mark(struct fsnotify_mark *mark)
>  {
>         struct fsnotify_group *group = mark->group;
> +       struct inode *inode = NULL;
>
>         WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
>         WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
> @@ -411,6 +413,14 @@ void fsnotify_detach_mark(struct fsnotify_mark *mark)
>                 spin_unlock(&mark->lock);
>                 return;
>         }
> +
> +       /* Only if the object is an inode send a request to FUSE server */
> +       inode = fsnotify_conn_inode(mark->connector);
> +       if (inode && inode->i_op->fsnotify_update) {
> +               inode->i_op->fsnotify_update(inode, FSNOTIFY_DELETE_MARK,
> +                                            (uint64_t)group, mark->mask);
> +       }
> +
>         mark->flags &= ~FSNOTIFY_MARK_FLAG_ATTACHED;
>         list_del_init(&mark->g_list);
>         spin_unlock(&mark->lock);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e7a633353fd2..86bcc44e3ab8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2149,6 +2149,8 @@ struct inode_operations {
>         int (*fileattr_set)(struct user_namespace *mnt_userns,
>                             struct dentry *dentry, struct fileattr *fa);
>         int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
> +       int (*fsnotify_update)(struct inode *inode, uint32_t action,
> +                              uint64_t group, uint32_t mask);
>  } ____cacheline_aligned;
>

Please split the patch that introduces the API from the FUSE implementation.

Regarding the API, group does not belong in this interface.
The inode object has an "aggregated mask" at i_fsnotify_mask
indicating an interest in an event from any group.
Remote servers should be notified when the aggregated mask changes.

Hence, Miklos has proposed a "remote fsnotify update" API which does
not carry the mask nor the action, only the watched object:
https://lore.kernel.org/linux-fsdevel/20190501205541.GC30899@veci.piliscsaba.redhat.com/

On that same thread, you will see that I also proposed the API to support
full filesystem watch (by passing sb).
I am not requiring that you implement sb watch for FUSE/virtiofs, but the
API should take this future extension into account.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
                   ` (6 preceding siblings ...)
  2021-10-25 20:46 ` [RFC PATCH 7/7] virtiofs: Add support for handling the remote fsnotify notifications Ioannis Angelakopoulos
@ 2021-10-26 15:23 ` Amir Goldstein
  2021-10-26 15:52   ` Vivek Goyal
  2021-10-26 16:18   ` Vivek Goyal
  7 siblings, 2 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 15:23 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal, Steve French

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<iangelak@redhat.com> wrote:
>
> Hello,
>
> I am a PhD student currently interning at Red Hat and working on the
> virtiofs file system. I am trying to add support for the Inotify
> notification subsystem in virtiofs. I seek your feedback and
> suggestions on what is the right direction to take.
>

Hi Ioannis!

I am very happy that you have taken on this task.
People have been requesting this functionality in the past [1].
Not specifically for FUSE, but FUSE is a very good place to start.

[1] https://lore.kernel.org/linux-fsdevel/CAH2r5mt1Fy6hR+Rdig0sHsOS8fVQDsKf9HqZjvjORS3R-7=RFw@mail.gmail.com/

> Currently, virtiofs does not support the Inotify API and there are
> applications which look for the Inotify support in virtiofs (e.g., Kata
> containers).
>
> However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
> are supported only by local kernel file systems.
>
> --Proposed solution
>
> With this RFC patch we add the inotify support to the FUSE kernel module
> so that the remote virtiofs file system (based on FUSE) used by a QEMU
> guest VM can make use of this feature.
>
> Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
> server and also receive remote inotify events. To achieve this we modify
> the fsnotify subsystem so that it calls specific hooks in FUSE when a
> remote watch is added/modified/deleted and FUSE calls into fsnotify when
> a remote event is received to send the event to user space.
>
> In our case the FUSE server is virtiofsd.
>
> We also considered an out of band approach for implementing the remote
> notifications (e.g., FAM, Gamin), however this approach would break
> applications that are already compatible with inotify, and thus would
> require an update.
>
> These kernel patches depend on the patch series posted by Vivek Goyal:
> https://lore.kernel.org/linux-fsdevel/20210930143850.1188628-1-vgoyal@redhat.com/

It would be a shame if remote fsnotify were not added as a generic
capability of FUSE filesystems, rather than only of virtiofs.
Is there a way to get rid of this dependency?

>
> My PoC Linux kernel patches are here:
> https://github.com/iangelak/linux/commits/inotify_v1
>
> My PoC virtiofsd corresponding patches are here:
> https://github.com/iangelak/qemu/commits/inotify_v1
>
> --Advantages
>
> 1) Our approach is compatible with existing applications that rely on
> Inotify, thus improves portability.
>
> 2) Everything is implemented in one place (virtiofs and virtiofsd) and
> there is no need to run additional processes (daemons) specifically to
> handle the remote notifications.
>
> --Weaknesses
>
> 1) Both a local (QEMU guest) and a remote (Host/Virtiofsd) watch on the
> target inode have to be active at the same time. The local watch
> guarantees that events are going to be sent to the guest user space while
> the remote watch captures events occurring on the host (and will be sent
> to the guest).
>
> As a result, when an event occurs on an inode within the directory
> exported by virtiofs, two events will be generated at the same time: a
> local event (generated by the guest kernel) and a remote event (generated
> by the host), and thus the guest will receive duplicate events.
>
> To account for this issue we implemented two modes; one where local events
> function as expected (when virtiofsd does not support the remote
> inotify) and one where the local events are suppressed and only the
> remote events originating from the host side are let through (when
> virtiofsd supports the remote inotify).

Dropping events from the local side would be weird.
Avoiding duplicate events is not a good enough reason IMO
compared to the problems this could cause.
I am not convinced this is worth it.

>
> 3) The lifetime of the local watch in the guest kernel is very
> important. Specifically, there is a possibility that the guest does not
> receive remote events on time, if it removes its local watch on the
> target or deletes the inode (and thus the guest kernel removes the watch).
> In these cases the guest kernel removes the local watch before the
> remote events arrive from the host (virtiofsd) and as such the guest
> kernel drops all the remote events for the target inode (since the
> corresponding local watch does not exist anymore). On top of that,
> virtiofsd keeps an open proc file descriptor for each inode that is not
> immediately closed on an inode deletion request by the guest. As a result
> no IN_DELETE_SELF is generated by virtiofsd and sent to the guest kernel
> in this case.
>
> 4) Because virtiofsd implements additional operations during the
> servicing of a request from the guest, additional inotify events might
> be generated and sent to the guest other than the ones the guest
> expects. However, this is not technically a limitation and it is dependent
> on the implementation of the remote file system server (in this case
> virtiofsd).
>
> 5) The current implementation supports only Inotify, due to its
> simplicity, and not Fanotify. Fanotify's complexity requires support from
> virtiofsd that is not currently available. One such example is
> Fsnotify's access permission decision capabilities, which could
> conflict with virtiofsd's current access permission implementation.

Good example, bad decision.
It is perfectly fine for a remote server to provide a "supported event mask"
and leave permission events out of the game.

Imagine a remote SMB server: it also does not support all of the events
that the local application would like to set.

That should not be a reason to rule out fanotify, only specific
fanotify events.

Same goes for FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM:
a remote server may or may not support anything other than watching
inode objects, but that should not be a limit of the "remote fsnotify" API.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function
  2021-10-26 14:37   ` Amir Goldstein
@ 2021-10-26 15:38     ` Vivek Goyal
  0 siblings, 0 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-10-26 15:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

On Tue, Oct 26, 2021 at 05:37:55PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> > Generally, inotify events are generated locally by calling the "fsnotify"
> > function in fs/notify/fsnotify.c and various helper functions. However, now
> > we expect events to arrive from the FUSE server. Thus, without any
> > intervention a user space application will receive two events. One event is
> > generated locally and one arrives from the server.
> >
> > Hence, to avoid duplicate events we need to "suppress" the local events
> > generated by the guest kernel for FUSE inodes. To achieve this we add a
> > wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
> > checks if the remote inotify is enabled and based on the check either it
> > "suppresses" or lets through a local event.
> >
> > The wrapper will be called in the place of the original "fsnotify" call
> > that is responsible for the event notification (now renamed as
> > "__fsnotify").
> >
> > When the remote inotify is not enabled, all local events will be let
> > through as expected. This process is completely transparent to user space.
> >
> > Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> > ---
> >  fs/notify/fsnotify.c             | 35 ++++++++++++++++++++++++++++++--
> >  include/linux/fsnotify_backend.h | 14 ++++++++++++-
> >  2 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> > index 963e6ce75b96..848a824c29c4 100644
> > --- a/fs/notify/fsnotify.c
> > +++ b/fs/notify/fsnotify.c
> > @@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >  }
> >
> >  /*
> > - * fsnotify - This is the main call to fsnotify.
> > + * __fsnotify - This is the main call to fsnotify.
> >   *
> >   * The VFS calls into hook specific functions in linux/fsnotify.h.
> >   * Those functions then in turn call here.  Here will call out to all of the
> > @@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >   *             if both are non-NULL event may be reported to both.
> >   * @cookie:    inotify rename cookie
> >   */
> > -int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > +int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >              const struct qstr *file_name, struct inode *inode, u32 cookie)
> >  {
> >         const struct path *path = fsnotify_data_path(data, data_type);
> > @@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >
> >         return ret;
> >  }
> > +
> > +/*
> > + * Wrapper around fsnotify. The main functionality is to filter local events in
> > + * case the inode belongs to a filesystem that supports remote events
> > + */
> > +int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > +            const struct qstr *file_name, struct inode *inode, u32 cookie)
> > +{
> > +
> > +       if (inode != NULL || dir != NULL) {
> > +               /*
> > +                * Check if the fsnotify_event operation is available which
> > +                * will let the remote inotify events go through and suppress
> > +                * the local events
> > +                */
> > +               if (inode && inode->i_op->fsnotify_event) {
> > +                       return inode->i_op->fsnotify_event(mask, data,
> > +                                                          data_type, dir,
> > +                                                          file_name, inode,
> > +                                                          cookie);
> > +               }
> > +               if (dir && dir->i_op->fsnotify_event) {
> > +                       return dir->i_op->fsnotify_event(mask, data,
> > +                                                        data_type, dir,
> > +                                                        file_name, inode,
> > +                                                        cookie);
> > +               }
> > +       }
> > +
> 
> That's not the way to accomplish what you want to do.
> 
> Assuming that we agree to let filesystem silence VFS fsnotify hooks
> it should be done using an inode flag, similar to file flag FMODE_NONOTIFY.

Ok, so basically define a new flag say FMODE_FOO and check that in
fsnotify() and ignore event if flag is set. And filesystem can set
that flag if remote events are enabled. And then vfs will ignore
local events. Sounds reasonable.

> 
> But the reason you want to do that seems a bit odd.

I gave this idea to Ioannis. But defining an inode flag probably is
much cheaper as compared to an inode operation.

> Duplicate events are going to be merged with fanotify and I think that
> you ruled out fanotify for the wrong reasons
> (you should have only ruled out permission events)

Ioannis was looking at fanotify and wondering if fanotify could be supported
as well. fanotify seemed to be much more powerful as compared to inotify,
and some of its features did not look feasible to support for
remote filesystems.

But if it is acceptable to support only limited functionality/events, then
it probably is a good idea to keep fanotify in mind and somebody can extend
it for fanotify as well.

Ideally it will be nice to support fanotify as well (as much as possible).
It just seemed more complicated, so we thought we would start
with a simpler API (inotify) and try to implement that first.

I don't understand "Duplicate events are going to be merged with
fanotify". So fanotify has some mechanism to figure out that there are
duplicate events and merge them, so user space never sees duplicate events?

Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 15:23 ` [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Amir Goldstein
@ 2021-10-26 15:52   ` Vivek Goyal
  2021-10-26 18:19     ` Amir Goldstein
  2021-10-26 16:18   ` Vivek Goyal
  1 sibling, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-10-26 15:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> > Hello,
> >
> > I am a PhD student currently interning at Red Hat and working on the
> > virtiofs file system. I am trying to add support for the Inotify
> > notification subsystem in virtiofs. I seek your feedback and
> > suggestions on what is the right direction to take.
> >
> 
> Hi Ioannis!
> 
> I am very happy that you have taken on this task.
> People have been requesting this functionality in the past [1]
> Not specifically for FUSE, but FUSE is a very good place to start.
> 
> [1] https://lore.kernel.org/linux-fsdevel/CAH2r5mt1Fy6hR+Rdig0sHsOS8fVQDsKf9HqZjvjORS3R-7=RFw@mail.gmail.com/

Hi Amir,

Aha, good to know that other remote filesystems are looking for similar
functionality. So there can be a common design so that other remote
filesystems can support remote inotify/fanotify events too.

> 
> > Currently, virtiofs does not support the Inotify API and there are
> > applications which look for the Inotify support in virtiofs (e.g., Kata
> > containers).
> >
> > However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
> > are supported only by local kernel file systems.
> >
> > --Proposed solution
> >
> > With this RFC patch we add the inotify support to the FUSE kernel module
> > so that the remote virtiofs file system (based on FUSE) used by a QEMU
> > guest VM can make use of this feature.
> >
> > Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
> > server and also receive remote inotify events. To achieve this we modify
> > the fsnotify subsystem so that it calls specific hooks in FUSE when a
> > remote watch is added/modified/deleted and FUSE calls into fsnotify when
> > a remote event is received to send the event to user space.
> >
> > In our case the FUSE server is virtiofsd.
> >
> > We also considered an out of band approach for implementing the remote
> > notifications (e.g., FAM, Gamin), however this approach would break
> > applications that are already compatible with inotify, and thus would
> > require an update.
> >
> > These kernel patches depend on the patch series posted by Vivek Goyal:
> > https://lore.kernel.org/linux-fsdevel/20210930143850.1188628-1-vgoyal@redhat.com/
> 
> It would be a shame if remote fsnotify was not added as a generic
> capability to FUSE filesystems and not only virtiofs.
> Is there a way to get rid of this dependency?

Agreed. I think currently he has just added support for virtiofs, but
it should be possible to extend it to other fuse filesystems as well.
All they need to do is send notifications, and regular fuse already
has support for allowing the server to send notifications to the fuse
client kernel.

> 
> >
> > My PoC Linux kernel patches are here:
> > https://github.com/iangelak/linux/commits/inotify_v1
> >
> > My PoC virtiofsd corresponding patches are here:
> > https://github.com/iangelak/qemu/commits/inotify_v1
> >
> > --Advantages
> >
> > 1) Our approach is compatible with existing applications that rely on
> > Inotify, thus improves portability.
> >
> > 2) Everything is implemented in one place (virtiofs and virtiofsd) and
> > there is no need to run additional processes (daemons) specifically to
> > handle the remote notifications.
> >
> > --Weaknesses
> >
> > 1) Both a local (QEMU guest) and a remote (Host/Virtiofsd) watch on the
> > target inode have to be active at the same time. The local watch
> > guarantees that events are going to be sent to the guest user space while
> > the remote watch captures events occurring on the host (and will be sent
> > to the guest).
> >
> > As a result, when an event occurs on an inode within the directory
> > exported by virtiofs, two events will be generated at the same time: a
> > local event (generated by the guest kernel) and a remote event (generated
> > by the host), and thus the guest will receive duplicate events.
> >
> > To account for this issue we implemented two modes: one where local events
> > function as expected (when virtiofsd does not support the remote
> > inotify) and one where the local events are suppressed and only the
> > remote events originating from the host side are let through (when
> > virtiofsd supports the remote inotify).
> 
> Dropping events from the local side would be weird.
> Avoiding duplicate events is not a good enough reason IMO
> compared to the problems this could cause.
> I am not convinced this is worth it.

So what should be done? If we don't drop local events, then the application
will see both local and remote events. And then we will have to build
into the API the knowledge that an event can be either local or remote.
The application should be able to distinguish between the two and act
accordingly. That sounds like a lot. And I am not even sure if applications
care about local events in the case of a remote shared filesystem.

I have no experience with inotify/fanotify/fsnotify and what people
expect from the inotify/fanotify API. So we are open to all ideas
w.r.t. what would be a good design to support this on remote
filesystems. Currently the whole infrastructure seems to be written with
local filesystems in mind.

> 
> >
> > 3) The lifetime of the local watch in the guest kernel is very
> > important. Specifically, there is a possibility that the guest does not
> > receive remote events on time, if it removes its local watch on the
> > target or deletes the inode (and thus the guest kernel removes the watch).
> > In these cases the guest kernel removes the local watch before the
> > remote events arrive from the host (virtiofsd) and as such the guest
> > kernel drops all the remote events for the target inode (since the
> > corresponding local watch does not exist anymore). On top of that,
> > virtiofsd keeps an open proc file descriptor for each inode that is not
> > immediately closed on an inode deletion request by the guest. As a result
> > no IN_DELETE_SELF is generated by virtiofsd and sent to the guest kernel
> > in this case.
> >
> > 4) Because virtiofsd implements additional operations during the
> > servicing of a request from the guest, additional inotify events might
> > be generated and sent to the guest other than the ones the guest
> > expects. However, this is not technically a limitation and it is dependent
> > on the implementation of the remote file system server (in this case
> > virtiofsd).
> >
> > 5) The current implementation supports only Inotify, due to its
> > simplicity, and not Fanotify. Fanotify's complexity requires support from
> > virtiofsd that is not currently available. One such example is
> > Fsnotify's access permission decision capabilities, which could
> > conflict with virtiofsd's current access permission implementation.
> 
> Good example, bad decision.
> It is perfectly fine for a remote server to provide a "supported event mask"
> and leave permission events out of the game.
> 
> Imagine a remote SMB server: it also does not support all of the events
> that the local application would like to set.
> 
> That should not be a reason to rule out fanotify, only specific
> fanotify events.
> 
> Same goes for FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM:
> a remote server may or may not support anything other than watching
> inode objects, but that should not be a limit of the "remote fsnotify" API.

Agreed. If limited fanotify functionality is acceptable, then this should
be written in such a way that one can easily extend it later to provide
that limited fanotify support.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 15:23 ` [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Amir Goldstein
  2021-10-26 15:52   ` Vivek Goyal
@ 2021-10-26 16:18   ` Vivek Goyal
  2021-10-26 17:59     ` Amir Goldstein
  1 sibling, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-10-26 16:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:

[..]
> > 3) The lifetime of the local watch in the guest kernel is very
> > important. Specifically, there is a possibility that the guest does not
> > receive remote events on time, if it removes its local watch on the
> > target or deletes the inode (and thus the guest kernel removes the watch).
> > In these cases the guest kernel removes the local watch before the
> > remote events arrive from the host (virtiofsd) and as such the guest
> > kernel drops all the remote events for the target inode (since the
> > corresponding local watch does not exist anymore).

So this is one of the issues which has been haunting us in virtiofs. If
a file is removed, for local events the event is generated first and
then the watch is removed. But in the case of remote filesystems, it is racy.
It is possible that by the time the event arrives, the watch is already gone
and the application never sees the delete event.

Not sure how to address this issue.

Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 16:18   ` Vivek Goyal
@ 2021-10-26 17:59     ` Amir Goldstein
  2021-10-26 18:27       ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 17:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
>
> [..]
> > > 3) The lifetime of the local watch in the guest kernel is very
> > > important. Specifically, there is a possibility that the guest does not
> > > receive remote events on time, if it removes its local watch on the
> > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > In these cases the guest kernel removes the local watch before the
> > > remote events arrive from the host (virtiofsd) and as such the guest
> > > kernel drops all the remote events for the target inode (since the
> > > corresponding local watch does not exist anymore).
>
> So this is one of the issues which has been haunting us in virtiofs. If
> a file is removed, for local events the event is generated first and
> then the watch is removed. But in the case of remote filesystems, it is racy.
> It is possible that by the time the event arrives, the watch is already gone
> and the application never sees the delete event.
>
> Not sure how to address this issue.

Can you take me through the scenario step by step.
I am not sure I understand the exact sequence of the race.
If it is local file removal that causes watch to be removed,
then don't drop local events and you are good to go.
Is it something else?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 15:52   ` Vivek Goyal
@ 2021-10-26 18:19     ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 18:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

> > > As a result, when an event occurs on an inode within the directory
> > > exported by virtiofs, two events will be generated at the same time; a
> > > local event (generated by the guest kernel) and a remote event (generated
> > > by the host), and thus the guest will receive duplicate events.
> > >
> > > To account for this issue we implemented two modes: one where local events
> > > function as expected (when virtiofsd does not support the remote
> > > inotify) and one where the local events are suppressed and only the
> > > remote events originating from the host side are let through (when
> > > virtiofsd supports the remote inotify).
> >
> > Dropping events from the local side would be weird.
> > Avoiding duplicate events is not a good enough reason IMO
> > compared to the problems this could cause.
> > I am not convinced this is worth it.
>
> So what should be done? If we don't drop local events, then the application
> will see both local and remote events. And then we will have to build
> into the API the knowledge that an event can be either local or remote.
> The application should be able to distinguish between the two and act
> accordingly. That sounds like a lot. And I am not even sure if applications
> care about local events in the case of a remote shared filesystem.
>

I am not completely ruling out the S_NONOTIFY inode flag, but I am not
yet convinced that it is needed.
So what if applications get duplicate events? What's the worst thing that
could happen?
And once again, merging most of the duplicate events is pretty easy
and fanotify does that already.

> I have no experience with inotify/fanotify/fsnotify and what people
> expect from the inotify/fanotify API. So we are open to all ideas
> w.r.t. what would be a good design to support this on remote
> filesystems. Currently the whole infrastructure seems to be written with
> local filesystems in mind.
>

I have no control of what users expect from inotify/fanotify
but I do know that some users' expectations are misaligned with
how inotify/fanotify actually works.

For example, some users may expect events not to be reordered,
but there is no such guarantee.

Some inotify users would expect to find a file with the filename
in the event, without realizing that this is a snapshot of a past
filename, which may no longer exist or may refer to a completely
different object.

Those are some misconceptions that we tried to address with
the fanotify FAN_REPORT_DFID_NAME APIs and hopefully we also
documented the expectations better in the man page.

The use of NFS file handles to identify objects in the FAN_REPORT_FID
modes enables getting events also for already dead objects
(without keeping the inode alive like inotify).

I didn't want to overcomplicate things for Ioannis at this early stage,
but I think that FUSE fsnotify should be tied to LOOKUP_HANDLE.
All FUSE remote objects should be described by file handles,
which is the same manner in which network protocols work,
and file handles should be used to describe objects in
fsnotify FUSE events.

But I am getting ahead of myself with a lot of hand waving....

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 17:59     ` Amir Goldstein
@ 2021-10-26 18:27       ` Vivek Goyal
  2021-10-26 19:04         ` Amir Goldstein
       [not found]         ` <CAO17o20sdKAWQN6w7Oe0Ze06qcK+J=6rrmA_aWGnY__MRVDCKw@mail.gmail.com>
  0 siblings, 2 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-10-26 18:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> >
> > [..]
> > > > 3) The lifetime of the local watch in the guest kernel is very
> > > > important. Specifically, there is a possibility that the guest does not
> > > > receive remote events on time, if it removes its local watch on the
> > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > > In these cases the guest kernel removes the local watch before the
> > > > remote events arrive from the host (virtiofsd) and as such the guest
> > > > kernel drops all the remote events for the target inode (since the
> > > > corresponding local watch does not exist anymore).
> >
> > So this is one of the issues which has been haunting us in virtiofs. If
> > a file is removed, for local events the event is generated first and
> > then the watch is removed. But in the case of remote filesystems, it is racy.
> > It is possible that by the time the event arrives, the watch is already gone
> > and the application never sees the delete event.
> >
> > Not sure how to address this issue.
> 

> Can you take me through the scenario step by step.
> I am not sure I understand the exact sequence of the race.

Ioannis, please correct me if I get something wrong. You know the exact
details much better than me.

A. Say a guest process unlinks a file.
B. Fuse sends an unlink request to server (virtiofsd)
C. File is unlinked on host. Assume there are no other users so inode
   will be freed as well. And event will be generated on host and watch
   removed.
D. Now the Fuse server will send an unlink request reply. The unlink
   notification might still be in kernel buffers, still be in virtiofsd,
   or be in the virtiofs virtqueue.
E. Fuse client will receive unlink reply and remove local watch.

Fuse reply and notification event are now traveling in parallel on
different virtqueues and there is no connection between these two. And
it could very well happen that fuse reply comes first, gets processed
first and local watch is removed. And notification is processed right
after but by then local watch is gone and filesystem will be forced to
drop event.

As of now the situation is more complicated in virtiofsd. We don't keep
a file handle open for the file; instead we keep an O_PATH fd open for
each file. That means in step D above, the inode on the host is not freed
yet and the unlink event is not generated yet. When the unlink reply
reaches the fuse client, it sends FORGET messages to the server, the
server then closes the O_PATH fd, and only then does the host generate
the unlink event. By that time it's too late; the guest has already
removed its local watches (and triggered removal of the remote watches too).

This second problem can probably be solved by using file handles, but
the basic race will still be there.

> If it is local file removal that causes watch to be removed,
> then don't drop local events and you are good to go.
> Is it something else?

- If remote events are enabled, then the idea will be that user space gets
  an event when the file is actually removed from the server, right? Now it
  is possible that another VM has this file open and the file has not yet
  been removed. So a local event only tells you that the file has been
  removed in the guest VM (or locally) but says nothing about the state
  of the file on the server. (It has been unlinked on the server but the
  inode continues to be alive internally.)

- If the user receives both the local and the remote delete event, it will
  be confusing. I guess if we want to see both events, then there has to
  be some sort of info in the event which classifies whether the event is
  local or remote, and the application can act accordingly.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-26 14:56   ` Amir Goldstein
@ 2021-10-26 18:28     ` Amir Goldstein
       [not found]     ` <CAO17o20+jiij64y7b3eKoCjG5b_mLZj6o1LSnZ7+8exN3dFYEg@mail.gmail.com>
  2021-10-27 21:46     ` Vivek Goyal
  2 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 18:28 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal

On Tue, Oct 26, 2021 at 5:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> > Since fsnotify is the backend for the inotify subsystem all the backend
> > code implementation we add is related to fsnotify.
> >
> > To support an fsnotify request in FUSE and specifically virtiofs we add a
> > new opcode for the FSNOTIFY (51) operation request in the "fuse.h" header.
> >
> > Also add the "fuse_notify_fsnotify_in" and "fuse_notify_fsnotify_out"
> > structs that are responsible for passing the fsnotify/inotify related data
> > to and from the FUSE server.
> >
> > Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> > ---
> >  include/uapi/linux/fuse.h | 23 ++++++++++++++++++++++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 46838551ea84..418b7fc72417 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -186,6 +186,9 @@
> >   *  - add FUSE_SYNCFS
> >   *  7.35
> >   *  - add FUSE_NOTIFY_LOCK
> > + *  7.36
> > + *  - add FUSE_HAVE_FSNOTIFY
> > + *  - add fuse_notify_fsnotify_(in,out)
> >   */
> >
> >  #ifndef _LINUX_FUSE_H
> > @@ -221,7 +224,7 @@
> >  #define FUSE_KERNEL_VERSION 7
> >
> >  /** Minor version number of this interface */
> > -#define FUSE_KERNEL_MINOR_VERSION 35
> > +#define FUSE_KERNEL_MINOR_VERSION 36
> >
> >  /** The node ID of the root inode */
> >  #define FUSE_ROOT_ID 1
> > @@ -338,6 +341,7 @@ struct fuse_file_lock {
> >   *                     write/truncate sgid is killed only if file has group
> >   *                     execute permission. (Same as Linux VFS behavior).
> >   * FUSE_SETXATTR_EXT:  Server supports extended struct fuse_setxattr_in
> > + * FUSE_HAVE_FSNOTIFY: remote fsnotify/inotify event subsystem support
> >   */
> >  #define FUSE_ASYNC_READ                (1 << 0)
> >  #define FUSE_POSIX_LOCKS       (1 << 1)
> > @@ -369,6 +373,7 @@ struct fuse_file_lock {
> >  #define FUSE_SUBMOUNTS         (1 << 27)
> >  #define FUSE_HANDLE_KILLPRIV_V2        (1 << 28)
> >  #define FUSE_SETXATTR_EXT      (1 << 29)
> > +#define FUSE_HAVE_FSNOTIFY     (1 << 30)
> >
> >  /**
> >   * CUSE INIT request/reply flags
> > @@ -515,6 +520,7 @@ enum fuse_opcode {
> >         FUSE_SETUPMAPPING       = 48,
> >         FUSE_REMOVEMAPPING      = 49,
> >         FUSE_SYNCFS             = 50,
> > +       FUSE_FSNOTIFY           = 51,
> >
> >         /* CUSE specific operations */
> >         CUSE_INIT               = 4096,
> > @@ -532,6 +538,7 @@ enum fuse_notify_code {
> >         FUSE_NOTIFY_RETRIEVE = 5,
> >         FUSE_NOTIFY_DELETE = 6,
> >         FUSE_NOTIFY_LOCK = 7,
> > +       FUSE_NOTIFY_FSNOTIFY = 8,
> >         FUSE_NOTIFY_CODE_MAX,
> >  };
> >
> > @@ -571,6 +578,20 @@ struct fuse_getattr_in {
> >         uint64_t        fh;
> >  };
> >
> > +struct fuse_notify_fsnotify_out {
> > +       uint64_t inode;
>
> A 64-bit inode number is not a good unique identifier of the object.
> You need to either include the generation in the object identifier
> or, much better, use the object's NFS file handle, the same way
> that fanotify stores object identifiers.
>
> > +       uint64_t mask;
> > +       uint32_t namelen;
> > +       uint32_t cookie;
>
> I object to persisting with the two-events-joined-by-cookie design.
> Any new design should include a single event for rename
> with information about src and dst.
>
> I know this is inconvenient, but we are NOT going to create a "remote inotify"
> interface, we need to create a "remote fsnotify" interface and if server wants
> to use inotify, it will need to join the disjoined MOVE_FROM/TO event into
> a single "remote event", that FUSE will use to call fsnotify_move().
>

TBH, the disjoint vs. joint from/to event is unfinished business
for fanotify.
So my objection above is more of a strong wish.
But I admit that if existing network protocols already encode the disjoint
from/to event semantics, I may need to walk back that objection.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-26 18:27       ` Vivek Goyal
@ 2021-10-26 19:04         ` Amir Goldstein
       [not found]         ` <CAO17o20sdKAWQN6w7Oe0Ze06qcK+J=6rrmA_aWGnY__MRVDCKw@mail.gmail.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-26 19:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 9:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> > >
> > > [..]
> > > > > 3) The lifetime of the local watch in the guest kernel is very
> > > > > important. Specifically, there is a possibility that the guest does not
> > > > > receive remote events on time, if it removes its local watch on the
> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > > > In these cases the guest kernel removes the local watch before the
> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> > > > > kernel drops all the remote events for the target inode (since the
> > > > > corresponding local watch does not exist anymore).
> > >
> > > So this is one of the issues which has been haunting us in virtiofs. If
> > > a file is removed, for local events, event is generated first and
> > > then watch is removed. But in case of remote filesystems, it is racy.
> > > It is possible that by the time event arrives, watch is already gone
> > > and application never sees the delete event.
> > >
> > > Not sure how to address this issue.
> >
>
> > Can you take me through the scenario step by step.
> > I am not sure I understand the exact sequence of the race.
>
> Ioannis, please correct me If I get something wrong. You know exact
> details much more than me.
>
> A. Say a guest process unlinks a file.
> B. Fuse sends an unlink request to server (virtiofsd)
> C. File is unlinked on host. Assume there are no other users so inode
>    will be freed as well. And event will be generated on host and watch
>    removed.
> D. Now Fuse server will send a unlink request reply. unlink notification
>    might still be in kernel buffers or still be in virtiofsd or could
>    be in virtiofs virtqueue.
> E. Fuse client will receive unlink reply and remove local watch.
>
> Fuse reply and notification event are now traveling in parallel on
> different virtqueues and there is no connection between these two. And
> it could very well happen that fuse reply comes first, gets processed
> first and local watch is removed. And notification is processed right
> after but by then local watch is gone and filesystem will be forced to
> drop event.
>
> As of now situation is more complicated in virtiofsd. We don't keep
> file handle open for file and keep an O_PATH fd open for each file.
> That means in step D above, inode on host is not freed yet and unlink
> event is not generated yet. When unlink reply reaches fuse client,
> it sends FORGET messages to server, and then server closes O_PATH fd
> and then host generates unlink events. By that time its too late,
> guest has already remove local watches (and triggered removal of
> remote watches too).
>
> This second problem probably can be solved by using file handles, but
> basic race will still continue to be there.
>
> > If it is local file removal that causes watch to be removed,
> > then don't drop local events and you are good to go.
> > Is it something else?
>
> - If remote events are enabled, then idea will be that user space gets
>   and event when file is actually removed from server, right? Now it
>   is possible that another VM has this file open and file has not been
>   yet removed. So local event only tells you that file has been removed
>   in guest VM (or locally) but does not tell anything about the state
>   of file on server. (It has been unlinked on server but inode continues
>   to be alive internall).
>
> - If user receives both local and remote delete event, it will be
>   confusing. I guess if we want to see both the events, then there
>   has to be some sort of info in event which classifies whether event
>   is local or remote. And let application act accordingly.
>

Maybe. I am not sure this is the way to go.

There are several options for dealing with this situation.
The thing is that applications cannot usually rely on getting
DELETE_SELF events anyway, because many different things can keep
the inode refcount elevated, even on local filesystems.

Perhaps the only way for a FUSE (or any network) client to know that
the object on the server was really deleted is to issue a lookup
request by file handle and check for ESTALE.

It really sounds to me like DELETE_SELF should not be reported at all
in a first implementation.
The DELETE_SELF scenario is very easy to handle with a filesystem-wide
watch, so bear that in mind as a possible application-level solution.

Thanks,
Amir.

* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
       [not found]     ` <CAO17o20+jiij64y7b3eKoCjG5b_mLZj6o1LSnZ7+8exN3dFYEg@mail.gmail.com>
@ 2021-10-27  5:46       ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-27  5:46 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: linux-fsdevel, virtio-fs-list, linux-kernel, Jan Kara, Al Viro,
	Miklos Szeredi, Vivek Goyal

>> > +       uint64_t mask;
>> > +       uint32_t namelen;
>> > +       uint32_t cookie;
>>
>> I object to persisting with the two-events-joined-by-cookie design.
>> Any new design should include a single event for rename
>> with information about src and dst.
>>
>> I know this is inconvenient, but we are NOT going to create a "remote inotify"
>> interface, we need to create a "remote fsnotify" interface and if server wants
>> to use inotify, it will need to join the disjoined MOVE_FROM/TO event into
>> a single "remote event", that FUSE will use to call fsnotify_move().
>>
> I do not fully understand this. Should we define a new "remote" event
> that corresponds to the merged MOVE_FROM/TO events and send that to the guest instead?

Yes.

First of all, my objection to the old cookie API stems from seeing
applications that do not use it correctly, and because I find it hard
to use in general.

For those reasons, when we extended fanotify to provide "inotify compatible"
semantics, the cookie API was not carried over to fanotify.

You can see an example of "inotify compatible" API in the implementation of
fsnotifywatch [1].

That said, the functionality of joining to/from events is still missing in
fanotify and I have posted several proposals for an API extension that
will solve it using an event that contains information on both src and dst [2].

> If that is the case, wouldn't that require that user space processes be aware of this newly added "remote"
> event?

No, the application will not be aware of the FUSE notify protocol at all.
If the application is using inotify, it will still get the old
from+to+cookie events, but the FUSE server will notify the guest with
{MOVE, inode, from_path, to_path} and the FUSE client will call
fsnotify_move() or a variant.
Then inotify watchers will get what they are used to and fsnotify watchers
will get whatever information the existing or future API provides.

To explain this from another perspective, you currently implemented the
virtiofsd watch using inotify, but you should not limit virtiofsd to inotify.
Once you learn the benefits of fanotify, you may consider implementing
the virtiofsd watches using fanotify. It should not matter for the inotify
application in the guest at all.

With the current inotify-based watch implementation in virtiofsd, this
does cause the inconvenience of having to match the from/to events
before reporting the event to the guest, but inconvenience for one
specific implementation is not what should drive the API design.

> Also in the inotify man page there is a limitations case that states that it is possible for one of these events
> to be missing. What should the server do in this case?
>

The protocol should be able to carry both src and dst information, but
you could leave the src (of a lone moved_to) or the dst (of a lone
moved_from) empty. In that case the FUSE client cannot call
fsnotify_move() as is, so we can either change fsnotify_move() or use
a variant for reporting remote rename events.

The extra info for rename events with the FAN_REPORT_TARGET_FID
proposal is also designed to be optional.

Thanks,
Amir.

[1] https://man7.org/linux/man-pages/man1/fsnotifywatch.1.html
[2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjYDDk00VPdWtRB1_tf+gCoPFgSQ9O0p0fGaW_JiFUUKA@mail.gmail.com/

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
       [not found]         ` <CAO17o20sdKAWQN6w7Oe0Ze06qcK+J=6rrmA_aWGnY__MRVDCKw@mail.gmail.com>
@ 2021-10-27  5:59           ` Amir Goldstein
  2021-10-27 13:23             ` Jan Kara
  2021-10-27 20:24             ` Vivek Goyal
  0 siblings, 2 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-27  5:59 UTC (permalink / raw)
  To: Ioannis Angelakopoulos
  Cc: Vivek Goyal, linux-fsdevel, virtio-fs-list, linux-kernel,
	Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
<iangelak@redhat.com> wrote:
>
>
>
> On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
>> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>> > >
>> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
>> > >
>> > > [..]
>> > > > > 3) The lifetime of the local watch in the guest kernel is very
>> > > > > important. Specifically, there is a possibility that the guest does not
>> > > > > receive remote events on time, if it removes its local watch on the
>> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
>> > > > > In these cases the guest kernel removes the local watch before the
>> > > > > remote events arrive from the host (virtiofsd) and as such the guest
>> > > > > kernel drops all the remote events for the target inode (since the
>> > > > > corresponding local watch does not exist anymore).
>> > >
>> > > So this is one of the issues which has been haunting us in virtiofs. If
>> > > a file is removed, for local events, event is generated first and
>> > > then watch is removed. But in case of remote filesystems, it is racy.
>> > > It is possible that by the time event arrives, watch is already gone
>> > > and application never sees the delete event.
>> > >
>> > > Not sure how to address this issue.
>> >
>>
>> > Can you take me through the scenario step by step.
>> > I am not sure I understand the exact sequence of the race.
>>
>> Ioannis, please correct me If I get something wrong. You know exact
>> details much more than me.
>>
>> A. Say a guest process unlinks a file.
>> B. Fuse sends an unlink request to server (virtiofsd)
>> C. File is unlinked on host. Assume there are no other users so inode
>>    will be freed as well. And event will be generated on host and watch
>>    removed.
>> D. Now Fuse server will send a unlink request reply. unlink notification
>>    might still be in kernel buffers or still be in virtiofsd or could
>>    be in virtiofs virtqueue.
>> E. Fuse client will receive unlink reply and remove local watch.
>>
>> Fuse reply and notification event are now traveling in parallel on
>> different virtqueues and there is no connection between these two. And
>> it could very well happen that fuse reply comes first, gets processed
>> first and local watch is removed. And notification is processed right
>> after but by then local watch is gone and filesystem will be forced to
>> drop event.
>>
>> As of now situation is more complicated in virtiofsd. We don't keep
>> file handle open for file and keep an O_PATH fd open for each file.
>> That means in step D above, inode on host is not freed yet and unlink
>> event is not generated yet. When unlink reply reaches fuse client,
>> it sends FORGET messages to server, and then server closes O_PATH fd
>> and then host generates unlink events. By that time its too late,
>> guest has already remove local watches (and triggered removal of
>> remote watches too).
>>
>> This second problem probably can be solved by using file handles, but
>> basic race will still continue to be there.
>>
>> > If it is local file removal that causes watch to be removed,
>> > then don't drop local events and you are good to go.
>> > Is it something else?
>>
>> - If remote events are enabled, then idea will be that user space gets
>>   and event when file is actually removed from server, right? Now it
>>   is possible that another VM has this file open and file has not been
>>   yet removed. So local event only tells you that file has been removed
>>   in guest VM (or locally) but does not tell anything about the state
>>   of file on server. (It has been unlinked on server but inode continues
>>   to be alive internall).
>>
>> - If user receives both local and remote delete event, it will be
>>   confusing. I guess if we want to see both the events, then there
>>   has to be some sort of info in event which classifies whether event
>>   is local or remote. And let application act accordingly.
>>
>> Thanks
>> Vivek
>>
>
> Hello Amir!
>
> Sorry for taking part in the conversation a bit late.  Vivek was on point with the
> example he gave but the race is a bit more generic than only the DELETE event.
>
> Let's say that a guest process monitors an inode for OPEN events:
>
> 1) The same guest process or another guest process opens the file (related to the
> monitored inode), and then closes and immediately deletes the file/inode.
> 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
>      a) Will open the file on the host side and thus a remote OPEN event is going to
>      be generated on the host and sent to the guest.
>      b) Will unlink the remote inode and if no other host process uses the inode then the
>      inode will be freed and a DELETE event is going to be generated on the host and sent
>      to the guest (However, due to how virtiofsd works and Vivek mentioned, this step won't
>      happen immediately)
>

You are confusing DELETE with DELETE_SELF.
DELETE corresponds to unlink(), so you get a DELETE event even if the
inode is a hardlink with nlink > 0 after unlink().

The DELETE event is reported (along with the filename) against the
parent directory inode, so the test case above won't drop the event.

> The problem here is that the OPEN event might still be travelling towards the guest in the
> virtqueues and arrives after the guest has already deleted its local inode.
> While the remote event (OPEN) received by the guest is valid, its fsnotify
> subsystem will drop it since the local inode is not there.
>

I have a feeling that we are mixing issues related to shared server
and remote fsnotify.
Does virtiofsd support multiple guests already? There are many other
issues related to cache coherency that should be dealt with in this
case, some of them overlapping the problem that you describe, so
solving the narrow problem of dropped remote events seems like the
wrong way to approach the problem.

I think that in a shared-server situation the simple LOOKUP/FORGET
protocol will not suffice. I will not even try to solve the generic
problem, but will just mention that the SMB/NFS protocols use
delegations/oplocks to advertise object usage by other clients to all
clients.

I think this issue is far outside the scope of your project and you
should just leave the dropped events as a known limitation at this
point. inotify has the IN_IGNORED event, which applications can use as
a hint that some events could have been dropped.

Thanks,
Amir.

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27  5:59           ` Amir Goldstein
@ 2021-10-27 13:23             ` Jan Kara
  2021-10-27 20:29               ` Vivek Goyal
  2021-10-27 20:24             ` Vivek Goyal
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Kara @ 2021-10-27 13:23 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, Vivek Goyal, linux-fsdevel,
	virtio-fs-list, linux-kernel, Jan Kara, Al Viro, Miklos Szeredi,
	Steve French

On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > The problem here is that the OPEN event might still be travelling towards the guest in the
> > virtqueues and arrives after the guest has already deleted its local inode.
> > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > subsystem will drop it since the local inode is not there.
> >
> 
> I have a feeling that we are mixing issues related to shared server
> and remote fsnotify.

I don't think Ioannis was speaking about the shared server case here. I
think he is saying that in a simple FUSE remote notification setup we
can lose OPEN events (or basically any other event) if the inode for
which the event happens gets deleted soon enough after the event is
generated. That indeed seems somewhat unexpected and could be confusing
if it happens e.g. for some directory operations.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27  5:59           ` Amir Goldstein
  2021-10-27 13:23             ` Jan Kara
@ 2021-10-27 20:24             ` Vivek Goyal
  2021-10-28  5:11               ` Amir Goldstein
  1 sibling, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-10-27 20:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Wed, Oct 27, 2021 at 08:59:15AM +0300, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> >
> >
> > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >>
> >> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> >> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > >
> >> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> >> > >
> >> > > [..]
> >> > > > > 3) The lifetime of the local watch in the guest kernel is very
> >> > > > > important. Specifically, there is a possibility that the guest does not
> >> > > > > receive remote events on time, if it removes its local watch on the
> >> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> >> > > > > In these cases the guest kernel removes the local watch before the
> >> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> >> > > > > kernel drops all the remote events for the target inode (since the
> >> > > > > corresponding local watch does not exist anymore).
> >> > >
> >> > > So this is one of the issues which has been haunting us in virtiofs. If
> >> > > a file is removed, for local events, event is generated first and
> >> > > then watch is removed. But in case of remote filesystems, it is racy.
> >> > > It is possible that by the time event arrives, watch is already gone
> >> > > and application never sees the delete event.
> >> > >
> >> > > Not sure how to address this issue.
> >> >
> >>
> >> > Can you take me through the scenario step by step.
> >> > I am not sure I understand the exact sequence of the race.
> >>
> >> Ioannis, please correct me If I get something wrong. You know exact
> >> details much more than me.
> >>
> >> A. Say a guest process unlinks a file.
> >> B. Fuse sends an unlink request to server (virtiofsd)
> >> C. File is unlinked on host. Assume there are no other users so inode
> >>    will be freed as well. And event will be generated on host and watch
> >>    removed.
> >> D. Now Fuse server will send a unlink request reply. unlink notification
> >>    might still be in kernel buffers or still be in virtiofsd or could
> >>    be in virtiofs virtqueue.
> >> E. Fuse client will receive unlink reply and remove local watch.
> >>
> >> Fuse reply and notification event are now traveling in parallel on
> >> different virtqueues and there is no connection between these two. And
> >> it could very well happen that fuse reply comes first, gets processed
> >> first and local watch is removed. And notification is processed right
> >> after but by then local watch is gone and filesystem will be forced to
> >> drop event.
> >>
> >> As of now situation is more complicated in virtiofsd. We don't keep
> >> file handle open for file and keep an O_PATH fd open for each file.
> >> That means in step D above, inode on host is not freed yet and unlink
> >> event is not generated yet. When unlink reply reaches fuse client,
> >> it sends FORGET messages to server, and then server closes O_PATH fd
> >> and then host generates unlink events. By that time its too late,
> >> guest has already remove local watches (and triggered removal of
> >> remote watches too).
> >>
> >> This second problem probably can be solved by using file handles, but
> >> basic race will still continue to be there.
> >>
> >> > If it is local file removal that causes watch to be removed,
> >> > then don't drop local events and you are good to go.
> >> > Is it something else?
> >>
> >> - If remote events are enabled, then idea will be that user space gets
> >>   and event when file is actually removed from server, right? Now it
> >>   is possible that another VM has this file open and file has not been
> >>   yet removed. So local event only tells you that file has been removed
> >>   in guest VM (or locally) but does not tell anything about the state
> >>   of file on server. (It has been unlinked on server but inode continues
> >>   to be alive internall).
> >>
> >> - If user receives both local and remote delete event, it will be
> >>   confusing. I guess if we want to see both the events, then there
> >>   has to be some sort of info in event which classifies whether event
> >>   is local or remote. And let application act accordingly.
> >>
> >> Thanks
> >> Vivek
> >>
> >
> > Hello Amir!
> >
> > Sorry for taking part in the conversation a bit late.  Vivek was on point with the
> > example he gave but the race is a bit more generic than only the DELETE event.
> >
> > Let's say that a guest process monitors an inode for OPEN events:
> >
> > 1) The same guest process or another guest process opens the file (related to the
> > monitored inode), and then closes and immediately deletes the file/inode.
> > 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
> >      a) Will open the file on the host side and thus a remote OPEN event is going to
> >      be generated on the host and sent to the guest.
> >      b) Will unlink the remote inode and if no other host process uses the inode then the
> >      inode will be freed and a DELETE event is going to be generated on the host and sent
> >      to the guest (However, due to how virtiofsd works and Vivek mentioned, this step won't
> >      happen immediately)
> >
> 
> You are confusing DELETE with DELETE_SELF.
> DELETE corresponds to unlink(), so you get a DELETE event even if
> inode is a hardlink
> with nlink > 0 after unlink().
> 
> The DELETE event is reported (along with filename) against the parent directory
> inode, so the test case above won't drop the event.

Hi Amir,

Agreed that there is confusion between the DELETE and DELETE_SELF
events. I think Ioannis is referring to the DELETE_SELF event. With
this example he is trying to emphasize that, due to races, the problem
is not limited to DELETE_SELF: other events could arrive a little
after the local watch in the guest has been removed, and then all
those events will be dropped as well. That is why he gave the OPEN
event example. And I think a remote IN_IGNORED might face the same
fate.

In the case of IN_IGNORED, I am wondering whether it is ok to generate
that event locally instead.

> 
> > The problem here is that the OPEN event might still be travelling towards the guest in the
> > virtqueues and arrives after the guest has already deleted its local inode.
> > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > subsystem will drop it since the local inode is not there.
> >
> 
> I have a feeling that we are mixing issues related to shared server
> and remote fsnotify.
> Does virtiofsd support multiple guests already?

I would like to think that there are not many basic issues with the
shared directory configuration, so we don't stop users from using it.

> There are many other
> issues related
> to cache coherency that should be dealt with in this case, some of
> them overlap the
> problem that you describe, so solving the narrow problem of dropped
> remote events
> seems like the wrong way to approach the problem.

The dropped-remote-event problem/race will exist even without a shared
server. Remote events travel through a different virtqueue than fuse
request replies, so there is no guarantee in what order events or
replies will be processed.

> 
> I think that in a shared server situation, the simple LOOKUP/FORGET protocol
> will not suffice.

Hmm, so what's the problem with LOOKUP/FORGET in the shared dir case?

> I will not even try to solve the generic problem,
> but will just
> mentioned that SMB/NFS protocols use delegations/oplocks in the protocol
> to advertise object usage by other clients to all clients.

I think Miklos was looking into the idea of some sort of file leases
in fuse for creating an equivalent of delegations. Not sure if that
work made any progress.

> 
> I think this issue is far outside the scope of your project and you should
> just leave the dropped events as a known limitation at this point.

Maybe that's what we should do to begin with: just say that these
events can be lost or may never arrive.

> inotify has the event IN_IGNORED that application can use as a hint that some
> events could have been dropped.

That will probably require generating IN_IGNORED locally when the
local watch goes away (and not relying on a remote IN_IGNORED), IIUC.

Thanks
Vivek


* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27 13:23             ` Jan Kara
@ 2021-10-27 20:29               ` Vivek Goyal
  2021-10-27 20:37                 ` Vivek Goyal
  2021-11-02 11:09                 ` Jan Kara
  0 siblings, 2 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-10-27 20:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > <iangelak@redhat.com> wrote:
> > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > virtqueues and arrives after the guest has already deleted its local inode.
> > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > subsystem will drop it since the local inode is not there.
> > >
> > 
> > I have a feeling that we are mixing issues related to shared server
> > and remote fsnotify.
> 
> I don't think Ioannis was speaking about shared server case here. I think
> he says that in a simple FUSE remote notification setup we can loose OPEN
> events (or basically any other) if the inode for which the event happens
> gets deleted sufficiently early after the event being generated. That seems
> indeed somewhat unexpected and could be confusing if it happens e.g. for
> some directory operations.

Hi Jan,

Agreed. That's what Ioannis is trying to say: some of the remote events
can be lost if the fuse/guest local inode is unlinked. I think the
problem exists for both the shared and non-shared directory cases.

With local filesystems we have control: we can first queue up the event
in a buffer before we remove the local watches. With events travelling
from a remote server there is no such control/synchronization. It can
very well happen that events get delayed somewhere in the communication
path, the local watches go away, and then there is no way to deliver
those events to the application.

Thanks
Vivek


* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27 20:29               ` Vivek Goyal
@ 2021-10-27 20:37                 ` Vivek Goyal
  2021-11-02 11:09                 ` Jan Kara
  1 sibling, 0 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-10-27 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Wed, Oct 27, 2021 at 04:29:41PM -0400, Vivek Goyal wrote:
> On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > <iangelak@redhat.com> wrote:
> > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > subsystem will drop it since the local inode is not there.
> > > >
> > > 
> > > I have a feeling that we are mixing issues related to shared server
> > > and remote fsnotify.
> > 
> > I don't think Ioannis was speaking about shared server case here. I think
> > he says that in a simple FUSE remote notification setup we can loose OPEN
> > events (or basically any other) if the inode for which the event happens
> > gets deleted sufficiently early after the event being generated. That seems
> > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > some directory operations.
> 
> Hi Jan,
> 
> Agreed. That's what Ioannis is trying to say. That some of the remote events
> can be lost if fuse/guest local inode is unlinked. I think problem exists
> both for shared and non-shared directory case.

Thinking more about it, one will need remote fsnotify support only in
the shared directory case. If there are no other clients doing parallel
operations in the shared dir (like dropping in new files), then local
inotify should be good enough. And I think Ioannis mentioned that local
inotify already works with fuse.

For example, kata container developers wanted to use inotify on
virtiofs because a process on the host (kubernetes?) updates some sort
of config file in the shared directory. They wanted to monitor that
config file in the guest, get an inotify event and then act on the
updated config file. That's the shared dir case.

Given we do not support inotify yet, I think they went ahead with some
sort of polling mechanism to take care of their use case.

Thanks
Vivek


* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-26 14:56   ` Amir Goldstein
  2021-10-26 18:28     ` Amir Goldstein
       [not found]     ` <CAO17o20+jiij64y7b3eKoCjG5b_mLZj6o1LSnZ7+8exN3dFYEg@mail.gmail.com>
@ 2021-10-27 21:46     ` Vivek Goyal
  2021-10-28  4:13       ` Amir Goldstein
  2 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-10-27 21:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

On Tue, Oct 26, 2021 at 05:56:03PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> > Since fsnotify is the backend for the inotify subsystem all the backend
> > code implementation we add is related to fsnotify.
> >
> > To support an fsnotify request in FUSE and specifically virtiofs we add a
> > new opcode for the FSNOTIFY (51) operation request in the "fuse.h" header.
> >
> > Also add the "fuse_notify_fsnotify_in" and "fuse_notify_fsnotify_out"
> > structs that are responsible for passing the fsnotify/inotify related data
> > to and from the FUSE server.
> >
> > Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> > ---
> >  include/uapi/linux/fuse.h | 23 ++++++++++++++++++++++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 46838551ea84..418b7fc72417 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -186,6 +186,9 @@
> >   *  - add FUSE_SYNCFS
> >   *  7.35
> >   *  - add FUSE_NOTIFY_LOCK
> > + *  7.36
> > + *  - add FUSE_HAVE_FSNOTIFY
> > + *  - add fuse_notify_fsnotify_(in,out)
> >   */
> >
> >  #ifndef _LINUX_FUSE_H
> > @@ -221,7 +224,7 @@
> >  #define FUSE_KERNEL_VERSION 7
> >
> >  /** Minor version number of this interface */
> > -#define FUSE_KERNEL_MINOR_VERSION 35
> > +#define FUSE_KERNEL_MINOR_VERSION 36
> >
> >  /** The node ID of the root inode */
> >  #define FUSE_ROOT_ID 1
> > @@ -338,6 +341,7 @@ struct fuse_file_lock {
> >   *                     write/truncate sgid is killed only if file has group
> >   *                     execute permission. (Same as Linux VFS behavior).
> >   * FUSE_SETXATTR_EXT:  Server supports extended struct fuse_setxattr_in
> > + * FUSE_HAVE_FSNOTIFY: remote fsnotify/inotify event subsystem support
> >   */
> >  #define FUSE_ASYNC_READ                (1 << 0)
> >  #define FUSE_POSIX_LOCKS       (1 << 1)
> > @@ -369,6 +373,7 @@ struct fuse_file_lock {
> >  #define FUSE_SUBMOUNTS         (1 << 27)
> >  #define FUSE_HANDLE_KILLPRIV_V2        (1 << 28)
> >  #define FUSE_SETXATTR_EXT      (1 << 29)
> > +#define FUSE_HAVE_FSNOTIFY     (1 << 30)
> >
> >  /**
> >   * CUSE INIT request/reply flags
> > @@ -515,6 +520,7 @@ enum fuse_opcode {
> >         FUSE_SETUPMAPPING       = 48,
> >         FUSE_REMOVEMAPPING      = 49,
> >         FUSE_SYNCFS             = 50,
> > +       FUSE_FSNOTIFY           = 51,
> >
> >         /* CUSE specific operations */
> >         CUSE_INIT               = 4096,
> > @@ -532,6 +538,7 @@ enum fuse_notify_code {
> >         FUSE_NOTIFY_RETRIEVE = 5,
> >         FUSE_NOTIFY_DELETE = 6,
> >         FUSE_NOTIFY_LOCK = 7,
> > +       FUSE_NOTIFY_FSNOTIFY = 8,
> >         FUSE_NOTIFY_CODE_MAX,
> >  };
> >
> > @@ -571,6 +578,20 @@ struct fuse_getattr_in {
> >         uint64_t        fh;
> >  };
> >
> > +struct fuse_notify_fsnotify_out {
> > +       uint64_t inode;
> 
> 64bit inode is not a good unique identifier of the object.

I think he wants to store the 64bit nodeid (internal to fuse so that client
and server can figure out which inode they are talking about). But I
think you are concerned about what happens if an event arrives for an
inode after the inode has been released and the nodeid possibly reused for
some other inode. And then we will find that new inode in the guest cache
and end up associating the event with the wrong inode.

A generation number will help in the sense that the server has a chance
to always update the generation number on lookup. So even if a nodeid
is reused, the generation number will make sure we don't end
up associating this event with the reused nodeid's inode. I guess that
makes sense.
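A toy sketch of that reasoning (hypothetical names, not the actual FUSE client code): keying the guest's event routing by (nodeid, generation) makes a stale event self-invalidating:

```python
# Illustrative sketch only: shows why (nodeid, generation) identifies an
# inode unambiguously even when the server reuses nodeids. All names here
# are hypothetical, not part of the FUSE protocol or client code.

class Inode:
    def __init__(self, name):
        self.name = name
        self.events = []

class InodeCache:
    def __init__(self):
        # maps (nodeid, generation) -> guest inode object
        self.by_id = {}

    def lookup(self, nodeid, generation, inode):
        # server bumps the generation every time a nodeid is (re)allocated
        self.by_id[(nodeid, generation)] = inode

    def forget(self, nodeid, generation):
        self.by_id.pop((nodeid, generation), None)

    def deliver_event(self, nodeid, generation, event):
        # a stale event carries the old generation, so a reused nodeid
        # (now registered under a newer generation) never matches it
        inode = self.by_id.get((nodeid, generation))
        if inode is None:
            return None  # drop: event refers to a dead incarnation
        inode.events.append(event)
        return inode

cache = InodeCache()
old = Inode("old-file")
cache.lookup(nodeid=42, generation=1, inode=old)
cache.forget(42, 1)                               # old inode released
new = Inode("new-file")
cache.lookup(nodeid=42, generation=2, inode=new)  # nodeid reused

# an event generated against the old incarnation arrives late
assert cache.deliver_event(42, 1, "IN_OPEN") is None
# an event for the live incarnation is routed correctly
assert cache.deliver_event(42, 2, "IN_OPEN") is new
```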

> you need to either include the generation in object identifier
> or much better use the object's nfs file handle, the same way
> that fanotify stores object identifiers.

I think the nfs file handle is much more complicated and it's a separate
project altogether. I am assuming we are talking about a persistent
nfs file handle as generated by the host. I think the biggest issue we faced
with that is that the guest is untrusted and we don't want to resolve
a file handle provided by the guest on the host, otherwise the guest could
craft file handles and possibly be able to open other files on the same
filesystem outside the shared dir.

> 
> > +       uint64_t mask;
> > +       uint32_t namelen;
> > +       uint32_t cookie;
> 
> I object to persisting with the two-events-joined-by-cookie design.
> Any new design should include a single event for rename
> with information about src and dst.
> 
> I know this is inconvenient, but we are NOT going to create a "remote inotify"
> interface, we need to create a "remote fsnotify" interface and if server wants
> to use inotify, it will need to join the disjoined MOVE_FROM/TO event into
> a single "remote event", that FUSE will use to call fsnotify_move().

man inotify says the following.

"       Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair  generated  by
       rename(2)  is thus inherently racy.  (Don't forget that if an object is
       renamed outside of a monitored directory, there  may  not  even  be  an
       IN_MOVED_TO  event.)"

So if the guest is not monitoring the target dir of the renamed file, then we
will not even get IN_MOVED_TO. In that case we can't merge the two events
into one.

And this sounds like the inotify/fanotify interface needs to come up with
a merged event, and in that case the remote filesystem will simply propagate
that event. (Instead of coming up with a new event only for remote
filesystems. Sounds like this is not a problem limited to remote
filesystems only.)
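To make the raciness concrete: any layer that tried to merge the pair would have to buffer IN_MOVED_FROM by cookie and flush it unmatched when no IN_MOVED_TO ever arrives. An illustrative sketch (names are made up, this is not virtiofsd code):

```python
# Illustrative sketch of cookie pairing; not actual virtiofsd code. It shows
# why merging is best-effort: an IN_MOVED_FROM whose IN_MOVED_TO was never
# generated (target dir unwatched) can only be flushed as a lone event later.

pending = {}   # cookie -> buffered IN_MOVED_FROM event

def on_event(ev, out):
    """ev: dict with 'mask', 'cookie', 'name'; out: list of merged events."""
    if ev["mask"] == "IN_MOVED_FROM":
        pending[ev["cookie"]] = ev          # hold, hoping the pair arrives
        return
    if ev["mask"] == "IN_MOVED_TO" and ev["cookie"] in pending:
        src = pending.pop(ev["cookie"])
        out.append({"mask": "RENAME", "src": src["name"], "dst": ev["name"]})
        return
    out.append(ev)

def flush(out):
    # no IN_MOVED_TO ever came: emit the FROM half with no dst info
    for src in pending.values():
        out.append({"mask": "RENAME", "src": src["name"], "dst": None})
    pending.clear()

events = []
on_event({"mask": "IN_MOVED_FROM", "cookie": 7, "name": "a"}, events)
on_event({"mask": "IN_MOVED_TO", "cookie": 7, "name": "b"}, events)   # pairs up
on_event({"mask": "IN_MOVED_FROM", "cookie": 9, "name": "c"}, events) # never pairs
flush(events)
assert events[0] == {"mask": "RENAME", "src": "a", "dst": "b"}
assert events[1] == {"mask": "RENAME", "src": "c", "dst": None}
```

Note how the unmatched case still needs a "rename with unknown dst" shape, which is essentially the placeholder argument.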

Thanks
Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-27 21:46     ` Vivek Goyal
@ 2021-10-28  4:13       ` Amir Goldstein
  2021-10-28 14:20         ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-10-28  4:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

> > you need to either include the generation in object identifier
> > or much better use the object's nfs file handle, the same way
> > that fanotify stores object identifiers.
>
> I think nfs file handle is much more complicated and its a separate
> project altogether. I am assuming we are talking about persistent
> nfs file handle as generated by host. I think biggest issue we faced
> with that is that guest is untrusted and we don't want to resolve
> file handle provided by guest on host otherwise guest can craft
> file handles and possibly be able to open other files on same filesystem
> outside shared dir.
>

Right now, virtiofsd keeps all inodes and dentries of live client inodes
pinned in cache on the server.
If you switch to a file handles design, virtiofsd only needs to keep a map
of all the file handles that the server handed out to the client to address
this concern.

For directories, this practice is not even needed for security, because
a decoded directory file handle can be verified to be within the shared dir.
It is only needed to prevent DoS, because a crafted directory file handle
(outside of shared dir) can be used to generate extra IO and thrash the
inode/dentry cache on the server.
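A sketch of that handle map idea (purely illustrative; the names are made up and the hash merely stands in for name_to_handle_at()/open_by_handle_at()):

```python
# Illustrative sketch: the server decodes only file handles that it has
# itself handed out, so a guest cannot craft handles for objects outside
# the shared dir. Hypothetical names; not virtiofsd code. The sha256 is a
# stand-in for an opaque kernel file handle.

import hashlib

class HandleMap:
    def __init__(self):
        self.issued = set()

    def encode(self, host_path):
        # real code: name_to_handle_at(); here, just derive an opaque token
        handle = hashlib.sha256(host_path.encode()).hexdigest()
        self.issued.add(handle)
        return handle

    def decode(self, handle):
        # refuse anything we did not hand out ourselves
        if handle not in self.issued:
            raise PermissionError("unknown file handle from client")
        return handle  # real code: open_by_handle_at()

hm = HandleMap()
h = hm.encode("/shared/dir/config")
assert hm.decode(h) == h
try:
    hm.decode("deadbeef")       # crafted handle is rejected
    assert False
except PermissionError:
    pass
```

The trade-off discussed below follows directly: the map lives in server memory, so the handles it vouches for do not survive a server restart.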

> >
> > > +       uint64_t mask;
> > > +       uint32_t namelen;
> > > +       uint32_t cookie;
> >
> > I object to persisting with the two-events-joined-by-cookie design.
> > Any new design should include a single event for rename
> > with information about src and dst.
> >
> > I know this is inconvenient, but we are NOT going to create a "remote inotify"
> > interface, we need to create a "remote fsnotify" interface and if server wants
> > to use inotify, it will need to join the disjoined MOVE_FROM/TO event into
> > a single "remote event", that FUSE will use to call fsnotify_move().
>
> man inotify says following.
>
> "       Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair  generated  by
>        rename(2)  is thus inherently racy.  (Don't forget that if an object is
>        renamed outside of a monitored directory, there  may  not  even  be  an
>        IN_MOVED_TO  event.)"
>
> So if guest is no monitoring target dir of renamed file, then we will not
> even get IN_MOVED_TO. In that case we can't merge two events into one.
>
> And this sounds like inotify/fanotify interface needs to come up with
> an merged event and in that case remote filesystem will simply propagate
> that event. (Instead of coming up with a new event only for remote
> filesystems. Sounds like this is not a problem limited to remote
> filesystems only).
>

I don't see it that way.
I see the "internal protocol" for filesystems/vfs to report a rename as
fsnotify_move(), which carries information about both src and target.
I would like the remote protocol to carry the same information.
It is then up to the userspace API whether to report the rename
as two events or a unified event.

For example, debugfs calls fsnotify_move() when an object is renamed
from underneath the vfs. It does not call fsnotify() twice with a cookie,
because we would not want to change local filesystems nor remote protocols
when we want to add new userspace APIs for reporting fsevents.

That comes down to my feeling that this claims to be a proposal
for "remote fsnotify", but it looks and sounds like a proposal for
"remote inotify" and this is the core of my objection to passing
the rename cookie in the protocol.

Regarding the issue that the src/dst path may not be known
to the server, as I wrote, it is fine if the src or dst path information
is omitted from an event, but the protocol should provide the
placeholders to report them.
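One possible shape for such placeholders, purely as a sketch and not a proposed FUSE ABI: a flags field records which of src/dst (and, optionally, the legacy cookie) are present, so either half of the rename can be omitted without changing the record format.

```python
# Purely illustrative wire layout for a unified rename event with optional
# src/dst names and an optional legacy cookie. This is NOT a proposed FUSE
# ABI; field names and layout are made up to show the "placeholders" idea.
import struct

HAVE_SRC, HAVE_DST, HAVE_COOKIE = 1, 2, 4
HDR = "<QQIIII"  # src_nodeid, dst_nodeid, flags, cookie, src_len, dst_len

def pack_rename(src_nodeid=0, src=None, dst_nodeid=0, dst=None, cookie=None):
    flags = 0
    sb = src.encode() if src is not None else b""
    db = dst.encode() if dst is not None else b""
    if src is not None:
        flags |= HAVE_SRC
    if dst is not None:
        flags |= HAVE_DST
    if cookie is not None:
        flags |= HAVE_COOKIE
    hdr = struct.pack(HDR, src_nodeid, dst_nodeid, flags,
                      cookie or 0, len(sb), len(db))
    return hdr + sb + db

def unpack_rename(buf):
    src_nodeid, dst_nodeid, flags, cookie, sl, dl = struct.unpack_from(HDR, buf)
    off = struct.calcsize(HDR)
    src = buf[off:off + sl].decode() if flags & HAVE_SRC else None
    dst = buf[off + sl:off + sl + dl].decode() if flags & HAVE_DST else None
    return src_nodeid, src, dst_nodeid, dst, (cookie if flags & HAVE_COOKIE else None)

# full rename: both halves known to the server
buf = pack_rename(10, "old.conf", 11, "new.conf", cookie=7)
assert unpack_rename(buf) == (10, "old.conf", 11, "new.conf", 7)
# target dir not watched: the dst side is simply omitted
buf = pack_rename(10, "old.conf")
assert unpack_rename(buf) == (10, "old.conf", 0, None, None)
```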

After sufficient arguing, I might be convinced that the cookie may be included
as an optional field in addition to the fields that I requested.

I understand why you write that this sounds like an fanotify interface
issue that needs to be resolved, and you are correct. But the reason the
fanotify interface issue has not yet been resolved is that we are trying not
to repeat the mistakes of the past, and for that same reason I am insisting
on the protocol.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27 20:24             ` Vivek Goyal
@ 2021-10-28  5:11               ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-10-28  5:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi, Steve French

On Wed, Oct 27, 2021 at 11:24 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Oct 27, 2021 at 08:59:15AM +0300, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > <iangelak@redhat.com> wrote:
> > >
> > >
> > >
> > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >>
> > >> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> > >> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >> > >
> > >> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> > >> > >
> > >> > > [..]
> > >> > > > > 3) The lifetime of the local watch in the guest kernel is very
> > >> > > > > important. Specifically, there is a possibility that the guest does not
> > >> > > > > receive remote events on time, if it removes its local watch on the
> > >> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > >> > > > > In these cases the guest kernel removes the local watch before the
> > >> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> > >> > > > > kernel drops all the remote events for the target inode (since the
> > >> > > > > corresponding local watch does not exist anymore).
> > >> > >
> > >> > > So this is one of the issues which has been haunting us in virtiofs. If
> > >> > > a file is removed, for local events, event is generated first and
> > >> > > then watch is removed. But in case of remote filesystems, it is racy.
> > >> > > It is possible that by the time event arrives, watch is already gone
> > >> > > and application never sees the delete event.
> > >> > >
> > >> > > Not sure how to address this issue.
> > >> >
> > >>
> > >> > Can you take me through the scenario step by step.
> > >> > I am not sure I understand the exact sequence of the race.
> > >>
> > >> Ioannis, please correct me If I get something wrong. You know exact
> > >> details much more than me.
> > >>
> > >> A. Say a guest process unlinks a file.
> > >> B. Fuse sends an unlink request to server (virtiofsd)
> > >> C. File is unlinked on host. Assume there are no other users so inode
> > >>    will be freed as well. And event will be generated on host and watch
> > >>    removed.
> > >> D. Now Fuse server will send a unlink request reply. unlink notification
> > >>    might still be in kernel buffers or still be in virtiofsd or could
> > >>    be in virtiofs virtqueue.
> > >> E. Fuse client will receive unlink reply and remove local watch.
> > >>
> > >> Fuse reply and notification event are now traveling in parallel on
> > >> different virtqueues and there is no connection between these two. And
> > >> it could very well happen that fuse reply comes first, gets processed
> > >> first and local watch is removed. And notification is processed right
> > >> after but by then local watch is gone and filesystem will be forced to
> > >> drop event.
> > >>
> > >> As of now situation is more complicated in virtiofsd. We don't keep
> > >> file handle open for file and keep an O_PATH fd open for each file.
> > >> That means in step D above, inode on host is not freed yet and unlink
> > >> event is not generated yet. When unlink reply reaches fuse client,
> > >> it sends FORGET messages to server, and then server closes O_PATH fd
> > >> and then host generates unlink events. By that time its too late,
> > >> guest has already remove local watches (and triggered removal of
> > >> remote watches too).
> > >>
> > >> This second problem probably can be solved by using file handles, but
> > >> basic race will still continue to be there.
> > >>
> > >> > If it is local file removal that causes watch to be removed,
> > >> > then don't drop local events and you are good to go.
> > >> > Is it something else?
> > >>
> > >> - If remote events are enabled, then idea will be that user space gets
> > >>   and event when file is actually removed from server, right? Now it
> > >>   is possible that another VM has this file open and file has not been
> > >>   yet removed. So local event only tells you that file has been removed
> > >>   in guest VM (or locally) but does not tell anything about the state
> > >>   of file on server. (It has been unlinked on server but inode continues
> > >>   to be alive internall).
> > >>
> > >> - If user receives both local and remote delete event, it will be
> > >>   confusing. I guess if we want to see both the events, then there
> > >>   has to be some sort of info in event which classifies whether event
> > >>   is local or remote. And let application act accordingly.
> > >>
> > >> Thanks
> > >> Vivek
> > >>
> > >
> > > Hello Amir!
> > >
> > > Sorry for taking part in the conversation a bit late.  Vivek was on point with the
> > > example he gave but the race is a bit more generic than only the DELETE event.
> > >
> > > Let's say that a guest process monitors an inode for OPEN events:
> > >
> > > 1) The same guest process or another guest process opens the file (related to the
> > > monitored inode), and then closes and immediately deletes the file/inode.
> > > 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
> > >      a) Will open the file on the host side and thus a remote OPEN event is going to
> > >      be generated on the host and sent to the guest.
> > >      b) Will unlink the remote inode and if no other host process uses the inode then the
> > >      inode will be freed and a DELETE event is going to be generated on the host and sent
> > >      to the guest (However, due to how virtiofsd works and Vivek mentioned, this step won't
> > >      happen immediately)
> > >
> >
> > You are confusing DELETE with DELETE_SELF.
> > DELETE corresponds to unlink(), so you get a DELETE event even if
> > inode is a hardlink
> > with nlink > 0 after unlink().
> >
> > The DELETE event is reported (along with filename) against the parent directory
> > inode, so the test case above won't drop the event.
>
> Hi Amir,
>
> Agreed that there is confusion between DELETE and DELETE_SELF events. I
> think Ioannis is referring to DELETE_SELF event. With this example he
> is trying to emphasize that due to races, problem is not limited to
> DELETE_SELF events only and other events could arrive little later
> after the local watch in guest has been removed and then all those
> events will be dropped as well. So he gave example of OPEN event. And
> I think remote IN_IGNORED might face the same fate.
>
> In the case of IN_IGNORED, I am wondering is it ok to generate that
> event locally instead.
>

Yes, that is what I meant.
Local watch removed can be used as a hint to applications that
events might have been lost.

> >
> > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > virtqueues and arrives after the guest has already deleted its local inode.
> > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > subsystem will drop it since the local inode is not there.
> > >
> >
> > I have a feeling that we are mixing issues related to shared server
> > and remote fsnotify.
> > Does virtiofsd support multiple guests already?
>
> I would like to think that there are not many basic issues with shared
> directory configuration. So we don't stop users from using it.
>
> > There are many other
> > issues related
> > to cache coherency that should be dealt with in this case, some of
> > them overlap the
> > problem that you describe, so solving the narrow problem of dropped
> > remote events
> > seems like the wrong way to approach the problem.
>
> Dropped remote event problem/race will exist even if it was not shared
> server. Remote events travel through a different virtqueue as compared
> to fuse request reply. So there is no guarantee in what order events
> or replies will processed.
>
> >
> > I think that in a shared server situation, the simple LOOKUP/FORGET protocol
> > will not suffice.
>
> Hmm.., So what's the problem with LOOKUP/FORGET in shared dir case?
>

I guess when we say the "shared dir" case, we mean different things.
I was thinking about "shared between different clients" and about the
fact that the information about the server's refcount of objects is obscured
from the client.

For example, LOOKUP_INC/DEC requests that report back the server's
shared object refcount could prevent a guest inode from being evicted
in case the inode has a remote watch.

But you were mentioning "shared dir with guest and host", so LOOKUP
refcount is not relevant in that case.

> > I will not even try to solve the generic problem,
> > but will just
> > mentioned that SMB/NFS protocols use delegations/oplocks in the protocol
> > to advertise object usage by other clients to all clients.
>
> I think Miklos was looking into the idea of some sort of file leases
> on fuse for creating equivalent of delegations. Not sure if that work
> made any progress.
>
> >
> > I think this issue is far outside the scope of your project and you should
> > just leave the dropped events as a known limitation at this point.
>
> May be that's what we should do to begin with and just say these events
> can be lost or never arrive.
>
> > inotify has the event IN_IGNORED that application can use as a hint that some
> > events could have been dropped.
>
> That probably will require generating IN_IGNORED locally when local watch
> goes away (and not rely on remote IN_IGNORED), IIUC.
>

Right.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE
  2021-10-28  4:13       ` Amir Goldstein
@ 2021-10-28 14:20         ` Vivek Goyal
  0 siblings, 0 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-10-28 14:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

On Thu, Oct 28, 2021 at 07:13:10AM +0300, Amir Goldstein wrote:
> > > you need to either include the generation in object identifier
> > > or much better use the object's nfs file handle, the same way
> > > that fanotify stores object identifiers.
> >
> > I think nfs file handle is much more complicated and its a separate
> > project altogether. I am assuming we are talking about persistent
> > nfs file handle as generated by host. I think biggest issue we faced
> > with that is that guest is untrusted and we don't want to resolve
> > file handle provided by guest on host otherwise guest can craft
> > file handles and possibly be able to open other files on same filesystem
> > outside shared dir.
> >
> 
> Right now, virtiofsd keeps all inodes and dentries of live client inodes
> pinned in cache on the server.
> If you switch to file handles design, virtiofsd only need to keep a map
> of all the file handles that server handed out to client to address
> this concern.

Keeping a map of all the file handles the server handed out to the client
should work. But that means these file handles are not persistent and can't
be used after a virtiofsd restart (virtiofsd will lose its map over a
restart). And that means one can not pass these file handles to user space.
And that rules out giving file handles to user space as part of the fanotify
API, IIUC.

So in practice it is as good as the (nodeid, generation) solution. And using
file handles means giving CAP_DAC_READ_SEARCH to the virtiofsd daemon. We
would really like to run virtiofsd unprivileged (in a user namespace) and be
able to support as many features as possible.

> 
> For directories, this practice is not even needed for security, because
> a decoded directory file handle can be verified to be within the shared dir.
> It is only needed to prevent DoS, because a crafted directory file handle
> (outside of shared dir) can be used to generate extra IO and thrash the
> inode/dentry cache on the server.

Good to know that for directories only DoS is a concern.

> 
> > >
> > > > +       uint64_t mask;
> > > > +       uint32_t namelen;
> > > > +       uint32_t cookie;
> > >
> > > I object to persisting with the two-events-joined-by-cookie design.
> > > Any new design should include a single event for rename
> > > with information about src and dst.
> > >
> > > I know this is inconvenient, but we are NOT going to create a "remote inotify"
> > > interface, we need to create a "remote fsnotify" interface and if server wants
> > > to use inotify, it will need to join the disjoined MOVE_FROM/TO event into
> > > a single "remote event", that FUSE will use to call fsnotify_move().
> >
> > man inotify says following.
> >
> > "       Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair  generated  by
> >        rename(2)  is thus inherently racy.  (Don't forget that if an object is
> >        renamed outside of a monitored directory, there  may  not  even  be  an
> >        IN_MOVED_TO  event.)"
> >
> > So if guest is no monitoring target dir of renamed file, then we will not
> > even get IN_MOVED_TO. In that case we can't merge two events into one.
> >
> > And this sounds like inotify/fanotify interface needs to come up with
> > an merged event and in that case remote filesystem will simply propagate
> > that event. (Instead of coming up with a new event only for remote
> > filesystems. Sounds like this is not a problem limited to remote
> > filesystems only).
> >
> 
> I don't see it that way.
> I see the "internal protocol" for filesystems/vfs to report rename is
> fsnotify_move() which carries information about both src and target.
> I would like the remote protocol  to carry the same information.
> It is then up to the userspace API whether to report the rename
> as two events or a unified event.

Sure, not a bad idea, and allowing a "cookie" does not stop remote
filesystems from reporting a single rename event.

In this case virtiofs is just a passthrough filesystem. It simply reports
back what the underlying filesystem is reporting. So if the inotify interface
reports two events, it will simply send those. In fact old users will
expect that.

I understand that you don't like two separate events and hence want
to kill the cookie. But how will we fix that when the underlying inotify API
reports two separate events? If we try to merge them, there is no
guarantee that we can do so, because we might get only one event,
as we might not be watching both the source and target directories. We place
watches only as specified by the client.

> 
> For example, debugfs, calls fsnotify_move() when an object is renamed
> from underneath the vfs. It does not call fsnotify() twice with a cookie,
> because we would not want to change local filesystem nor remote protocols
> when we want to add new userspace APIs for reporting fsevents.
> 
> That comes down to my feeling that this claims to be a proposal
> for "remote fsnotify", but it looks and sounds like a proposal for
> "remote inotify" and this is the core of my objection to passing
> the rename cookie in the protocol.

We are starting with remote inotify support and hoping it can be extended
to remote fanotify in a limited feature form. So we want to make the
modifications in such a way that one can easily extend them for fanotify as
well.

Now the question is, should there be separate fuse commands for inotify
and fanotify requests/events, or should we try to merge the two and
design fuse_notify_fsnotify_out{} and fuse_notify_fsnotify_in{} in such
a way that they could support both inotify/fanotify. I guess the latter
makes more sense. Just that it will be more complicated as well, because
the fanotify API can send much more information to user space: file
handles, "fd", "pid", etc.

I think "pid" might not make much sense in the context of a remote
filesystem. In the case of virtiofsd, it will most likely simply be the pid
of a virtiofsd instance, which does not mean anything to a guest process. In
fact, there is a chance that it could be the same pid as that of a guest
process.

> 
> Regarding the issue that the src/dst path may not be known
> to the server, as I wrote, it is fine if either the src/dst path information
> is omitted from an event, but the protocol should provide the
> placeholder to report them.

This kind of makes more sense. The protocol should have the capability to
report a rename event with both src/dst paths to future proof it. That way,
when the inotify/fanotify APIs start reporting a single rename event, we
will be able to simply send it back to the client without any fuse protocol
modifications.

One risk of adding space for src/dst path info in fuse_notify_fsnotify_out{}
is that we don't know how this information will end up looking when
inotify/fanotify actually start supporting a single rename event. It is
possible that we add some fields now and the final single event format looks
different, and then we have unused fields.

> 
> After sufficient arguing, I might be convinced that the cookie may be included
> as an optional field in addition to the fields that I requested.

But we should also allow the cookie, so that the older inotify API continues
to be supported.
> 
> I understand why you write that this sounds like an fanotify interface
> that needs to be resolved and you are correct, but the reason that the
> fanotify interface issue was not yet resolved is that we are trying to not
> repeat the mistakes of the past and for that same reason, I am insisting
> on the protocol.

Right now we are writing protocol fields based on what inotify (and
possibly fanotify) support. But if you want to extend it further
so that it can report something which you intend to introduce in the future,
I think that's fine too.

So how will the additional fields look in order to support this single
rename event?

Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation
  2021-10-26 15:06   ` Amir Goldstein
@ 2021-11-01 17:49     ` Vivek Goyal
  2021-11-02  7:34       ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-11-01 17:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

On Tue, Oct 26, 2021 at 06:06:15PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <iangelak@redhat.com> wrote:
> >
> > Every time a local watch is placed/modified/removed on/from an inode the
> > same operation has to take place in the FUSE server.
> >
> > Thus add the inode operation "fuse_fsnotify_update_mark", which is
> > specific to FUSE inodes. This operation is called from the
> > "inotify_add_watch" system call in the inotify subsystem.
> >
> > Specifically, the operation is called when a process tries to add, modify
> > or remove a watch from a FUSE inode and the remote fsnotify support is
> > enabled both in the guest kernel and the FUSE server (virtiofsd).
> >
> > Essentially, when the kernel adds/modifies a watch locally, also send a
> > fsnotify request to the FUSE server to do the same. We keep the local watch
> > placement since it is essential for the functionality of the fsnotify
> > notification subsystem. However, the local events generated by the guest
> > kernel will be suppressed if they affect FUSE inodes and the remote
> > fsnotify support is enabled.
> >
> > Also modify the "fsnotify_detach_mark" function in fs/notify/mark.c to add
> > support for the remote deletion of watches for FUSE inodes. In contrast to
> > the add/modify operation we do not modify the inotify subsystem, but the
> > fsnotify subsystem. That is because there are two ways of deleting a watch
> > from an inode. The first is by manually calling the "inotify_rm_watch"
> > system call and the second is automatically by the kernel when the process
> > that has created an inotify instance exits. In that case the kernel is
> > responsible for deleting all the watches corresponding to said inotify
> > instance.
> >
> > Thus we send the fsnotify request for the deletion of the remote watch at
> > the lowest level within "fsnotify_detach_mark" to catch both watch removal
> > cases.
> >
> > The "fuse_fsnotify_update_mark" function in turn calls the
> > "fuse_fsnotify_send_request" function, to send an fsnotify request to the
> > FUSE server related to an inode watch.
> >
> >
> > Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> > ---
> >  fs/fuse/dir.c                    | 29 +++++++++++++++++++++++++++++
> >  fs/notify/inotify/inotify_user.c | 11 +++++++++++
> >  fs/notify/mark.c                 | 10 ++++++++++
> >  include/linux/fs.h               |  2 ++
> >  4 files changed, 52 insertions(+)
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index d9b977c0f38d..f666aafc8d3f 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -17,6 +17,8 @@
> >  #include <linux/xattr.h>
> >  #include <linux/iversion.h>
> >  #include <linux/posix_acl.h>
> > +#include <linux/fsnotify_backend.h>
> > +#include <linux/inotify.h>
> >
> >  static void fuse_advise_use_readdirplus(struct inode *dir)
> >  {
> > @@ -1805,6 +1807,30 @@ static int fuse_getattr(struct user_namespace *mnt_userns,
> >         return fuse_update_get_attr(inode, NULL, stat, request_mask, flags);
> >  }
> >
> > +static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
> > +                                    uint64_t group, uint32_t mask)
> > +{
> > +       /*
> > +        * We have to remove the bits added to the mask before being attached
> > +        * or detached to the inode, since these bits are going to be
> > +        * added by the "remote" host kernel. If these bits were still enabled
> > +        * in the mask that was sent to the "remote" kernel then the watch would
> > +        * be rejected as an unsupported value. These bits are added by the
> > +        * fsnotify subsystem thus we use the corresponding fsnotify bits here.
> > +        */
> > +       mask = mask & ~(FS_IN_IGNORED | FS_UNMOUNT | FS_IN_ONESHOT |
> > +                       FS_EXCL_UNLINK | FS_EVENT_ON_CHILD);
> > +
> > +       if (!(mask & IN_ALL_EVENTS))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Action 0: Remove a watch
> > +        * Action 1: Add/Modify watch
> > +        */
> > +       return fuse_fsnotify_send_request(inode, mask, action, group);
> > +}
> > +
> >  static const struct inode_operations fuse_dir_inode_operations = {
> >         .lookup         = fuse_lookup,
> >         .mkdir          = fuse_mkdir,
> > @@ -1824,6 +1850,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
> >         .set_acl        = fuse_set_acl,
> >         .fileattr_get   = fuse_fileattr_get,
> >         .fileattr_set   = fuse_fileattr_set,
> > +       .fsnotify_update = fuse_fsnotify_update_mark,
> >  };
> >
> >  static const struct file_operations fuse_dir_operations = {
> > @@ -1846,6 +1873,7 @@ static const struct inode_operations fuse_common_inode_operations = {
> >         .set_acl        = fuse_set_acl,
> >         .fileattr_get   = fuse_fileattr_get,
> >         .fileattr_set   = fuse_fileattr_set,
> > +       .fsnotify_update = fuse_fsnotify_update_mark,
> >  };
> >
> >  static const struct inode_operations fuse_symlink_inode_operations = {
> > @@ -1853,6 +1881,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
> >         .get_link       = fuse_get_link,
> >         .getattr        = fuse_getattr,
> >         .listxattr      = fuse_listxattr,
> > +       .fsnotify_update = fuse_fsnotify_update_mark,
> >  };
> >
> >  void fuse_init_common(struct inode *inode)
> > diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> > index 62051247f6d2..3a0fee09a7c3 100644
> > --- a/fs/notify/inotify/inotify_user.c
> > +++ b/fs/notify/inotify/inotify_user.c
> > @@ -46,6 +46,8 @@
> >  #define INOTIFY_WATCH_COST     (sizeof(struct inotify_inode_mark) + \
> >                                  2 * sizeof(struct inode))
> >
> > +#define FSNOTIFY_ADD_MODIFY_MARK       1
> > +
> >  /* configurable via /proc/sys/fs/inotify/ */
> >  static int inotify_max_queued_events __read_mostly;
> >
> > @@ -764,6 +766,15 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
> >
> >         /* create/update an inode mark */
> >         ret = inotify_update_watch(group, inode, mask);
> > +       /*
> > +        * If the inode belongs to a remote filesystem/server that supports
> > +        * remote inotify events then send the mark to the remote server
> > +        */
> > +       if (ret >= 0 && inode->i_op->fsnotify_update) {
> > +               inode->i_op->fsnotify_update(inode,
> > +                                            FSNOTIFY_ADD_MODIFY_MARK,
> > +                                            (uint64_t)group, mask);
> > +       }
> >         path_put(&path);
> >  fput_and_out:
> >         fdput(f);
> > diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> > index fa1d99101f89..f0d37276afcb 100644
> > --- a/fs/notify/mark.c
> > +++ b/fs/notify/mark.c
> > @@ -77,6 +77,7 @@
> >  #include "fsnotify.h"
> >
> >  #define FSNOTIFY_REAPER_DELAY  (1)     /* 1 jiffy */
> > +#define FSNOTIFY_DELETE_MARK 0   /* Delete a mark in remote fsnotify */
> 
> This define is part of the vfs API it should be in an include file along side
> FSNOTIFY_ADD_MODIFY_MARK (if we keep them in the API).
> 
> >
> >  struct srcu_struct fsnotify_mark_srcu;
> >  struct kmem_cache *fsnotify_mark_connector_cachep;
> > @@ -399,6 +400,7 @@ void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info)
> >  void fsnotify_detach_mark(struct fsnotify_mark *mark)
> >  {
> >         struct fsnotify_group *group = mark->group;
> > +       struct inode *inode = NULL;
> >
> >         WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
> >         WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
> > @@ -411,6 +413,14 @@ void fsnotify_detach_mark(struct fsnotify_mark *mark)
> >                 spin_unlock(&mark->lock);
> >                 return;
> >         }
> > +
> > +       /* Only if the object is an inode send a request to FUSE server */
> > +       inode = fsnotify_conn_inode(mark->connector);
> > +       if (inode && inode->i_op->fsnotify_update) {
> > +               inode->i_op->fsnotify_update(inode, FSNOTIFY_DELETE_MARK,
> > +                                            (uint64_t)group, mark->mask);
> > +       }
> > +
> >         mark->flags &= ~FSNOTIFY_MARK_FLAG_ATTACHED;
> >         list_del_init(&mark->g_list);
> >         spin_unlock(&mark->lock);
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e7a633353fd2..86bcc44e3ab8 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2149,6 +2149,8 @@ struct inode_operations {
> >         int (*fileattr_set)(struct user_namespace *mnt_userns,
> >                             struct dentry *dentry, struct fileattr *fa);
> >         int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
> > +       int (*fsnotify_update)(struct inode *inode, uint32_t action,
> > +                              uint64_t group, uint32_t mask);
> >  } ____cacheline_aligned;
> >
> 
> Please split the patch that introduces the API from the FUSE implementation.
> 
> Regarding the API, group does not belong in this interface.
> The inode object has an "aggregated mask" at i_fsnotify_mask
> indicating an interest for an event from any group.
> Remote servers should be notified when the aggregated mask changes.

Is the aggregated mask updated when nobody is watching for that event? I
am not familiar with the code yet but Ioannis says that it does not
seem to happen. That means once an event is being watched, it will
continue to be generated on server and will continue to travel to
guest and ultimately guest will drop that event because no application
is watching.

If the number of events generated is significant, it is a problem. Storing
and forwarding so many events can consume a significant amount of memory
and CPU, and can also cause some of the events to be dropped.

Having said that, probably the right fix is in the fsnotify API, so that the
"aggregated mask" is updated to remove events as well. Then remote file
systems can update the server's aggregated mask too.

> 
> Hence, Miklos has proposed a "remote fsnotify update" API which does
> not carry the mask nor the action, only the watched object:
> https://lore.kernel.org/linux-fsdevel/20190501205541.GC30899@veci.piliscsaba.redhat.com/
> 
> On that same thread, you will see that I also proposed the API to support
> full filesystem watch (by passing sb).

When you say API, you just mean the signature of the inode operation
function, say ->notify_update(). If we were to support "sb" later, updating
the inode operations should be easy. But I get the point that, from the VFS
API point of view, we should pass the whole inode to filesystems.

Thanks
Vivek

> I am not requiring that you implement sb watch for FUSE/virtiofs, but the
> API should take this future extension into account.
> 
> Thanks,
> Amir.
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation
  2021-11-01 17:49     ` Vivek Goyal
@ 2021-11-02  7:34       ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-11-02  7:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Jan Kara, Al Viro, Miklos Szeredi

On Mon, Nov 1, 2021 at 7:49 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Oct 26, 2021 at 06:06:15PM +0300, Amir Goldstein wrote:
> > On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> > <iangelak@redhat.com> wrote:
> > >
> > > Every time a local watch is placed/modified/removed on/from an inode the
> > > same operation has to take place in the FUSE server.
> > >
> > > Thus add the inode operation "fuse_fsnotify_update_mark", which is
> > > specific to FUSE inodes. This operation is called from the
> > > "inotify_add_watch" system call in the inotify subsystem.
> > >
> > > Specifically, the operation is called when a process tries to add, modify
> > > or remove a watch from a FUSE inode and the remote fsnotify support is
> > > enabled both in the guest kernel and the FUSE server (virtiofsd).
> > >
> > > Essentially, when the kernel adds/modifies a watch locally, also send a
> > > fsnotify request to the FUSE server to do the same. We keep the local watch
> > > placement since it is essential for the functionality of the fsnotify
> > > notification subsystem. However, the local events generated by the guest
> > > kernel will be suppressed if they affect FUSE inodes and the remote
> > > fsnotify support is enabled.
> > >
> > > Also modify the "fsnotify_detach_mark" function in fs/notify/mark.c to add
> > > support for the remote deletion of watches for FUSE inodes. In contrast to
> > > the add/modify operation we do not modify the inotify subsystem, but the
> > > fsnotify subsystem. That is because there are two ways of deleting a watch
> > > from an inode. The first is by manually calling the "inotify_rm_watch"
> > > system call and the second is automatically by the kernel when the process
> > > that has created an inotify instance exits. In that case the kernel is
> > > responsible for deleting all the watches corresponding to said inotify
> > > instance.
> > >
> > > Thus we send the fsnotify request for the deletion of the remote watch at
> > > the lowest level within "fsnotify_detach_mark" to catch both watch removal
> > > cases.
> > >
> > > The "fuse_fsnotify_update_mark" function in turn calls the
> > > "fuse_fsnotify_send_request" function, to send an fsnotify request to the
> > > FUSE server related to an inode watch.
> > >
> > >
> > > Signed-off-by: Ioannis Angelakopoulos <iangelak@redhat.com>
> > > ---
> > >  fs/fuse/dir.c                    | 29 +++++++++++++++++++++++++++++
> > >  fs/notify/inotify/inotify_user.c | 11 +++++++++++
> > >  fs/notify/mark.c                 | 10 ++++++++++
> > >  include/linux/fs.h               |  2 ++
> > >  4 files changed, 52 insertions(+)
> > >
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index d9b977c0f38d..f666aafc8d3f 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -17,6 +17,8 @@
> > >  #include <linux/xattr.h>
> > >  #include <linux/iversion.h>
> > >  #include <linux/posix_acl.h>
> > > +#include <linux/fsnotify_backend.h>
> > > +#include <linux/inotify.h>
> > >
> > >  static void fuse_advise_use_readdirplus(struct inode *dir)
> > >  {
> > > @@ -1805,6 +1807,30 @@ static int fuse_getattr(struct user_namespace *mnt_userns,
> > >         return fuse_update_get_attr(inode, NULL, stat, request_mask, flags);
> > >  }
> > >
> > > +static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
> > > +                                    uint64_t group, uint32_t mask)
> > > +{
> > > +       /*
> > > +        * We have to remove the bits added to the mask before being attached
> > > +        * or detached to the inode, since these bits are going to be
> > > +        * added by the "remote" host kernel. If these bits were still enabled
> > > +        * in the mask that was sent to the "remote" kernel then the watch would
> > > +        * be rejected as an unsupported value. These bits are added by the
> > > +        * fsnotify subsystem thus we use the corresponding fsnotify bits here.
> > > +        */
> > > +       mask = mask & ~(FS_IN_IGNORED | FS_UNMOUNT | FS_IN_ONESHOT |
> > > +                       FS_EXCL_UNLINK | FS_EVENT_ON_CHILD);
> > > +
> > > +       if (!(mask & IN_ALL_EVENTS))
> > > +               return -EINVAL;
> > > +
> > > +       /*
> > > +        * Action 0: Remove a watch
> > > +        * Action 1: Add/Modify watch
> > > +        */
> > > +       return fuse_fsnotify_send_request(inode, mask, action, group);
> > > +}
> > > +
> > >  static const struct inode_operations fuse_dir_inode_operations = {
> > >         .lookup         = fuse_lookup,
> > >         .mkdir          = fuse_mkdir,
> > > @@ -1824,6 +1850,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
> > >         .set_acl        = fuse_set_acl,
> > >         .fileattr_get   = fuse_fileattr_get,
> > >         .fileattr_set   = fuse_fileattr_set,
> > > +       .fsnotify_update = fuse_fsnotify_update_mark,
> > >  };
> > >
> > >  static const struct file_operations fuse_dir_operations = {
> > > @@ -1846,6 +1873,7 @@ static const struct inode_operations fuse_common_inode_operations = {
> > >         .set_acl        = fuse_set_acl,
> > >         .fileattr_get   = fuse_fileattr_get,
> > >         .fileattr_set   = fuse_fileattr_set,
> > > +       .fsnotify_update = fuse_fsnotify_update_mark,
> > >  };
> > >
> > >  static const struct inode_operations fuse_symlink_inode_operations = {
> > > @@ -1853,6 +1881,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
> > >         .get_link       = fuse_get_link,
> > >         .getattr        = fuse_getattr,
> > >         .listxattr      = fuse_listxattr,
> > > +       .fsnotify_update = fuse_fsnotify_update_mark,
> > >  };
> > >
> > >  void fuse_init_common(struct inode *inode)
> > > diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> > > index 62051247f6d2..3a0fee09a7c3 100644
> > > --- a/fs/notify/inotify/inotify_user.c
> > > +++ b/fs/notify/inotify/inotify_user.c
> > > @@ -46,6 +46,8 @@
> > >  #define INOTIFY_WATCH_COST     (sizeof(struct inotify_inode_mark) + \
> > >                                  2 * sizeof(struct inode))
> > >
> > > +#define FSNOTIFY_ADD_MODIFY_MARK       1
> > > +
> > >  /* configurable via /proc/sys/fs/inotify/ */
> > >  static int inotify_max_queued_events __read_mostly;
> > >
> > > @@ -764,6 +766,15 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
> > >
> > >         /* create/update an inode mark */
> > >         ret = inotify_update_watch(group, inode, mask);
> > > +       /*
> > > +        * If the inode belongs to a remote filesystem/server that supports
> > > +        * remote inotify events then send the mark to the remote server
> > > +        */
> > > +       if (ret >= 0 && inode->i_op->fsnotify_update) {
> > > +               inode->i_op->fsnotify_update(inode,
> > > +                                            FSNOTIFY_ADD_MODIFY_MARK,
> > > +                                            (uint64_t)group, mask);
> > > +       }
> > >         path_put(&path);
> > >  fput_and_out:
> > >         fdput(f);
> > > diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> > > index fa1d99101f89..f0d37276afcb 100644
> > > --- a/fs/notify/mark.c
> > > +++ b/fs/notify/mark.c
> > > @@ -77,6 +77,7 @@
> > >  #include "fsnotify.h"
> > >
> > >  #define FSNOTIFY_REAPER_DELAY  (1)     /* 1 jiffy */
> > > +#define FSNOTIFY_DELETE_MARK 0   /* Delete a mark in remote fsnotify */
> >
> > This define is part of the vfs API it should be in an include file along side
> > FSNOTIFY_ADD_MODIFY_MARK (if we keep them in the API).
> >
> > > [...]
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index e7a633353fd2..86bcc44e3ab8 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2149,6 +2149,8 @@ struct inode_operations {
> > >         int (*fileattr_set)(struct user_namespace *mnt_userns,
> > >                             struct dentry *dentry, struct fileattr *fa);
> > >         int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
> > > +       int (*fsnotify_update)(struct inode *inode, uint32_t action,
> > > +                              uint64_t group, uint32_t mask);
> > >  } ____cacheline_aligned;
> > >
> >
> > Please split the patch that introduces the API from the FUSE implementation.
> >
> > Regarding the API, group does not belong in this interface.
> > The inode object has an "aggregated mask" at i_fsnotify_mask
> > indicating an interest for an event from any group.
> > Remote servers should be notified when the aggregated mask changes.
>
> Is the aggregated mask updated when nobody is watching for that event? I

It should; that's the whole point of i_fsnotify_mask:
to optimize at the lowest level possible and do nothing if no subscriber is
interested in event X.

> am not familiar with the code yet but Ioannis says that it does not
> seem to happen.

Details please.
The object interest mask is updated in __fsnotify_recalc_mask()
and fsnotify_detach_connector_from_object().

For inotify those calls should be coming from:
inotify_rm_watch() => (final) fsnotify_put_mark()
Or from:
inotify_release() => fsnotify_destroy_group() =>
  fsnotify_clear_marks_by_group()

> That means once an event is being watched, it will
> continue to be generated on server and will continue to travel to
> guest and ultimately guest will drop that event because no application
> is watching.
>

There is always going to be some lag between a process losing interest in
an event and the producer getting notified of the loss of interest, but the
API/protocol is supposed to publish the interest of subscribers to publishers.

> If number of events generated are significant, it is a problem. Storing
> and forwarding so many events can consume significant amount of memory
> and cpu and also trigger dropping of some of the events.
>
> Having said that, probably right fix is in fsnotify API so that
> "aggregated mask" is updated  to remove events as well. And then
> remote file systems can update server aggregated mask too.
>

Yes, if it is broken, it definitely needs to be fixed.
At this time, I am not sure if this is the case.

> >
> > Hence, Miklos has proposed a "remote fsnotify update" API which does
> > not carry the mask nor the action, only the watched object:
> > https://lore.kernel.org/linux-fsdevel/20190501205541.GC30899@veci.piliscsaba.redhat.com/
> >
> > On that same thread, you will see that I also proposed the API to support
> > full filesystem watch (by passing sb).
>
> When you say API, you just mean signature of inode operation function.
> Say ->notify_update(). If we were to support "sb" later, updating
> inode operations should be easy. But I get the point that from VFS
> API point of view, pass whole inode to filesystems.

Right.
This is an internal fs/vfs API, but it is still a nuisance to create an operation
and add another one later, or to change the signature in all filesystems,
although in this case it's probably only fuse, so it is less of a problem.
TBH, I think this API might be a better fit for super_operations and
I don't think that we need an inode_operation as well:

      int (*fsnotify_update)(struct super_block *sb, struct inode *inode,
                                        uint32_t action, uint32_t mask);

I only left 'action' as a hint, because maybe the filesystem wants to
treat additions to the mask differently than removals; but it is optional,
or rather a generic UPDATE value could be used when the nature of the
mask change is unknown.

'sb' is mandatory because it will determine to which server the
mask update is intended.

'inode' is optional, where a NULL value could be interpreted as
"any inode" or "global filesystem watch mask".

In my view, the remote filesystem is the event generator; 'inode'
and 'mask' are just two parameters of event filters.
The view that perceives an 'inode' as the event generator matches
the view of an "inotify watch" as the source of events, which is the
core of my objection.

Keep in mind that virtiofsd (and any FUSE filesystem) can implement
the remote fsnotify backend without any host OS support for fs notifications,
by publishing to interested guests the operations performed by other guests.
The server only needs host OS support for publishing to guests the changes
performed by the host.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-10-27 20:29               ` Vivek Goyal
  2021-10-27 20:37                 ` Vivek Goyal
@ 2021-11-02 11:09                 ` Jan Kara
  2021-11-02 12:54                   ` Amir Goldstein
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Kara @ 2021-11-02 11:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Amir Goldstein, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > <iangelak@redhat.com> wrote:
> > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > subsystem will drop it since the local inode is not there.
> > > >
> > > 
> > > I have a feeling that we are mixing issues related to shared server
> > > and remote fsnotify.
> > 
> > I don't think Ioannis was speaking about shared server case here. I think
> > he says that in a simple FUSE remote notification setup we can lose OPEN
> > events (or basically any other) if the inode for which the event happens
> > gets deleted sufficiently early after the event being generated. That seems
> > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > some directory operations.
> 
> Hi Jan,
> 
> Agreed. That's what Ioannis is trying to say. That some of the remote events
> can be lost if fuse/guest local inode is unlinked. I think problem exists
> both for shared and non-shared directory case.
> 
> With local filesystems we have a control that we can first queue up
> the event in buffer before we remove local watches. With events travelling
> from a remote server, there is no such control/synchronization. It can
> very well happen that events got delayed in the communication path
> somewhere and local watches went away and now there is no way to 
> deliver those events to the application.

So after thinking for some time about this I have the following question
about the architecture of this solution: Why do you actually have local
fsnotify watches at all? They seem to cause quite some trouble... I mean
cannot we have fsnotify marks only on FUSE server and generate all events
there? When e.g. file is created from the client, client tells the server
about creation, the server performs the creation which generates the
fsnotify event, which is received by the server and forwarded back to the
client which just queues it into notification group's queue for userspace
to read it.

Now with this architecture there's no problem with duplicate events for
local & server notification marks; similarly, there's no problem with lost
events after inode deletion, because events received by the client are
directly queued into the notification queue without any checking of whether
the inode is still alive, etc. Would this work or am I missing something?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-02 11:09                 ` Jan Kara
@ 2021-11-02 12:54                   ` Amir Goldstein
  2021-11-02 20:34                     ` Vivek Goyal
  2021-11-03 10:09                     ` Jan Kara
  0 siblings, 2 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-11-02 12:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > > [...]
> > >
> > > I don't think Ioannis was speaking about shared server case here. I think
> > > he says that in a simple FUSE remote notification setup we can lose OPEN
> > > events (or basically any other) if the inode for which the event happens
> > > gets deleted sufficiently early after the event being generated. That seems
> > > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > > some directory operations.
> >
> > Hi Jan,
> >
> > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > both for shared and non-shared directory case.
> >
> > With local filesystems we have a control that we can first queue up
> > the event in buffer before we remove local watches. With events travelling
> > from a remote server, there is no such control/synchronization. It can
> > very well happen that events got delayed in the communication path
> > somewhere and local watches went away and now there is no way to
> > deliver those events to the application.
>
> So after thinking for some time about this I have the following question
> about the architecture of this solution: Why do you actually have local
> fsnotify watches at all? They seem to cause quite some trouble... I mean
> cannot we have fsnotify marks only on FUSE server and generate all events
> there? When e.g. file is created from the client, client tells the server
> about creation, the server performs the creation which generates the
> > fsnotify event, that is received by the server and forwarded back to the
> client which just queues it into notification group's queue for userspace
> to read it.
>
> Now with this architecture there's no problem with duplicate events for
> local & server notification marks, similarly there's no problem with lost
> events after inode deletion because events received by the client are
> directly queued into notification queue without any checking whether inode
> is still alive etc. Would this work or am I missing something?
>

What about group #1 that wants mask A and group #2 that wants mask B
events?

Do you propose to maintain separate event queues over the protocol?
Attach a "recipient list" to each event?

I just don't see how this can scale other than:
- Local marks and connectors manage the subscriptions on local machine
- Protocol updates the server with the combined masks for watched objects

I think that the "post-mortem events" issue could be solved by keeping an
S_DEAD fuse inode object in limbo just for the mark.
When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
an inode, the fuse client inode can be finally evicted.
I haven't tried to see how hard that would be to implement.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-02 12:54                   ` Amir Goldstein
@ 2021-11-02 20:34                     ` Vivek Goyal
  2021-11-03  7:31                       ` Amir Goldstein
  2021-11-03 10:09                     ` Jan Kara
  1 sibling, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-11-02 20:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

On Tue, Nov 02, 2021 at 02:54:01PM +0200, Amir Goldstein wrote:
> On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > > [...]
> > >
> > > Hi Jan,
> > >
> > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > both for shared and non-shared directory case.
> > >
> > > With local filesystems we have a control that we can first queue up
> > > the event in buffer before we remove local watches. With events travelling
> > > from a remote server, there is no such control/synchronization. It can
> > > very well happen that events got delayed in the communication path
> > > somewhere and local watches went away and now there is no way to
> > > deliver those events to the application.
> >
> > So after thinking for some time about this I have the following question
> > about the architecture of this solution: Why do you actually have local
> > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > cannot we have fsnotify marks only on FUSE server and generate all events
> > there?

I think currently we are already implementing this part of the proposal.
We are sending a "group" pointer to the server while updating a watch, and
the server is managing watches per inode per group. IOW, if the client has
group1 wanting mask A and group2 wanting mask B, then the server is going
to add two watches with two masks on the same inotify fd instance.

Admittedly we did this because we did not know that an aggregate mask
exists. Using an aggregate mask in the guest kernel, with the server
putting a single watch on that inode based on the aggregate mask, would
simplify the server implementation.

One downside of this approach is more complexity on the server. More
events will also travel from host to guest. If two groups are watching
the same events on the same inode, then I think two copies of each event
will travel to the guest: one for group1 and one for group2 (as we have
separate watches on the host). If we used an aggregate watch on the
host, only one event would travel between host and guest, and I am
assuming the same event could then be replicated among the multiple
groups, depending on their interest.

> When e.g. file is created from the client, client tells the server
> > about creation, the server performs the creation which generates the
> > fsnotify event, that is received by the server and forwarded back to the
> > client which just queues it into notification group's queue for userspace
> > to read it.

This is the part we have not implemented. When we generate the event,
we just generate it for the inode. There is no notion that the event was
generated for a specific group within this inode. As of now that's left
to the local fsnotify code to manage and figure out.

So the idea is that when an event arrives from the remote side, we queue
it up directly into the group (without having to worry about the inode).
Hmm, how do we do that? That means we need to return at least the group
identifier in the notification event so that the client can find the
group (without having to worry about the inode?).

So the group will basically become part of the remote protocol if we
were to go this way. And the implementation becomes more complicated.

> >
> > Now with this architecture there's no problem with duplicate events for
> > local & server notification marks,

I guess suppressing local events is really trivial. Either we have that
inode flag Amir suggested or have an inode operation to let the file
system decide.

> similarly there's no problem with lost
> > events after inode deletion because events received by the client are
> > directly queued into notification queue without any checking whether inode
> > is still alive etc. Would this work or am I missing something?


So when will the watches on the remote side go away? When a file is
unlinked and the inode is going away we call fsnotify_inoderemove().
This generates FS_DELETE_SELF and then seems to remove all local marks
on the inode.

Now if we don't have local marks and the guest inode is going away, and
the client sends a FUSE_FORGET message, I am assuming that will be the
time to clean up all the remote watches and groups etc. And if the file
is open in some other guest, the DELETE_SELF event will not have been
generated by then and we will have cleaned up the remote watches.

Even if the other guest did not have the file open, cleanup of remote
watches and DELETE_SELF delivery will be parallel operations and can be
racy. If the thread reading the inotify descriptor gets a little late
in reading DELETE_SELF, it is possible another thread in virtiofsd has
already cleaned up all remote watches and associated groups. And now
there is no way to send the event back to the guest, and we lose the
event?

My understanding of this notification magic is very primitive, so it is
very much possible I am misunderstanding how remote watches and groups
will be managed and reported back. But my current assumption is that
their lifetime will have to be tied to the remote inode we are managing.
Otherwise, when will the remote server clean up its own internal state
(watch descriptors) once the inode goes away?

> >
> 
> What about group #1 that wants mask A and group #2 that wants mask B
> events?
> 
> Do you propose to maintain separate event queues over the protocol?
> Attach a "recipient list" to each event?
> 
> I just don't see how this can scale other than:
> - Local marks and connectors manage the subscriptions on local machine
> - Protocol updates the server with the combined masks for watched objects
> 
> I think that the "post-mortem events" issue could be solved by keeping an
> S_DEAD fuse inode object in limbo just for the mark.
> When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> an inode, the fuse client inode can be finally evicted.

There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come,
or when it will come. If another guest has a reference on the inode it
might not come for a long time. And this will kind of become a mechanism
for one guest to keep another guest's inode cache full of such objects.

If the event queue becomes too full, we might drop these events. But I
guess in that case we will have to generate IN_Q_OVERFLOW, and that can
somehow be used to clean up such S_DEAD inodes?

nodeid is managed by server. So I am assuming that FORGET messages will
not be sent to server for this inode till we have seen FS_IN_IGNORED
and FS_DELETE_SELF events?

Thanks
Vivek

> I haven't tried to see how hard that would be to implement.
> 
> Thanks,
> Amir.
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-02 20:34                     ` Vivek Goyal
@ 2021-11-03  7:31                       ` Amir Goldstein
  2021-11-03 22:29                         ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-11-03  7:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

> > >
> >
> > What about group #1 that wants mask A and group #2 that wants mask B
> > events?
> >
> > Do you propose to maintain separate event queues over the protocol?
> > Attach a "recipient list" to each event?
> >
> > I just don't see how this can scale other than:
> > - Local marks and connectors manage the subscriptions on local machine
> > - Protocol updates the server with the combined masks for watched objects
> >
> > I think that the "post-mortem events" issue could be solved by keeping an
> > S_DEAD fuse inode object in limbo just for the mark.
> > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > an inode, the fuse client inode can be finally evicted.
>
> There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come
> or when will it come. If another guest has reference on inode it might
> not come for a long time. And this will kind of become a mechanism
> for one guest to keep other's inode cache full of such objects.
>
> If event queue becomes too full, we might drop these events. But I guess
> in that case we will have to generate IN_Q_OVERFLOW and that can somehow
> be used to cleanup such S_DEAD inodes?

That depends on the server implementation.
If the server is watching the host fs using a fanotify filesystem mark,
then an overflow event does NOT mean that new events on an inode may be
missed, only that old events could have been missed.
The server should know about all the watched inodes, so on overflow it
can check whether any of the watched inodes were deleted and notify the
client using a reliable channel.

Given the current server implementation with inotify, IN_Q_OVERFLOW
means the server may have lost an IN_IGNORED event and may not get any
more events on an inode, so the server should check all the watched
inodes after an overflow, notify the client of all deleted inodes, and
try to re-create the watches for all inodes with a known path, or use
the magic /proc/pid/fd path if that works (??).

>
> nodeid is managed by server. So I am assuming that FORGET messages will
> not be sent to server for this inode till we have seen FS_IN_IGNORED
> and FS_DELETE_SELF events?
>

Or until the application that requested the watch calls
inotify_rm_watch() or closes the inotify fd.

IOW, when the fs implements remote fsnotify, the local watch keeps the
local deleted inode object in limbo until the local watch is removed.
When the remote fsnotify server informs us that the remote watch (or
remote inode) is gone, the local watch is removed as well, and then the
inotify application also gets an FS_IN_IGNORED event.

Lifetime of local inode is complicated and lifetime of this "shared inode"
is much more complicated, so I am not pretending to claim that I have this all
figured out or that it could be reliably done at all.

Thanks,
Amir.


* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-02 12:54                   ` Amir Goldstein
  2021-11-02 20:34                     ` Vivek Goyal
@ 2021-11-03 10:09                     ` Jan Kara
  2021-11-03 11:17                       ` Amir Goldstein
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Kara @ 2021-11-03 10:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Vivek Goyal, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Tue 02-11-21 14:54:01, Amir Goldstein wrote:
> On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > > On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > > > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > > > <iangelak@redhat.com> wrote:
> > > > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > > > subsystem will drop it since the local inode is not there.
> > > > > >
> > > > >
> > > > > I have a feeling that we are mixing issues related to shared server
> > > > > and remote fsnotify.
> > > >
> > > > I don't think Ioannis was speaking about shared server case here. I think
> > > > he says that in a simple FUSE remote notification setup we can lose OPEN
> > > > events (or basically any other) if the inode for which the event happens
> > > > gets deleted sufficiently early after the event being generated. That seems
> > > > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > > > some directory operations.
> > >
> > > Hi Jan,
> > >
> > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > both for shared and non-shared directory case.
> > >
> > > With local filesystems we have a control that we can first queue up
> > > the event in buffer before we remove local watches. With events travelling
> > > from a remote server, there is no such control/synchronization. It can
> > > very well happen that events got delayed in the communication path
> > > somewhere and local watches went away and now there is no way to
> > > deliver those events to the application.
> >
> > So after thinking for some time about this I have the following question
> > about the architecture of this solution: Why do you actually have local
> > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > cannot we have fsnotify marks only on FUSE server and generate all events
> > there? When e.g. file is created from the client, client tells the server
> > about creation, the server performs the creation which generates the
> > fsnotify event, that is received by the server and forwarded back to the
> > client which just queues it into notification group's queue for userspace
> > to read it.
> >
> > Now with this architecture there's no problem with duplicate events for
> > local & server notification marks, similarly there's no problem with lost
> > events after inode deletion because events received by the client are
> > directly queued into notification queue without any checking whether inode
> > is still alive etc. Would this work or am I missing something?
> >
> 
> What about group #1 that wants mask A and group #2 that wants mask B
> events?
> 
> Do you propose to maintain separate event queues over the protocol?
> Attach a "recipient list" to each event?

Yes, that was my idea. Essentially when we see group A creates mark on FUSE
for path P, we notify server, it will create notification group A on the
server (if not already existing - there we need some notification group
identifier unique among all clients), and place mark for it on path P. Then
the full stream of notification events generated for group A on the server
will just be forwarded to the client and inserted into the A's notification
queue. IMO this is very simple solution to implement - you just need to
forward mark addition / removal events from the client to the server and you
forward event stream from the server to the client. Everything else is
handled by the fsnotify infrastructure on the server.

> I just don't see how this can scale other than:
> - Local marks and connectors manage the subscriptions on local machine
> - Protocol updates the server with the combined masks for watched objects

I agree that depending on the usecase and particular FUSE filesystem
performance of this solution may be a concern. OTOH the only additional
cost of this solution I can see (compared to all those processes just
watching files locally) is the passing of the events from the server to the
client. For local FUSE filesystems such as virtiofs this should be rather
cheap since you have to do very little processing for each generated event.
For filesystems such as sshfs, I can imagine this would be a bigger deal.

Also one problem I can see with my proposal is that it will have problems
with stuff such as leases - i.e., if the client does not notify the server
of the changes quickly but rather batches local operations and tells the
server about them only on special occasions. I don't know enough about FUSE
filesystems to tell whether this is a frequent problem or not.

> I think that the "post-mortem events" issue could be solved by keeping an
> S_DEAD fuse inode object in limbo just for the mark.
> When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> an inode, the fuse client inode can be finally evicted.
> I haven't tried to see how hard that would be to implement.

Sure, there can be other solutions to this particular problem. I just
want to discuss the other architecture to see why we cannot do it in a
simple way :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03 10:09                     ` Jan Kara
@ 2021-11-03 11:17                       ` Amir Goldstein
  2021-11-03 22:36                         ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-11-03 11:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

> > > > Hi Jan,
> > > >
> > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > both for shared and non-shared directory case.
> > > >
> > > > With local filesystems we have a control that we can first queue up
> > > > the event in buffer before we remove local watches. With events travelling
> > > > from a remote server, there is no such control/synchronization. It can
> > > > very well happen that events got delayed in the communication path
> > > > somewhere and local watches went away and now there is no way to
> > > > deliver those events to the application.
> > >
> > > So after thinking for some time about this I have the following question
> > > about the architecture of this solution: Why do you actually have local
> > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > there? When e.g. file is created from the client, client tells the server
> > > about creation, the server performs the creation which generates the
> > > fsnotify event, that is received by the server and forwarded back to the
> > > client which just queues it into notification group's queue for userspace
> > > to read it.
> > >
> > > Now with this architecture there's no problem with duplicate events for
> > > local & server notification marks, similarly there's no problem with lost
> > > events after inode deletion because events received by the client are
> > > directly queued into notification queue without any checking whether inode
> > > is still alive etc. Would this work or am I missing something?
> > >
> >
> > What about group #1 that wants mask A and group #2 that wants mask B
> > events?
> >
> > Do you propose to maintain separate event queues over the protocol?
> > Attach a "recipient list" to each event?
>
> Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> for path P, we notify server, it will create notification group A on the
> server (if not already existing - there we need some notification group
> identifier unique among all clients), and place mark for it on path P. Then
> the full stream of notification events generated for group A on the server
> will just be forwarded to the client and inserted into the A's notification
> queue. IMO this is very simple solution to implement - you just need to
> forward mark addition / removal events from the client to the server and you
> forward event stream from the server to the client. Everything else is
> handled by the fsnotify infrastructure on the server.
>
> > I just don't see how this can scale other than:
> > - Local marks and connectors manage the subscriptions on local machine
> > - Protocol updates the server with the combined masks for watched objects
>
> I agree that depending on the usecase and particular FUSE filesystem
> performance of this solution may be a concern. OTOH the only additional
> cost of this solution I can see (compared to all those processes just
> watching files locally) is the passing of the events from the server to the
> client. For local FUSE filesystems such as virtiofs this should be rather
> cheap since you have to do very little processing for each generated event.
> For filesystems such as sshfs, I can imagine this would be a bigger deal.
>
> Also one problem I can see with my proposal is that it will have problems
> with stuff such as leases - i.e., if the client does not notify the server
> of the changes quickly but rather batches local operations and tells the
> server about them only on special occasions. I don't know enough about FUSE
> filesystems to tell whether this is a frequent problem or not.
>
> > I think that the "post-mortem events" issue could be solved by keeping an
> > S_DEAD fuse inode object in limbo just for the mark.
> > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > an inode, the fuse client inode can be finally evicted.
> > I haven't tried to see how hard that would be to implement.
>
> Sure, there can be other solutions to this particular problem. I just
> > want to discuss the other architecture to see why we cannot do it in a
> simple way :).
>

Fair enough.

Beyond the scalability aspects, I think that a design that exposes the
group to the remote server and allows "injecting" events into the group
queue will prevent users from getting useful features going forward.

For example, fanotify ignored_mask could be added to a group, even on
a mount mark, even if the remote server only supports inode marks and it
would just work.

Another point of view for the post-mortem events:
As Miklos once noted and as you wrote above, for cache coherency and leases,
an async notification queue is not adequate and synchronous notifications are
too costly, so there needs to be some shared memory solution involving guest
cache invalidation by host.

Suppose said cache invalidation solution would be able to set a variety
of "dirty" flags, not just one type of dirty, or to call it another way,
an "event mask".
If that is available, then when a fuse inode gets evicted, the events
from the "event mask" can be queued before destroying the inode and
mark, and the post-mortem event issue is averted...

Thanks,
Amir.


* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03  7:31                       ` Amir Goldstein
@ 2021-11-03 22:29                         ` Vivek Goyal
  2021-11-04  5:19                           ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-11-03 22:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

On Wed, Nov 03, 2021 at 09:31:02AM +0200, Amir Goldstein wrote:
> > > >
> > >
> > > What about group #1 that wants mask A and group #2 that wants mask B
> > > events?
> > >
> > > Do you propose to maintain separate event queues over the protocol?
> > > Attach a "recipient list" to each event?
> > >
> > > I just don't see how this can scale other than:
> > > - Local marks and connectors manage the subscriptions on local machine
> > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I think that the "post-mortem events" issue could be solved by keeping an
> > > S_DEAD fuse inode object in limbo just for the mark.
> > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > an inode, the fuse client inode can be finally evicted.
> >
> > There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come
> > or when will it come. If another guest has reference on inode it might
> > not come for a long time. And this will kind of become a mechanism
> > for one guest to keep other's inode cache full of such objects.
> >
> > If event queue becomes too full, we might drop these events. But I guess
> > in that case we will have to generate IN_Q_OVERFLOW and that can somehow
> > be used to cleanup such S_DEAD inodes?
> 
> That depends on the server implementation.
> If the server is watching host fs using fanotify filesystem mark, then
> an overflow
> event does NOT mean that other new events on inode may be missed only
> that old events could have been missed.
> Server should know about all the watched inodes, so it can check on overflow
> if any of the watched inodes were deleted and notify the client using a reliable
> channel.

Ok. We have only one channel for notifications. I guess we can program
the channel in such a way that it does not drop overflow events but can
drop other kinds of events if things get crazy. If there are too many
overflow events and we allocate too much memory, I guess at some point
the OOM killer will kick in and kill the server.

> 
> Given the current server implementation with inotify, IN_Q_OVERFLOW
> means server may have lost an IN_IGNORED event and may not get any
> more events on inode, so server should check all the watched inodes after
> overflow, notify the client of all deleted inodes and try to re-create
> the watches
> for all inodes with known path or use magic /proc/pid/fd path if that
> works (??).

Re-doing the watches sounds very painful. That means we will need to
keep track of the aggregated mask in the server-side inode as well. As
of now we just pass the mask to the kernel using inotify_add_watch()
and forget about it.

/proc/pid/fd should work because I think that's how Ioannis is placing
the current watches on inodes. We don't send path info to the server.

> 
> >
> > nodeid is managed by server. So I am assuming that FORGET messages will
> > not be sent to server for this inode till we have seen FS_IN_IGNORED
> > and FS_DELETE_SELF events?
> >
> 
> Or until the application that requested the watch calls
> inotify_rm_watch() or closes
> the inotify fd.
> 
> IOW, when fs implements remote fsnotify, the local watch keeps the local deleted
> inode object in limbo until the local watch is removed.
> When the remote fsnotify server informs that the remote watch (or remote inode)
> is gone, the local watch is removed as well and then the inotify
> application also gets
> an FS_IN_IGNORED event.

Hmm, I guess the remote server will simply send the IN_DELETE event when
it gets it and forward it to the client. The client will then have to
clean up this S_DEAD inode, which is in limbo waiting for the
IN_DELETE_SELF event. And that should trigger cleanup of the marks/local
watches on the inode, IIUC.

> 
> Lifetime of local inode is complicated and lifetime of this "shared inode"
> is much more complicated, so I am not pretending to claim that I have this all
> figured out or that it could be reliably done at all.

Yes, this handling of IN_DELETE_SELF is turning out to be the most
complicated piece of this proposal. I wish the initial implementation
could be designed so that it does not send IN_DELETE_SELF and IN_IGNORED
is generated locally, and later enhanced to support reliable delivery of
IN_DELETE_SELF.

Thanks
Vivek



* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03 11:17                       ` Amir Goldstein
@ 2021-11-03 22:36                         ` Vivek Goyal
  2021-11-04  5:29                           ` Amir Goldstein
  2021-11-04 10:03                           ` Jan Kara
  0 siblings, 2 replies; 47+ messages in thread
From: Vivek Goyal @ 2021-11-03 22:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > Hi Jan,
> > > > >
> > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > both for shared and non-shared directory case.
> > > > >
> > > > > With local filesystems we have a control that we can first queue up
> > > > > the event in buffer before we remove local watches. With events travelling
> > > > > from a remote server, there is no such control/synchronization. It can
> > > > > very well happen that events got delayed in the communication path
> > > > > somewhere and local watches went away and now there is no way to
> > > > > deliver those events to the application.
> > > >
> > > > So after thinking for some time about this I have the following question
> > > > about the architecture of this solution: Why do you actually have local
> > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > there? When e.g. file is created from the client, client tells the server
> > > > about creation, the server performs the creation which generates the
> > > > fsnotify event, that is received by the server and forwarded back to the
> > > > client which just queues it into notification group's queue for userspace
> > > > to read it.
> > > >
> > > > Now with this architecture there's no problem with duplicate events for
> > > > local & server notification marks, similarly there's no problem with lost
> > > > events after inode deletion because events received by the client are
> > > > directly queued into notification queue without any checking whether inode
> > > > is still alive etc. Would this work or am I missing something?
> > > >
> > >
> > > What about group #1 that wants mask A and group #2 that wants mask B
> > > events?
> > >
> > > Do you propose to maintain separate event queues over the protocol?
> > > Attach a "recipient list" to each event?
> >
> > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > for path P, we notify server, it will create notification group A on the
> > server (if not already existing - there we need some notification group
> > identifier unique among all clients), and place mark for it on path P. Then
> > the full stream of notification events generated for group A on the server
> > will just be forwarded to the client and inserted into the A's notification
> > queue. IMO this is very simple solution to implement - you just need to
> > forward mark addition / removal events from the client to the server and you
> > forward event stream from the server to the client. Everything else is
> > handled by the fsnotify infrastructure on the server.
> >
> > > I just don't see how this can scale other than:
> > > - Local marks and connectors manage the subscriptions on local machine
> > > - Protocol updates the server with the combined masks for watched objects
> >
> > I agree that depending on the usecase and particular FUSE filesystem
> > performance of this solution may be a concern. OTOH the only additional
> > cost of this solution I can see (compared to all those processes just
> > watching files locally) is the passing of the events from the server to the
> > client. For local FUSE filesystems such as virtiofs this should be rather
> > cheap since you have to do very little processing for each generated event.
> > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> >
> > Also one problem I can see with my proposal is that it will have problems
> > with stuff such as leases - i.e., if the client does not notify the server
> > of the changes quickly but rather batches local operations and tells the
> > server about them only on special occasions. I don't know enough about FUSE
> > filesystems to tell whether this is a frequent problem or not.
> >
> > > I think that the "post-mortem events" issue could be solved by keeping an
> > > S_DEAD fuse inode object in limbo just for the mark.
> > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > an inode, the fuse client inode can be finally evicted.
> > > I haven't tried to see how hard that would be to implement.
> >
> > Sure, there can be other solutions to this particular problem. I just
> > want to discuss the other architecture to see why we cannot do it in a
> > simple way :).
> >
> 
> Fair enough.
> 
> Beyond the scalability aspects, I think that a design that exposes the group
> to the remote server and allows to "inject" events to the group queue
> will prevent
> users from useful features going forward.
> 
> For example, fanotify ignored_mask could be added to a group, even on
> a mount mark, even if the remote server only supports inode marks and it
> would just work.
> 
> Another point of view for the post-mortem events:
> As Miklos once noted and as you wrote above, for cache coherency and leases,
> an async notification queue is not adequate and synchronous notifications are
> too costly, so there needs to be some shared memory solution involving guest
> cache invalidation by host.

Any shared memory solution works only in a limited setup. If the server
is remote on another machine, there is no sharing. I am hoping that this
can be generic enough to support other remote filesystems down the line.

> 
> Suppose said cache invalidation solution would be able to set a variety of
> "dirty" flags, not just one type of dirty or to call in another way -
> an "event mask".
> If that is available, then when a fuse inode gets evicted, the events from the
> "event mask" can be queued before destroying the inode and mark -
> post mortem event issue averted...

This assumes that the server itself got the "IN_DELETE_SELF" event
when fuse is destroying its inode. But the inode might still be alive
because another client has an fd open.

Even if no other client has an fd open, this still sounds racy. By
the time we set the inode event_mask (using shared memory, instead of
sending an event notification), fuse might have cleaned up its inode.

There is a good chance I completely misunderstood what you are suggesting
here. :-)

Thanks
Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03 22:29                         ` Vivek Goyal
@ 2021-11-04  5:19                           ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-11-04  5:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

> > > If the event queue becomes too full, we might drop these events. But I
> > > guess in that case we will have to generate IN_Q_OVERFLOW and that can
> > > somehow be used to clean up such S_DEAD inodes?
> >
> > That depends on the server implementation.
> > If the server is watching the host fs using a fanotify filesystem mark,
> > then an overflow event does NOT mean that new events on the inode may be
> > missed, only that old events could have been missed.
> > The server should know about all the watched inodes, so it can check on
> > overflow whether any of the watched inodes were deleted and notify the
> > client using a reliable channel.
>
> Ok. We have only one channel for notifications. I guess we can program
> the channel in such a way that it does not drop overflow events but
> can drop other kinds of events if things get crazy. If there are too
> many overflow events and we allocate too much memory, I guess at some
> point the OOM killer will kick in and kill the server.
>

The kernel implementation of the fsnotify event queue pre-allocates
a single overflow event and never queues more than a single overflow
event. IN_Q_OVERFLOW must be delivered reliably, but delivering one
overflow event is enough (until it is consumed).
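
The single pre-allocated overflow event can be illustrated with a toy
model (a sketch in Python, not the kernel code; all names are made up):

```python
class EventQueue:
    """Toy model of an fsnotify-style queue: bounded, with a single
    pre-allocated overflow event that is queued at most once until
    the reader consumes it."""

    OVERFLOW = "IN_Q_OVERFLOW"

    def __init__(self, max_events):
        self.max_events = max_events
        self.events = []
        self.overflow_queued = False  # models the one pre-allocated slot

    def queue(self, event):
        if len(self.events) < self.max_events:
            self.events.append(event)
        elif not self.overflow_queued:
            # Drop the event, but reliably queue exactly one overflow marker.
            self.events.append(self.OVERFLOW)
            self.overflow_queued = True
        # else: silently drop; an overflow event is already pending.

    def read(self):
        ev = self.events.pop(0)
        if ev == self.OVERFLOW:
            self.overflow_queued = False  # the slot becomes available again
        return ev
```

Once the reader consumes the overflow event, a later burst can queue a
new one, so the reader always learns about loss without the queue
growing without bound.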

> >
> > Given the current server implementation with inotify, IN_Q_OVERFLOW
> > means the server may have lost an IN_IGNORED event and may not get any
> > more events on the inode, so the server should check all the watched
> > inodes after an overflow, notify the client of all deleted inodes and
> > try to re-create the watches for all inodes with a known path, or use
> > the magic /proc/pid/fd path if that works (??).
>
> Re-doing the watches sounds very painful.

Event overflow is a painful incident and systems usually pay a large
penalty when it happens (e.g. full recrawl of watched tree).
If virtiofsd is going to use inotify, it is no different than any other inotify
application that needs to bear the consequence of event overflow.

> That means we will need to keep track of the aggregated mask in the
> server-side inode as well. As of now we just pass the mask to the kernel
> using inotify_add_watch() and forget about it.
>

It costs nothing to keep the aggregated mask in the server-side inode,
and it makes sense to do that anyway.
This allows an implementation to notify about changes that the server
itself handles even if there is no backing filesystem behind it or the
host OS has no fs notification support.
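
A sketch of what keeping an aggregated mask per server-side inode could
look like (Python pseudocode; the structure and names are hypothetical,
while the IN_* values match inotify's):

```python
# Hypothetical server-side bookkeeping: OR the masks of all client
# watches on an inode into one aggregated mask, so the server can
# cheaply decide whether any watcher cares about an event -- even with
# no backing filesystem or host fs notification support.
IN_MODIFY      = 0x00000002  # values as in <linux/inotify.h>
IN_DELETE_SELF = 0x00000400

class ServerInode:
    def __init__(self):
        self.watches = {}   # client id -> requested event mask
        self.aggregated = 0

    def _recompute(self):
        self.aggregated = 0
        for mask in self.watches.values():
            self.aggregated |= mask

    def add_watch(self, client, mask):
        self.watches[client] = mask
        self._recompute()

    def rm_watch(self, client):
        self.watches.pop(client, None)
        self._recompute()

    def should_notify(self, event_mask):
        # True if at least one client asked for this kind of event.
        return bool(self.aggregated & event_mask)
```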

> /proc/pid/fd should work because I think that's how Ioannis is putting
> current watches on inodes. We don't send path info to server.
>
> >
> > >
> > > nodeid is managed by the server. So I am assuming that FORGET messages
> > > will not be sent to the server for this inode until we have seen
> > > FS_IN_IGNORED and FS_DELETE_SELF events?
> > >
> >
> > Or until the application that requested the watch calls
> > inotify_rm_watch() or closes
> > the inotify fd.
> >
> > IOW, when fs implements remote fsnotify, the local watch keeps the local deleted
> > inode object in limbo until the local watch is removed.
> > When the remote fsnotify server informs that the remote watch (or remote inode)
> > is gone, the local watch is removed as well and then the inotify
> > application also gets
> > an FS_IN_IGNORED event.
>
> Hmm, I guess the remote server will simply send the IN_DELETE event when
> it gets it and forward it to the client. The client will then have to
> clean up this S_DEAD inode which is in limbo waiting for the
> IN_DELETE_SELF event. And that should trigger cleanup of the
> marks/local watches on the inode, IIUC.
>

In very broad lines, yes, but the server notification must be delivered reliably.

> >
> > Lifetime of local inode is complicated and lifetime of this "shared inode"
> > is much more complicated, so I am not pretending to claim that I have this all
> > figured out or that it could be reliably done at all.
>
> Yes, this handling of IN_DELETE_SELF is turning out to be the most
> complicated piece of this proposal. I wish the initial implementation
> could just be designed so that it does not send IN_DELETE_SELF and
> IN_IGNORED is generated locally, and later enhanced to support
> reliable delivery of IN_DELETE_SELF.
>

Not allowing DELETE_SELF in the mask sounds reasonable, but
as Ioannis explained, other events can be missed on local file delete.
If you want to preserve inotify semantics, you could queue an overflow
event if a fuse inode that gets evicted still has inotify marks.
That's a bit harsh though.
Alternatively, you could document in the inotify man page that IN_IGNORED
could mean that some events were dropped and hope for the best...
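
The "queue an overflow event on eviction" idea amounts to something like
this (hypothetical sketch in Python; the function and field names are
made up):

```python
# Hypothetical eviction policy: if a fuse inode is evicted while inotify
# marks are still attached, queue IN_Q_OVERFLOW ("events may have been
# lost") followed by IN_IGNORED ("the watch is gone") to every watching
# group, instead of dropping the marks silently.
def evict_inode(inode, group_queues):
    """inode: {'marks': [group id, ...]}; group_queues: group id -> list."""
    for group in inode["marks"]:
        group_queues[group].append("IN_Q_OVERFLOW")
        group_queues[group].append("IN_IGNORED")
    inode["marks"].clear()
```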

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03 22:36                         ` Vivek Goyal
@ 2021-11-04  5:29                           ` Amir Goldstein
  2021-11-04 10:03                           ` Jan Kara
  1 sibling, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-11-04  5:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

On Thu, Nov 4, 2021 at 12:36 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > Hi Jan,
> > > > > >
> > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > both for shared and non-shared directory case.
> > > > > >
> > > > > > With local filesystems we have a control that we can first queue up
> > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > very well happen that events got delayed in the communication path
> > > > > > somewhere and local watches went away and now there is no way to
> > > > > > deliver those events to the application.
> > > > >
> > > > > So after thinking for some time about this I have the following question
> > > > > about the architecture of this solution: Why do you actually have local
> > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > there? When e.g. file is created from the client, client tells the server
> > > > > about creation, the server performs the creation which generates the
> > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > client which just queues it into notification group's queue for userspace
> > > > > to read it.
> > > > >
> > > > > Now with this architecture there's no problem with duplicate events for
> > > > > local & server notification marks, similarly there's no problem with lost
> > > > > events after inode deletion because events received by the client are
> > > > > directly queued into notification queue without any checking whether inode
> > > > > is still alive etc. Would this work or am I missing something?
> > > > >
> > > >
> > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > events?
> > > >
> > > > Do you propose to maintain separate event queues over the protocol?
> > > > Attach a "recipient list" to each event?
> > >
> > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > for path P, we notify server, it will create notification group A on the
> > > server (if not already existing - there we need some notification group
> > > identifier unique among all clients), and place mark for it on path P. Then
> > > the full stream of notification events generated for group A on the server
> > > will just be forwarded to the client and inserted into the A's notification
> > > queue. IMO this is very simple solution to implement - you just need to
> > > forward mark addition / removal events from the client to the server and you
> > > forward event stream from the server to the client. Everything else is
> > > handled by the fsnotify infrastructure on the server.
> > >
> > > > I just don't see how this can scale other than:
> > > > - Local marks and connectors manage the subscriptions on local machine
> > > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I agree that depending on the usecase and particular FUSE filesystem
> > > performance of this solution may be a concern. OTOH the only additional
> > > cost of this solution I can see (compared to all those processes just
> > > watching files locally) is the passing of the events from the server to the
> > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > cheap since you have to do very little processing for each generated event.
> > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > >
> > > Also one problem I can see with my proposal is that it will have problems
> > > with stuff such as leases - i.e., if the client does not notify the server
> > > of the changes quickly but rather batches local operations and tells the
> > > server about them only on special occasions. I don't know enough about FUSE
> > > filesystems to tell whether this is a frequent problem or not.
> > >
> > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > an inode, the fuse client inode can be finally evicted.
> > > > I haven't tried to see how hard that would be to implement.
> > >
> > > Sure, there can be other solutions to this particular problem. I just
> > > want to discuss the other architecture to see why we cannot do it in a
> > > simple way :).
> > >
> >
> > Fair enough.
> >
> > Beyond the scalability aspects, I think that a design that exposes the group
> > to the remote server and allows it to "inject" events into the group queue
> > will prevent users from getting useful features going forward.
> >
> > For example, fanotify ignored_mask could be added to a group, even on
> > a mount mark, even if the remote server only supports inode marks and it
> > would just work.
> >
> > Another point of view for the post-mortem events:
> > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > an async notification queue is not adequate and synchronous notifications are
> > too costly, so there needs to be some shared memory solution involving guest
> > cache invalidation by host.
>
> Any shared memory solution works only in a limited setup. If the server
> is remote on another machine, there is no sharing. I am hoping that this
> can be generic enough to support other remote filesystems down the line.
>

I do too :)

> >
> > Suppose said cache invalidation solution would be able to set a variety of
> > "dirty" flags, not just one type of dirty or to call in another way -
> > an "event mask".
> > If that is available, then when a fuse inode gets evicted, the events from the
> > "event mask" can be queued before destroying the inode and mark -
> > post mortem event issue averted...
>
> This assumes that the server itself got the "IN_DELETE_SELF" event
> when fuse is destroying its inode. But the inode might still be alive
> because another client has an fd open.
>
> Even if no other client has an fd open, this still sounds racy. By
> the time we set the inode event_mask (using shared memory, instead of
> sending an event notification), fuse might have cleaned up its inode.
>

There is no escape from some sort of lease design for a reliable
and efficient shared remote fs.
Unless the client has an exclusive lease on the inode, it must allow a
long enough grace period before cleaning up the inode to wait for an
update from the server, if the client cares about getting all events on
the inode.

> There is a good chance I completely misunderstood what you are suggesting
> here. :-)
>

There is a good chance that I am talking nonsense :-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-03 22:36                         ` Vivek Goyal
  2021-11-04  5:29                           ` Amir Goldstein
@ 2021-11-04 10:03                           ` Jan Kara
  2021-11-05 14:30                             ` Vivek Goyal
  1 sibling, 1 reply; 47+ messages in thread
From: Jan Kara @ 2021-11-04 10:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, Jan Kara, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Wed 03-11-21 18:36:06, Vivek Goyal wrote:
> On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > Hi Jan,
> > > > > >
> > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > both for shared and non-shared directory case.
> > > > > >
> > > > > > With local filesystems we have a control that we can first queue up
> > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > very well happen that events got delayed in the communication path
> > > > > > somewhere and local watches went away and now there is no way to
> > > > > > deliver those events to the application.
> > > > >
> > > > > So after thinking for some time about this I have the following question
> > > > > about the architecture of this solution: Why do you actually have local
> > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > there? When e.g. file is created from the client, client tells the server
> > > > > about creation, the server performs the creation which generates the
> > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > client which just queues it into notification group's queue for userspace
> > > > > to read it.
> > > > >
> > > > > Now with this architecture there's no problem with duplicate events for
> > > > > local & server notification marks, similarly there's no problem with lost
> > > > > events after inode deletion because events received by the client are
> > > > > directly queued into notification queue without any checking whether inode
> > > > > is still alive etc. Would this work or am I missing something?
> > > > >
> > > >
> > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > events?
> > > >
> > > > Do you propose to maintain separate event queues over the protocol?
> > > > Attach a "recipient list" to each event?
> > >
> > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > for path P, we notify server, it will create notification group A on the
> > > server (if not already existing - there we need some notification group
> > > identifier unique among all clients), and place mark for it on path P. Then
> > > the full stream of notification events generated for group A on the server
> > > will just be forwarded to the client and inserted into the A's notification
> > > queue. IMO this is very simple solution to implement - you just need to
> > > forward mark addition / removal events from the client to the server and you
> > > forward event stream from the server to the client. Everything else is
> > > handled by the fsnotify infrastructure on the server.
> > >
> > > > I just don't see how this can scale other than:
> > > > - Local marks and connectors manage the subscriptions on local machine
> > > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I agree that depending on the usecase and particular FUSE filesystem
> > > performance of this solution may be a concern. OTOH the only additional
> > > cost of this solution I can see (compared to all those processes just
> > > watching files locally) is the passing of the events from the server to the
> > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > cheap since you have to do very little processing for each generated event.
> > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > >
> > > Also one problem I can see with my proposal is that it will have problems
> > > with stuff such as leases - i.e., if the client does not notify the server
> > > of the changes quickly but rather batches local operations and tells the
> > > server about them only on special occasions. I don't know enough about FUSE
> > > filesystems to tell whether this is a frequent problem or not.
> > >
> > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > an inode, the fuse client inode can be finally evicted.
> > > > I haven't tried to see how hard that would be to implement.
> > >
> > > Sure, there can be other solutions to this particular problem. I just
> > > want to discuss the other architecture to see why we cannot do it in a
> > > simple way :).
> > >
> > 
> > Fair enough.
> > 
> > Beyond the scalability aspects, I think that a design that exposes the group
> > to the remote server and allows it to "inject" events into the group queue
> > will prevent users from getting useful features going forward.
> > 
> > For example, fanotify ignored_mask could be added to a group, even on
> > a mount mark, even if the remote server only supports inode marks and it
> > would just work.
> > 
> > Another point of view for the post-mortem events:
> > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > an async notification queue is not adequate and synchronous notifications are
> > too costly, so there needs to be some shared memory solution involving guest
> > cache invalidation by host.
> 
> Any shared memory solution works only in a limited setup. If the server
> is remote on another machine, there is no sharing. I am hoping that this
> can be generic enough to support other remote filesystems down the line.

OK, so do I understand both you and Amir correctly that you think that
always relying on the FUSE server for generating the events and just piping
them to the client is not long-term viable design for FUSE? Mostly because
caching of modifications on the client is essentially inevitable and hence
generating events from the server would be unreliable (delayed too much)?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-04 10:03                           ` Jan Kara
@ 2021-11-05 14:30                             ` Vivek Goyal
  2021-11-10  6:28                               ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2021-11-05 14:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Thu, Nov 04, 2021 at 11:03:16AM +0100, Jan Kara wrote:
> On Wed 03-11-21 18:36:06, Vivek Goyal wrote:
> > On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > > Hi Jan,
> > > > > > >
> > > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > > both for shared and non-shared directory case.
> > > > > > >
> > > > > > > With local filesystems we have a control that we can first queue up
> > > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > > very well happen that events got delayed in the communication path
> > > > > > > somewhere and local watches went away and now there is no way to
> > > > > > > deliver those events to the application.
> > > > > >
> > > > > > So after thinking for some time about this I have the following question
> > > > > > about the architecture of this solution: Why do you actually have local
> > > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > > there? When e.g. file is created from the client, client tells the server
> > > > > > about creation, the server performs the creation which generates the
> > > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > > client which just queues it into notification group's queue for userspace
> > > > > > to read it.
> > > > > >
> > > > > > Now with this architecture there's no problem with duplicate events for
> > > > > > local & server notification marks, similarly there's no problem with lost
> > > > > > events after inode deletion because events received by the client are
> > > > > > directly queued into notification queue without any checking whether inode
> > > > > > is still alive etc. Would this work or am I missing something?
> > > > > >
> > > > >
> > > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > > events?
> > > > >
> > > > > Do you propose to maintain separate event queues over the protocol?
> > > > > Attach a "recipient list" to each event?
> > > >
> > > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > > for path P, we notify server, it will create notification group A on the
> > > > server (if not already existing - there we need some notification group
> > > > identifier unique among all clients), and place mark for it on path P. Then
> > > > the full stream of notification events generated for group A on the server
> > > > will just be forwarded to the client and inserted into the A's notification
> > > > queue. IMO this is very simple solution to implement - you just need to
> > > > forward mark addition / removal events from the client to the server and you
> > > > forward event stream from the server to the client. Everything else is
> > > > handled by the fsnotify infrastructure on the server.
> > > >
> > > > > I just don't see how this can scale other than:
> > > > > - Local marks and connectors manage the subscriptions on local machine
> > > > > - Protocol updates the server with the combined masks for watched objects
> > > >
> > > > I agree that depending on the usecase and particular FUSE filesystem
> > > > performance of this solution may be a concern. OTOH the only additional
> > > > cost of this solution I can see (compared to all those processes just
> > > > watching files locally) is the passing of the events from the server to the
> > > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > > cheap since you have to do very little processing for each generated event.
> > > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > > >
> > > > Also one problem I can see with my proposal is that it will have problems
> > > > with stuff such as leases - i.e., if the client does not notify the server
> > > > of the changes quickly but rather batches local operations and tells the
> > > > server about them only on special occasions. I don't know enough about FUSE
> > > > filesystems to tell whether this is a frequent problem or not.
> > > >
> > > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > > an inode, the fuse client inode can be finally evicted.
> > > > > I haven't tried to see how hard that would be to implement.
> > > >
> > > > Sure, there can be other solutions to this particular problem. I just
> > > > want to discuss the other architecture to see why we cannot do it in a
> > > > simple way :).
> > > >
> > > 
> > > Fair enough.
> > > 
> > > Beyond the scalability aspects, I think that a design that exposes the group
> > > to the remote server and allows it to "inject" events into the group queue
> > > will prevent users from getting useful features going forward.
> > > 
> > > For example, fanotify ignored_mask could be added to a group, even on
> > > a mount mark, even if the remote server only supports inode marks and it
> > > would just work.
> > > 
> > > Another point of view for the post-mortem events:
> > > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > > an async notification queue is not adequate and synchronous notifications are
> > > too costly, so there needs to be some shared memory solution involving guest
> > > cache invalidation by host.
> > 
> > Any shared memory solution works only in a limited setup. If the server
> > is remote on another machine, there is no sharing. I am hoping that this
> > can be generic enough to support other remote filesystems down the line.
> 
> OK, so do I understand both you and Amir correctly that you think that
> always relying on the FUSE server for generating the events and just piping
> them to the client is not long-term viable design for FUSE? Mostly because
> caching of modifications on the client is essentially inevitable and hence
> generating events from the server would be unreliable (delayed too much)?

Hi Jan,

Actually I had not even thought about operation caching in clients. IIUC,
as of now we only have modes to support caching of buffered writes in fuse
(which can be flushed later, -o writeback). Other file operations should go
to the server.

To me, it sounds reasonable for the FUSE server to generate events, and
that's what we are doing in this RFC proposal. So the idea is that an
application is effectively watching and receiving events for changes
happening at the remote server.

As of now local events will be suppressed, so if some operations are local
to the client only, then events will not be generated, or will be generated
late when the server sees those changes.

I am not sure if suppressing all local events will serve all use cases
in the long term though. For example, Amir was mentioning fanotify
events on mount objects, and it might make sense to generate local
events there.

So the initial implementation could be that an application gets either
local events or remote events (based on the filesystem). Down the line
more complicated modes can emerge where some combination of local and
remote events could be generated and applications could specify it. That
will probably be an extension of the fanotify/inotify API.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-05 14:30                             ` Vivek Goyal
@ 2021-11-10  6:28                               ` Amir Goldstein
  2021-11-11 17:30                                 ` Jan Kara
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2021-11-10  6:28 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Ioannis Angelakopoulos, linux-fsdevel, virtio-fs-list,
	linux-kernel, Al Viro, Miklos Szeredi, Steve French

> > OK, so do I understand both you and Amir correctly that you think that
> > always relying on the FUSE server for generating the events and just piping
> > them to the client is not long-term viable design for FUSE? Mostly because
> > caching of modifications on the client is essentially inevitable and hence
> > generating events from the server would be unreliable (delayed too much)?

This is one aspect, but we do not have to tie remote notifications to
the cache coherency problem, which is more complicated.

OTOH, if virtiofs takes the route of "remote groups", why not take it one
step further and implement a "remote event queue"?
Then, it does not need to push event notifications from the server
into fsnotify at all.
Instead, the FUSE client can communicate with the server using an ioctl or
a new command to implement new_group() that returns a special FUSE file
and use FUSE POLL/READ to check/read the server's event queue.

There is already a precedent for this model with CIFS_IOC_NOTIFY
and SMB2_CHANGE_NOTIFY - the SMB protocol, Samba server and Windows
servers support watching a directory or a subtree. I think it is a
"oneshot" watch, so there is no polling or queue involved.
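
The poll/read flow for such a "remote event queue" file could look
roughly like this (Python sketch; an os.pipe() stands in for the special
FUSE event file, and the (wd, mask) record format is invented for
illustration - it is not the inotify ABI or any real FUSE command):

```python
import os
import select
import struct

EVENT_FMT = "II"   # invented (wd, mask) record; not a real wire format
EVENT_SIZE = struct.calcsize(EVENT_FMT)

def read_remote_events(event_fd, timeout_ms=100):
    """Poll the event file and decode any queued (wd, mask) records."""
    poller = select.poll()
    poller.register(event_fd, select.POLLIN)
    events = []
    if poller.poll(timeout_ms):
        data = os.read(event_fd, 4096)
        for off in range(0, len(data), EVENT_SIZE):
            events.append(struct.unpack_from(EVENT_FMT, data, off))
    return events

# A pipe stands in for the special file the hypothetical new_group()
# command would return; the write end plays the server pushing events.
rfd, wfd = os.pipe()
os.write(wfd, struct.pack(EVENT_FMT, 1, 0x002))  # wd=1, IN_MODIFY-like
os.write(wfd, struct.pack(EVENT_FMT, 1, 0x400))  # wd=1, IN_DELETE_SELF-like
```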

>
> So the initial implementation could be that an application gets either
> local events or remote events (based on the filesystem). Down the line
> more complicated modes can emerge where some combination of local and
> remote events could be generated and applications could specify it. That
> will probably be an extension of the fanotify/inotify API.
>

There is one more problem with this approach.
We cannot silently change the behavior of existing FUSE filesystems.
What if a system has an antivirus configured to scan access to virtiofs mounts?
Do you consider it reasonable that on system upgrade, the capability of
adding local watches would go away?

I understand the desire to have existing inotify applications work out of
the box to get remote notifications, but I have doubts if this goal is even
worth pursuing. Considering that the existing known use case described in this
thread is already using polling to identify changes to config files on the host,
it could just as well be using a new API to get the job done.

If we had to plan an interface without considering existing applications,
I think it would look something like:

#define FAN_MARK_INODE                                 0x00000000
#define FAN_MARK_MOUNT                               0x00000010
#define FAN_MARK_FILESYSTEM                      0x00000100
#define FAN_MARK_REMOTE_INODE                0x00001000
#define FAN_MARK_REMOTE_FILESYSTEM     0x00001100

Then, the application can choose to add a remote mark with a certain
event mask and a local mark with a different event mask without any ambiguity.
The remote events could be tagged with a flag (turning the reserved
member of fanotify_event_metadata into flags).
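
The point of separate FAN_MARK_REMOTE_* flags is that a remote and a
local mark on the same object stay distinct, each with its own event
mask. A toy model of that bookkeeping (Python sketch; the REMOTE flags
are the proposal above, not an existing kernel API, while the other
values match today's fanotify UAPI):

```python
# Existing fanotify mark-target flags (values as in <linux/fanotify.h>):
FAN_MARK_INODE             = 0x00000000
FAN_MARK_MOUNT             = 0x00000010
FAN_MARK_FILESYSTEM        = 0x00000100
# Proposed in this thread -- hypothetical, not in any kernel:
FAN_MARK_REMOTE_INODE      = 0x00001000
FAN_MARK_REMOTE_FILESYSTEM = 0x00001100

FAN_MODIFY      = 0x00000002  # real fanotify event bits
FAN_CLOSE_WRITE = 0x00000008

class Group:
    def __init__(self):
        # Keying marks by (object, target flags) keeps a remote mark
        # separate from a local mark on the same object.
        self.marks = {}

    def mark_add(self, obj, target_flags, event_mask):
        key = (obj, target_flags)
        self.marks[key] = self.marks.get(key, 0) | event_mask
```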

We have made a lot of effort to make fanotify a super set of inotify
functionality removing old UAPI mistakes and baggage along the way,
so I'd really hate to see us going down the path of ambiguous UAPI
again.

IMO, the work that Ioannis has already done to support remote
notifications with virtiofsd is perfectly valid as a private functionality
of virtiofs, just like cifs CIFS_IOC_NOTIFY.

If we want to go further and implement a "remote notification subsystem"
then I think we should take other fs (e.g. cifs) into consideration and think
about functionality beyond the plain remote watch and create an UAPI
that can be used to extend the functionality further.

Thanks,
Amir.

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-10  6:28                               ` Amir Goldstein
@ 2021-11-11 17:30                                 ` Jan Kara
  2021-11-11 20:52                                   ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2021-11-11 17:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vivek Goyal, Jan Kara, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Wed 10-11-21 08:28:13, Amir Goldstein wrote:
> > > OK, so do I understand both you and Amir correctly that you think that
> > > always relying on the FUSE server for generating the events and just piping
> > > them to the client is not long-term viable design for FUSE? Mostly because
> > > caching of modifications on the client is essentially inevitable and hence
> > > generating events from the server would be unreliable (delayed too much)?
> 
> This is one aspect, but we do not have to tie remote notifications to
> the cache coherency problem that is more complicated.
> 
> OTOH, if virtiofs takes the route of "remote groups", why not take it one step
> further and implement "remote event queue".
> Then, it does not need to push event notifications from the server
> into fsnotify at all.
> Instead, FUSE client can communicate with the server using ioctl or a new
> command to implement new_group() that returns a special FUSE file and
> use FUSE POLL/READ to check/read the server's event queue.

Yes, that could work. But I don't see how you want to avoid pushing events
to fsnotify on the client side. Suppose app creates fanotify group G, it
adds a mark for some filesystem F1 to it, then it adds marks for another
filesystem F2 - this time it is FUSE based fs. App wants to receive events
for both F1 and F2 but events for F2 are actually coming from the server
and somehow they have to get into G's queue for an app to read them.

> There is already a precedent of this model with CIFS_IOC_NOTIFY
> and SMB2_CHANGE_NOTIFY - SMB protocol, samba server and Windows server
> support watching a directory or a subtree. I think it is a "oneshot" watch, so
> there is no polling nor queue involved.

OK, are you aiming at a completely separate notification API for userspace
for FUSE-based filesystems? Then I see why you don't need to push events
to fsnotify but I don't quite like that for a few reasons:

1) Application has to be aware whether the filesystem it operates on is
FUSE based or not. That is IMO an unnecessary burden for the applications.

2) If application wants to watch for several paths potentially on multiple
filesystems, it would now need to somehow merge the event streams from FUSE
API and {fa/i}notify API.

3) It would potentially duplicate quite some parts of fsnotify API
handling.

> > So initial implementation could be about, application either gets local
> > events or remote events (based on filesystem). Down the line more
> > complicated modes can emerge where some combination of local and remote
> > events could be generated and applications could specify it. That
> > probably will be an extension of the fanotify/inotify API.
> 
> There is one more problem with this approach.
> We cannot silently change the behavior of existing FUSE filesystems.
> What if a system has antivirus configured to scan access to virtiofs mounts?
> Do you consider it reasonable that on system upgrade, the capability of
> adding local watches would go away?
 
I agree we have to be careful there. If fanotify / inotify is currently
usable with virtiofs, just missing events coming from host changes (which
are presumably rare for lots of setups), it is likely used by someone and
we should not regress it.

> I understand the desire to have existing inotify applications work out of
> the box to get remote notifications, but I have doubts if this goal is even
> worth pursuing. Considering that the existing known use case described in this
> thread is already using polling to identify changes to config files on the host,
> it could just as well be using a new API to get the job done.
> 
> If we had to plan an interface without considering existing applications,
> I think it would look something like:
> 
> #define FAN_MARK_INODE             0x00000000
> #define FAN_MARK_MOUNT             0x00000010
> #define FAN_MARK_FILESYSTEM        0x00000100
> #define FAN_MARK_REMOTE_INODE      0x00001000
> #define FAN_MARK_REMOTE_FILESYSTEM 0x00001100
> 
> Then, the application can choose to add a remote mark with a certain
> event mask and a local mark with a different event mask without any ambiguity.
> The remote events could be tagged with a flag (turn reserved member of
> fanotify_event_metadata into flags).
> 
> We have made a lot of effort to make fanotify a superset of inotify
> functionality removing old UAPI mistakes and baggage along the way,
> so I'd really hate to see us going down the path of ambiguous UAPI
> again.

So there's a question: does the application care whether the event comes from
the local or remote side? I'd say generally no - a file change is simply a file
change and it does not matter who did it. Also again this API implies the
application has to be aware it runs on a filesystem that may generate
remote events to ask for them. But presumably a knowledgeable app can always
ask for local & remote events and if that fails (fs does not support remote
events), ask only for local. That's not a big burden.

The question is why would we make remote events explicit (and thus
complicate the API and usage) when generally apps cannot be bothered who
did the change. I suppose it makes implementation and some of the
consequences on the stream of events more obvious. Also, it allows
functionality such as permission or mount events, which are only local when
the server does not support them, which could be useful for "mostly local"
filesystems. Hum...

> IMO, the work that Ioannis has already done to support remote
> notifications with virtiofsd is perfectly valid as a private functionality
> of virtiofs, just like cifs CIFS_IOC_NOTIFY.
> 
> If we want to go further and implement a "remote notification subsystem"
> then I think we should take other fs (e.g. cifs) into consideration and think
> about functionality beyond the plain remote watch and create a UAPI
> that can be used to extend the functionality further.

So I agree that if we go for "remote notification subsystem", then we
better consider cases like other FUSE filesystems, cifs, or NFS. I haven't
yet made up my mind whether we cannot somehow seamlessly incorporate remote
events into fsnotify because it seems to me they are identical to local
events from app point of view. But I agree that if we cannot find a way to
integrate remote events without many rough edges, then it is better to make
remote events explicit.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs
  2021-11-11 17:30                                 ` Jan Kara
@ 2021-11-11 20:52                                   ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2021-11-11 20:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Ioannis Angelakopoulos, linux-fsdevel,
	virtio-fs-list, linux-kernel, Al Viro, Miklos Szeredi,
	Steve French

On Thu, Nov 11, 2021 at 7:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 10-11-21 08:28:13, Amir Goldstein wrote:
> > > > OK, so do I understand both you and Amir correctly that you think that
> > > > always relying on the FUSE server for generating the events and just piping
> > > > them to the client is not long-term viable design for FUSE? Mostly because
> > > > caching of modifications on the client is essentially inevitable and hence
> > > > generating events from the server would be unreliable (delayed too much)?
> >
> > This is one aspect, but we do not have to tie remote notifications to
> > the cache coherency problem that is more complicated.
> >
> > OTOH, if virtiofs takes the route of "remote groups", why not take it one step
> > further and implement "remote event queue".
> > Then, it does not need to push event notifications from the server
> > into fsnotify at all.
> > Instead, FUSE client can communicate with the server using ioctl or a new
> > command to implement new_group() that returns a special FUSE file and
> > use FUSE POLL/READ to check/read the server's event queue.
>
> Yes, that could work. But I don't see how you want to avoid pushing events
> to fsnotify on the client side. Suppose app creates fanotify group G, it
> adds a mark for some filesystem F1 to it, then it adds marks for another
> filesystem F2 - this time it is FUSE based fs. App wants to receive events
> for both F1 and F2 but events for F2 are actually coming from the server
> and somehow they have to get into G's queue for an app to read them.
>
> > There is already a precedent of this model with CIFS_IOC_NOTIFY
> > and SMB2_CHANGE_NOTIFY - SMB protocol, samba server and Windows server
> > support watching a directory or a subtree. I think it is a "oneshot" watch, so
> > there is no polling nor queue involved.
>
> OK, are you aiming at a completely separate notification API for userspace
> for FUSE-based filesystems?

Yes, if the group is "remote" then all the marks are "remote" and event
queue is also remote.

> Then I see why you don't need to push events
> to fsnotify but I don't quite like that for a few reasons:
>
> 1) Application has to be aware whether the filesystem it operates on is
> FUSE based or not. That is IMO an unnecessary burden for the applications.
>
> 2) If application wants to watch for several paths potentially on multiple
> filesystems, it would now need to somehow merge the event streams from FUSE
> API and {fa/i}notify API.
>
> 3) It would potentially duplicate quite some parts of fsnotify API
> handling.
>

I don't like it so much either, but virtiofs developers are free to pursue this
direction without involving the fsnotify subsystem (like cifs).
One can even implement userspace wrappers for inotify library functions
that know how to utilize remote cifs/fuse watches...

> > > So initial implementation could be about, application either gets local
> > > events or remote events (based on filesystem). Down the line more
> > > complicated modes can emerge where some combination of local and remote
> > > events could be generated and applications could specify it. That
> > > probably will be an extension of the fanotify/inotify API.
> >
> > There is one more problem with this approach.
> > We cannot silently change the behavior of existing FUSE filesystems.
> > What if a system has antivirus configured to scan access to virtiofs mounts?
> > Do you consider it reasonable that on system upgrade, the capability of
> > adding local watches would go away?
>
> I agree we have to be careful there. If fanotify / inotify is currently
> usable with virtiofs, just missing events coming from host changes (which
> are presumably rare for lots of setups), it is likely used by someone and
> we should not regress it.
>

I don't see why it wouldn't be usable.
I think we have to assume that someone is using inotify/fanotify
on virtiofs.

> > I understand the desire to have existing inotify applications work out of
> > the box to get remote notifications, but I have doubts if this goal is even
> > worth pursuing. Considering that the existing known use case described in this
> > thread is already using polling to identify changes to config files on the host,
> > it could just as well be using a new API to get the job done.
> >
> > If we had to plan an interface without considering existing applications,
> > I think it would look something like:
> >
> > #define FAN_MARK_INODE             0x00000000
> > #define FAN_MARK_MOUNT             0x00000010
> > #define FAN_MARK_FILESYSTEM        0x00000100
> > #define FAN_MARK_REMOTE_INODE      0x00001000
> > #define FAN_MARK_REMOTE_FILESYSTEM 0x00001100
> >
> > Then, the application can choose to add a remote mark with a certain
> > event mask and a local mark with a different event mask without any ambiguity.
> > The remote events could be tagged with a flag (turn reserved member of
> > fanotify_event_metadata into flags).
> >
> > We have made a lot of effort to make fanotify a superset of inotify
> > functionality removing old UAPI mistakes and baggage along the way,
> > so I'd really hate to see us going down the path of ambiguous UAPI
> > again.
>
> So there's a question: Does application care whether the event comes from
> local or remote side? I'd say generally no - a file change is simply a file
> change and it does not matter who did it. Also again this API implies the
> application has to be aware it runs on a filesystem that may generate
> remote events to ask for them. But presumably a knowledgeable app can always
> ask for local & remote events and if that fails (fs does not support remote
> events), ask only for local. That's not a big burden.
>
> The question is why would we make remote events explicit (and thus
> complicate the API and usage) when generally apps cannot be bothered who
> did the change. I suppose it makes implementation and some of the
> consequences on the stream of events more obvious. Also it allows
> functionality such as permission or mount events which are only local when
> the server does not support them which could be useful for "mostly local"
> filesystems. Hum...
>

Yeah, it is not clear at all what the best approach is.
Maybe the application would just need to specify FAN_REMOTE_EVENTS
in fanotify_init() to opt in to any new behavior, to avoid surprises, and not
be any more explicit than that for the common use case.

Thanks,
Amir.


Thread overview: 47+ messages
2021-10-25 20:46 [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Ioannis Angelakopoulos
2021-10-25 20:46 ` [RFC PATCH 1/7] FUSE: Add the fsnotify opcode and in/out structs to FUSE Ioannis Angelakopoulos
2021-10-26 14:56   ` Amir Goldstein
2021-10-26 18:28     ` Amir Goldstein
     [not found]     ` <CAO17o20+jiij64y7b3eKoCjG5b_mLZj6o1LSnZ7+8exN3dFYEg@mail.gmail.com>
2021-10-27  5:46       ` Amir Goldstein
2021-10-27 21:46     ` Vivek Goyal
2021-10-28  4:13       ` Amir Goldstein
2021-10-28 14:20         ` Vivek Goyal
2021-10-25 20:46 ` [RFC PATCH 2/7] FUSE: Add the remote inotify support capability " Ioannis Angelakopoulos
2021-10-25 20:46 ` [RFC PATCH 3/7] FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode operation Ioannis Angelakopoulos
2021-10-26 15:06   ` Amir Goldstein
2021-11-01 17:49     ` Vivek Goyal
2021-11-02  7:34       ` Amir Goldstein
2021-10-25 20:46 ` [RFC PATCH 4/7] FUSE: Add the fuse_fsnotify_send_request to FUSE Ioannis Angelakopoulos
2021-10-25 20:46 ` [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function Ioannis Angelakopoulos
2021-10-26 14:37   ` Amir Goldstein
2021-10-26 15:38     ` Vivek Goyal
2021-10-25 20:46 ` [RFC PATCH 6/7] FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation Ioannis Angelakopoulos
2021-10-25 20:46 ` [RFC PATCH 7/7] virtiofs: Add support for handling the remote fsnotify notifications Ioannis Angelakopoulos
2021-10-26 15:23 ` [RFC PATCH 0/7] Inotify support in FUSE and virtiofs Amir Goldstein
2021-10-26 15:52   ` Vivek Goyal
2021-10-26 18:19     ` Amir Goldstein
2021-10-26 16:18   ` Vivek Goyal
2021-10-26 17:59     ` Amir Goldstein
2021-10-26 18:27       ` Vivek Goyal
2021-10-26 19:04         ` Amir Goldstein
     [not found]         ` <CAO17o20sdKAWQN6w7Oe0Ze06qcK+J=6rrmA_aWGnY__MRVDCKw@mail.gmail.com>
2021-10-27  5:59           ` Amir Goldstein
2021-10-27 13:23             ` Jan Kara
2021-10-27 20:29               ` Vivek Goyal
2021-10-27 20:37                 ` Vivek Goyal
2021-11-02 11:09                 ` Jan Kara
2021-11-02 12:54                   ` Amir Goldstein
2021-11-02 20:34                     ` Vivek Goyal
2021-11-03  7:31                       ` Amir Goldstein
2021-11-03 22:29                         ` Vivek Goyal
2021-11-04  5:19                           ` Amir Goldstein
2021-11-03 10:09                     ` Jan Kara
2021-11-03 11:17                       ` Amir Goldstein
2021-11-03 22:36                         ` Vivek Goyal
2021-11-04  5:29                           ` Amir Goldstein
2021-11-04 10:03                           ` Jan Kara
2021-11-05 14:30                             ` Vivek Goyal
2021-11-10  6:28                               ` Amir Goldstein
2021-11-11 17:30                                 ` Jan Kara
2021-11-11 20:52                                   ` Amir Goldstein
2021-10-27 20:24             ` Vivek Goyal
2021-10-28  5:11               ` Amir Goldstein
